Amazon’s Elastic Compute Cloud (EC2) is one of the most popular products on Amazon Web Services (AWS), used by 84% of companies on AWS according to 2nd Watch’s AWS Scorecard. Part one of this blog series described the top 12 challenges of monitoring Amazon EC2 when dealing with larger scale production deployments.
Amazon CloudWatch is the default monitoring tool for AWS services, but its limitations make it insufficient for most organizations trying to monitor EC2 at any kind of scale.
First, it only provides infrastructure-level metrics like CPU utilization, disk, and network activity; it offers no insight into your application layer. Servers are often part of complex systems, and you’ll want to correlate infrastructure metrics with application metrics, or applications with each other.
Second, CloudWatch only gives you two weeks of retention for your metrics data. It’s helpful to look back over several weeks or months to see changes that happen over time and to put patterns into perspective across deployments and system changes.
Third, CloudWatch only offers the ability to create simple dashboard widgets with a single metric or to set alarms with simple static thresholds. This is a good starting point for simple systems, but more complex systems and larger teams usually need more advanced analytical power. Analytics allows them to predict problems before they occur using calculated fields or to cut down on annoying false-positive alerts using dynamic thresholds.
Splunk Infrastructure Monitoring provides real-time cloud monitoring and intelligent alerting for all the services across your modern stack. It performs analytics on metrics as they stream from EC2, plus any custom metrics you designate, aggregated with metrics from the rest of your cloud infrastructure and services in your environment, with 13-month retention to see more changes over time.
A built-in EC2 dashboard significantly cuts down on your time to insight so you can start monitoring useful metrics right away without extra maintenance or configuration. Instant visibility gives you a great starting point for exploration and customization with metrics that matter to you. Here’s a snapshot of Splunk Infrastructure Monitoring's out-of-the-box dashboard for Amazon EC2:
It’s important to keep track of the instances you deployed in AWS to ensure system availability, performance, and cost-effectiveness.
Keep an eye on the number of hosts, especially if there’s recently been a big change. If you have too few hosts, you might be losing fault tolerance or even availability of services. If you have too many hosts, your AWS bills will be unnecessarily high. Big changes over time could indicate problems with auto-scaling scripts that have added or removed instances unexpectedly. It’s a good idea to set an alert to fire if the number of instances goes beyond a normal range.
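If you want to script this check outside your monitoring tool, a quick sketch with boto3 shows the idea (it assumes AWS credentials are already configured, and the min/max bounds are hypothetical placeholders for your own normal range):

```python
import boto3

# Hypothetical bounds for what a "normal" fleet size looks like for you.
MIN_INSTANCES, MAX_INSTANCES = 5, 50

ec2 = boto3.client("ec2")

# Count running instances across all reservations.
running = 0
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        running += len(reservation["Instances"])

if not MIN_INSTANCES <= running <= MAX_INSTANCES:
    print(f"ALERT: {running} running instances is outside the normal range")
```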
Increasing the size of instances is one of the easiest ways to deal with capacity problems. Unfortunately, it comes with a cost and is easily forgotten when a temporary demand goes away. This can leave your team paying big bucks for large instances. Keep an eye on these instance types to make sure you’re not overspending.
A good fault-tolerant architecture will include multiple availability zones so that, even if one zone goes down, the other zones can take up the slack. It’s a best practice to make sure your instances are distributed across availability zones.
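One way to spot-check both concerns at once, oversized instance types and availability zone skew, is to tally running instances by type and by zone. Here’s a sketch along those lines, again assuming boto3 with configured credentials:

```python
from collections import Counter

import boto3

ec2 = boto3.client("ec2")
by_type, by_az = Counter(), Counter()

# Tally running instances by instance type and availability zone.
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            by_type[instance["InstanceType"]] += 1
            by_az[instance["Placement"]["AvailabilityZone"]] += 1

print("by instance type:", dict(by_type))    # large types lingering?
print("by availability zone:", dict(by_az))  # skew toward one zone?
```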
CPU usage is one of the most important performance metrics because the CPU tends to hit a limit before other resources. Additionally, an exhausted CPU can make your service slow or unavailable.
When you have dozens or hundreds of hosts, it’s nice to see the distribution of CPU usage across all of them in a single chart. The maximum is shown in dark pink, and you can see in the screenshot above that at least one server is pegged around 100% CPU. There is likely a problem with the server that needs to be fixed. You can also see P90 in light pink, and median in purple. Since the median in this chart is low, it shows that we have extra capacity on most servers.
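If you’d like to reproduce this kind of distribution yourself, you could pull per-instance CPU from CloudWatch and compute the maximum, P90, and median across hosts. The sketch below does that with boto3; the nearest-rank P90 is a deliberate simplification:

```python
import statistics
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=15)

# Gather the most recent average CPU reading for every running instance.
cpu_values = []
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            points = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                StartTime=start,
                EndTime=end,
                Period=300,
                Statistics=["Average"],
            )["Datapoints"]
            if points:
                latest = max(points, key=lambda p: p["Timestamp"])
                cpu_values.append(latest["Average"])

if cpu_values:
    cpu_values.sort()
    p90 = cpu_values[int(0.9 * (len(cpu_values) - 1))]  # nearest-rank P90
    print(f"max={cpu_values[-1]:.1f}%  P90={p90:.1f}%  "
          f"median={statistics.median(cpu_values):.1f}%")
```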
Instances that are using near 100% CPU are likely near their capacity limits and may even be unresponsive or slow. This could be due to a bug in the code causing services or daemons to spin, temporary batch jobs, or even spikes in demand. If you expect these types of issues to continue, you will probably want to provision more capacity.
Narrowing down which instances use the most CPU, broken out by image, may help you identify the cause of the problem faster. If you recently upgraded to a new image, you can see whether that image has a different performance pattern. Additionally, if your organization uses custom images based on service or application types, it will help you determine the root cause at an application or service level.
If you run an application that makes heavy use of disk I/O, you’ll want to keep a close eye on your disk metrics. Each volume has a limit on the number of I/O operations per second (IOPS) and on throughput (bytes/sec). Once you hit those limits, your application performance will start to degrade. Additionally, each volume has a fixed size, but disk space usage is not tracked in CloudWatch. You might consider adding a collectd agent to track disk space usage.
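As a stand-in for the collectd df plugin, here’s a minimal sketch using the psutil library to report per-filesystem disk space (psutil is my choice for illustration, not a requirement):

```python
import psutil  # system-metrics library; reports data similar to collectd's df plugin

# Report space usage for each mounted filesystem, since CloudWatch
# does not surface disk space usage for EBS volumes on its own.
for partition in psutil.disk_partitions():
    usage = psutil.disk_usage(partition.mountpoint)
    print(f"{partition.mountpoint}: {usage.percent:.1f}% used "
          f"({usage.used / 1e9:.1f} of {usage.total / 1e9:.1f} GB)")
```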
This chart shows the number of read operations in orange and write operations in blue. If your application performs a large number of operations per minute, you may want to consider an SSD volume or provisioned IOPS. They are better suited to large quantities of random I/O and have fast read times.
If your application needs high data throughput, such as for reading and writing large files, you might want to consider a throughput-optimized magnetic volume (HDD). These volumes offer higher burst throughput than general-purpose SSDs and support larger block sizes, making them well suited to large, sequential I/O.
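To check how close a volume is running to its operations limit, you could pull the raw counts from CloudWatch’s AWS/EBS namespace. A sketch with boto3 follows; the volume ID is a hypothetical placeholder:

```python
from datetime import datetime, timedelta, timezone

import boto3

VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical EBS volume ID

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

for metric in ("VolumeReadOps", "VolumeWriteOps"):
    points = cw.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric,
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,  # standard EBS metrics arrive in 5-minute periods
        Statistics=["Sum"],
    )["Datapoints"]
    if points:
        # Convert the busiest 5-minute total into operations per minute.
        peak_per_minute = max(p["Sum"] for p in points) / 5
        print(f"{metric}: peak ~{peak_per_minute:.0f} ops/minute in the last hour")
```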
It’s useful to see the change in disk usage today versus the same time yesterday. It can help you determine if unexpected changes in disk usage are due to changes made over the past day, such as new code deployments or changes in user demand. If your application has a steady baseline usage, you may want to alert if it exceeds a certain threshold.
Network bandwidth is another constrained resource on EC2 instances. Each instance type has a different bandwidth limit, with larger instances having higher limits. If your server has high network throughput and is not constrained on other resources, it may be constrained on the network. Also, keep in mind that EBS-backed volumes and snapshot operations consume network bandwidth as well.
These charts show a percentile distribution across your instances by network bytes per minute. The top chart shows network bytes in, and the bottom chart shows network bytes out. The dark pink is the maximum, and you can see that at least one instance is sending over 1 GB per minute. The light pink is P90, and the purple is the median. The spread between the maximum and the median is very high, indicating that a small number of instances is using much more bandwidth than the rest.
It’s important to see which instances are using the most bandwidth so you can determine if there’s a problem on one of those instances. You might expect to find high bandwidth usage on a web server, but probably much less on an LDAP server, especially in a smaller company. In fact, high network bandwidth going to your LDAP server could indicate a security problem. Additionally, if you’re looking to reduce your network usage, this gives you a good place to start.
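Here’s one way to build a quick “top talkers” list yourself: sum NetworkOut per instance over a recent window and sort. The 30-minute window and top-five cutoff below are arbitrary choices:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

# Total bytes sent per instance over the last half hour.
bytes_out = {}
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            points = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="NetworkOut",
                Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                StartTime=start,
                EndTime=end,
                Period=1800,  # one datapoint covering the whole window
                Statistics=["Sum"],
            )["Datapoints"]
            if points:
                bytes_out[instance["InstanceId"]] = points[0]["Sum"]

# Print the five instances sending the most traffic.
for instance_id, total in sorted(bytes_out.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{instance_id}: {total / 1e9:.2f} GB out")
```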
Here, we can see the total network usage over the last hour in blue, as well as a red trend line showing the percentage change versus the same time a day ago. This gives you a good idea whether the changes in traffic are spiky or steady. You can also see whether it’s consistent with past usage. Large swings in percentage might indicate big changes in traffic or a problem that needs to be fixed.
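That day-over-day comparison can also be scripted: query the current window and the same window 24 hours earlier, then compute the percentage change. A sketch with boto3, using a hypothetical instance ID:

```python
from datetime import datetime, timedelta, timezone

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

cw = boto3.client("cloudwatch")

def network_in_last_hour(end):
    """Total NetworkIn bytes for the hour ending at `end`."""
    points = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkIn",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )["Datapoints"]
    return points[0]["Sum"] if points else 0.0

now = datetime.now(timezone.utc)
today = network_in_last_hour(now)
yesterday = network_in_last_hour(now - timedelta(days=1))

if yesterday:
    change = (today - yesterday) / yesterday * 100
    print(f"NetworkIn vs. same hour yesterday: {change:+.1f}%")
```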
Splunk Infrastructure Monitoring also collects additional metrics from CloudWatch, including the number of CPU burst credits available. When you have spikes or temporary needs for CPU processing, you can use your burst credits to process the data quickly. If you need more CPU on a steady-state basis, you will need to increase your instance size. We also track the number of network packets read and written over the network interface.
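If you’d like a simple early warning before burst credits run out, you could poll CPUCreditBalance directly (it is only reported for burstable instance families). The instance ID and credit floor below are hypothetical:

```python
from datetime import datetime, timedelta, timezone

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical burstable (t2/t3) instance
MIN_CREDITS = 50  # hypothetical floor before we want to know about it

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

points = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=end - timedelta(minutes=15),
    EndTime=end,
    Period=300,
    Statistics=["Average"],
)["Datapoints"]

if points:
    balance = max(points, key=lambda p: p["Timestamp"])["Average"]
    if balance < MIN_CREDITS:
        print(f"WARNING: only {balance:.0f} CPU credits left on {INSTANCE_ID}")
```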
One of the most valuable sources of additional metrics is from the Splunk Infrastructure Monitoring collectd agent. This agent offers a variety of plugins that can track memory usage, page faults, CPU steal time, disk space, process statistics, and more. Memory, in particular, is one of the most constrained resources on servers, so it’s important to have visibility when you are hitting a limit. Check out some of the metrics collectd offers below:
SignalFlow is the advanced analytics engine that allows you to take standard metrics like the ones shown above and derive new, more intelligent signals for monitoring your systems and applications. It allows you to calculate new fields, identify trends before they affect your user experience, compare different moments in time, and more. Furthermore, SignalFlow performs these calculations in real time so that your dashboards and alerts are timely and actionable.
For example, collectd’s memory plugin lets you report memory usage either as an absolute number or as a percentage. But what if you want both? Percentages are useful when you want to see whether a server is running below its maximum capacity or is at risk of running out of memory. If you know exactly how much memory each application uses, absolute numbers will help you determine how many more applications can fit on the server. You can also determine how much memory to add when upgrading the server.
Better yet, get the best of both worlds by calculating the percentage of memory used with SignalFlow analytics. The recording below shows the chart configuration screen I used to set up the calculation. Line A is the memory used for each host, line B is the memory that’s free, and line C is the percentage of memory used, calculated as A/(A+B)*100.
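For readers who want to see the arithmetic outside the chart builder, here is the same A/(A+B)*100 calculation in plain Python, using psutil as a stand-in source for the used and free values that collectd would report:

```python
import psutil  # system-metrics library, standing in for collectd here

mem = psutil.virtual_memory()

# Mirror the SignalFlow calculation: A = memory used, B = memory free,
# percent used = A / (A + B) * 100
used_pct = mem.used / (mem.used + mem.free) * 100
print(f"memory used: {used_pct:.1f}%")

# The same signal could back a detector that fires above a threshold:
if used_pct > 90:
    print("ALERT: memory usage above 90%")
```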
I’m visualizing this data in two ways. The first is a line chart showing me the percentage of memory used across all my servers. This helps me get a quick visual of the distribution of memory usage across my servers and how it changes over time. The next is a list that shows me the hosts that have the top percentage of memory used. The hostname is contained in the name of the metric shown in the third column of the table below.
I can see that my EC2 servers kicked off some integration tests around 2pm yesterday. From that point they used more and more memory, and now over 75% is used on some servers. By clicking the alarm bell icon in the dashboard above, I can instantly choose an alert from a menu of Recommended Detectors, built in and optimized for my specific environment, that will notify me when the percentage crosses a threshold, such as 90%. This will help me be more proactive in the future so I can manage memory usage before it impacts availability.
This is only one example of what you can do with SignalFlow analytics for Amazon EC2. You can create and customize any number of calculations on time series data aggregated across your services and see and alert on the results streaming in real time. Our integration with CloudWatch makes it so easy that it only takes a few minutes to try out, so get started now!
Thanks,
Ryan Goldman