Today, much of our online world is powered by cloud computing, and Amazon Web Services offers an amazing depth and breadth of available services. Most of the time, though, it all starts with Amazon Elastic Compute Cloud (EC2).
EC2 provides virtual servers called instances and lets users provision scalable compute capacity as desired. This means no server hardware investment and the ability to scale up or down in response to demand (hence elastic).
Launching an instance requires an Amazon Machine Image (AMI). Amazon provides a number of pre-built AMIs, or you can create your own. A single AMI can be used to launch one or many instances as desired.
Each AMI must include:
- A template for the root volume (for example, an operating system, an application server, and applications)
- Launch permissions that control which AWS accounts can use the AMI to launch instances
- A block device mapping that specifies the volumes to attach to the instance when it's launched
Since EC2 instances can be used to increase or decrease resource capacity, you can match demand. And since they can be spun up in different geographical locales worldwide (Regions in AWS terms), they can be tailored to meet regional needs. EC2 also ties into other Amazon services, from storage with EBS to containers with ECS or EKS.
But all this abstraction means you’re a step (or two) further away from the work. This means monitoring plays a major role in your understanding of just what EC2 is doing and how your applications are performing, both for your customers and for your AWS costs.
Monitoring is of course important to the reliability, availability, and performance of your EC2 instances. But before we dive into the metrics to monitor, let’s talk about building your monitoring plan.
Monitoring plans aren't unique to AWS, but they help you establish your goals and identify success criteria. A monitoring plan should address at least these questions:
- What are your monitoring goals?
- Which resources will you monitor?
- How often will you monitor them?
- Which monitoring tools will you use?
- Who will perform the monitoring tasks?
- Who should be notified when something goes wrong?
Of course, there are other considerations, including the more detailed technical aspects of our infrastructure and applications, which we will cover next.
Monitoring EC2 is a lot like monitoring any computing environment: it's infrastructure monitoring with an elastic twist. Generally, our EC2 metrics will fall into one of three categories: CPU, network, and storage (disk I/O).
You'll likely find the USE (Utilization, Saturation, Errors) monitoring method to be of value, as comparing utilization against saturation can give you a clearer picture of how your overall resources match demand. You can also use automated tools from Amazon to watch your EC2 environments and let you know when something is wrong.
There are several ways to gather your metrics from EC2. Amazon CloudWatch is the best-known option, though it limits how often data is reported (every five minutes with basic monitoring, every minute with detailed monitoring). Selected metrics are available through CloudWatch Streaming Metrics, which removes these limitations. Finally, using an agent like the Smart Agent will give you even finer resolution, along with an improved scope of coverage and control. Of course, you can only use the Smart Agent where you have direct control over the software installed on the instance. While EC2 certainly gives you that access, other AWS services only report via CloudWatch, so your best practice would be to use both.
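As a sketch of the CloudWatch path, the snippet below uses boto3 (the AWS SDK for Python) to pull recent CPUUtilization datapoints for an instance. The instance ID is a placeholder, and `average_of` is just an illustrative helper for summarizing the response, not part of any AWS API:

```python
from datetime import datetime, timedelta, timezone

def fetch_cpu_utilization(instance_id, hours=1, period=300):
    """Fetch average CPUUtilization datapoints for one instance.

    Requires AWS credentials; instance_id is a placeholder value.
    """
    import boto3  # imported here so the pure helper below works without the SDK
    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=period,          # basic monitoring reports at 5-minute resolution
        Statistics=["Average"],
    )
    return resp["Datapoints"]

def average_of(datapoints, stat="Average"):
    """Summarize a list of CloudWatch datapoints into one number."""
    values = [dp[stat] for dp in datapoints]
    return sum(values) / len(values) if values else 0.0
```

The same pattern works for any of the EC2 metrics discussed below; only the `MetricName` changes.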
As you know, EC2 instances come in a very wide range of vCPU configurations. Tracking your utilization can help you ensure your instances are the right size and scope for your workload. It's important to remember that the reported metric is for your virtual machine, not the underlying physical device. Some configurations are designed for burstable behavior, meaning they can temporarily spike to deliver more computing power based on earned CPU credits.
The primary CPU metric of interest is CPUUtilization, which reports the percentage of CPU in use. It can be normal to run at a high utilization rate – in fact, you probably want to use the computing power efficiently. But spikes and dips happen, whether due to seasonality or to problematic occurrences. Here's where our baseline comes in, so we have some concept of normal behavior and can plan for expansion, whether by resizing our instance or making use of the burstable capability of certain instances. We can also measure CPUCreditUsage, CPUCreditBalance, CPUSurplusCreditBalance, and CPUSurplusCreditsCharged.
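For burstable instances, a steadily falling CPUCreditBalance is the classic sign that a workload has outgrown its baseline. A minimal sketch of that check, operating on balance values already pulled from CloudWatch (the low-water mark is an illustrative choice, not an AWS default):

```python
def credits_exhausting(balances, low_water_mark=25.0):
    """Return True if the CPU credit balance is both trending down and
    below a low-water mark, suggesting the instance may need resizing.

    balances: CPUCreditBalance averages in chronological order.
    low_water_mark: illustrative threshold, not an AWS default.
    """
    if len(balances) < 2:
        return False
    trending_down = balances[-1] < balances[0]
    return trending_down and balances[-1] < low_water_mark
```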
EC2 storage comes in two basic forms: EBS (Elastic Block Store) volumes and instance store. Since instance store is ephemeral and not all instances support it, we'll start with EBS. Like most things in AWS, EBS storage types can vary widely: you can use solid-state drives or traditional hard drives, and volumes vary by number, capacity, and performance. Monitoring your typical and peak usage can help make sure that your chosen volume type delivers the IOPS and throughput you need. It is important to note that the disk metrics CloudWatch reports directly for an instance cover only instance store volumes, with the exception of a couple of instance types. You'll need to grab EBS disk I/O via the Smart Agent or from the CloudWatch EBS metrics.
Our primary metric is total IOPS, but with so many variants we may want to drill into specifics. We may want to understand write behavior (EBSWriteOps and EBSWriteBytes) or reads (EBSReadOps and EBSReadBytes). Certain smaller instances also report EBSIOBalance% and EBSByteBalance%, which relate to the burst balance and can help you correctly size your instance. Instance types such as C5 and M5 include these EBS metrics in CloudWatch, and they are always available via the Smart Agent.
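Since EBSReadOps and EBSWriteOps arrive as operation counts per reporting period, turning them into an IOPS figure is just a sum divided by the period length in seconds. A small sketch of that conversion:

```python
def total_iops(read_ops, write_ops, period_seconds=300):
    """Convert per-period read/write operation counts into average IOPS.

    read_ops, write_ops: operation counts for one reporting period
    (e.g. CloudWatch EBSReadOps / EBSWriteOps Sum values).
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return (read_ops + write_ops) / period_seconds
```

For example, 90,000 reads plus 60,000 writes over a 5-minute period averages out to 500 IOPS.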
Instance store data is similarly structured, with DiskWriteBytes and DiskWriteOps for the write side.
DiskReadBytes and DiskReadOps supply information for the instance store read operations.
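Under the hood, an agent derives rates like these by differencing two samples of cumulative OS counters. A minimal sketch of that delta computation (the counter names below mirror the CloudWatch metrics for illustration, not any specific agent's internals):

```python
def io_rates(prev, curr, interval_seconds):
    """Compute per-second disk I/O rates from two cumulative samples.

    prev, curr: dicts of cumulative counters, e.g.
        {"DiskReadOps": ..., "DiskWriteOps": ...,
         "DiskReadBytes": ..., "DiskWriteBytes": ...}
    interval_seconds: time elapsed between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {name: (curr[name] - prev[name]) / interval_seconds for name in curr}
```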
In all cases, you need to pay attention to your storage use, particularly EBS, as losing data is never a good thing.
You would expect networks to be pretty important in a cloud world, and you'd be right. Not only do all of our communications travel over the network, but our storage (EBS) also relies on it. All of the services you use are equally dependent on networks and can even cross availability zones. EC2 instances vary in their network bandwidth limits as well as their maximum transmission unit (MTU). Monitoring your network will help you optimize your network performance and detect and respond to issues.
Here our principal metrics of interest are NetworkPacketsOut and NetworkOut, along with NetworkPacketsIn and NetworkIn. These report the total number of packets and bytes, respectively, across all of the instance's network interfaces. Using the Smart Agent will give you the ability to dive into each network interface individually.
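Because NetworkIn and NetworkOut report bytes per period, a quick conversion gives you throughput in a more familiar unit. A sketch:

```python
def throughput_mbps(bytes_in_period, period_seconds=300):
    """Convert a NetworkIn/NetworkOut byte count for one reporting
    period into average megabits per second."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return bytes_in_period * 8 / period_seconds / 1_000_000
```

Comparing that figure against your instance type's documented bandwidth limit tells you how close to saturation you are running.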
With this overview data, you can determine whether you have network problems (potentially causing slow-consumer issues) or unexpected loads.
Now that you've seen some of the things you can monitor in EC2, you can get started on your own monitoring approach. Start by establishing a baseline for your system, since you need to know what normal looks like before you can flag outliers. Monitor crucial metrics like CPU utilization, network utilization, and disk I/O across your instances, and store a history of those metrics. You will also need to grab certain metrics that depend on your operating system choice, and you may want to collect log files and extract data from those as well.
Remember to track your metrics across various times and workloads. When establishing a baseline, there is no such thing as too much data.
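One simple way to turn a metric history into a baseline is to compute its mean and standard deviation, then flag values more than a few standard deviations out. A sketch using only the standard library (the 3-sigma default is an illustrative choice):

```python
import statistics

def build_baseline(history):
    """Summarize a metric history as (mean, standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)

def is_outlier(value, mean, stdev, sigmas=3.0):
    """Flag a value more than `sigmas` standard deviations from the mean."""
    return abs(value - mean) > sigmas * stdev
```

Real workloads are often seasonal, so in practice you'd build separate baselines per time-of-day or day-of-week rather than one global pair of numbers.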
Once you have a solid grasp of your baseline, you can set up detectors to alert on anomalies and plan practices to address them. You’ll also be able to understand your cost for use patterns, always useful in cloud environments.
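In CloudWatch terms, a detector is an alarm. The sketch below builds the arguments you would pass to boto3's `put_metric_alarm` for a high-CPU alarm; the alarm name pattern, threshold, and SNS topic ARN are placeholders to adapt to your own baseline:

```python
def cpu_alarm_args(instance_id, threshold=90.0, sns_topic_arn=None):
    """Build keyword arguments for cloudwatch.put_metric_alarm().

    instance_id and sns_topic_arn are placeholders; threshold is an
    illustrative choice informed by your baseline, not an AWS default.
    """
    args = {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # 5-minute evaluation window
        "EvaluationPeriods": 3,     # 3 consecutive breaches before alarming
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        args["AlarmActions"] = [sns_topic_arn]
    return args

# To create the alarm (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_args("i-0123456789abcdef0"))
```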
There are a lot of resources you can check out for more information on monitoring EC2.
But it doesn’t stop here. You can find out more about how Splunk Observability works with AWS to get you meaningful insights.
And get started with a free trial and start monitoring EC2 instances with Splunk Infrastructure Monitoring today.
----------------------------------------------------
Thanks!
Dave McAllister
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. Splunk offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.