On our cloud-native journey, we live in a containerized world. Our environments are containers, managed by orchestrators and running on some form of compute cluster. Of course, that means you're also responsible for managing all of those bits, right?
Well, not entirely. Thanks to newer technologies like AWS Fargate, you're not on the hook for much of the underlying infrastructure. You don't need to provision, configure or scale the VMs that run your containers: you simply tell Fargate what you want and it sets it up for you, with no more worrying about capacity planning and other operational chores. Much as with serverless (AWS Lambda), the servers are abstracted away, and you can run in a hybrid model where part of your workload stays on traditional EC2 and part moves to Fargate.
But how do you make sure that what is going on under the surface maps to the performance and issues in your app?
Monitoring Fargate lets you understand how your containerized apps are performing. And since in the Fargate model you pay for what you use, keeping track of your Fargate deployments also makes solid financial sense.
Fargate works as a launch type for Amazon ECS, the AWS service that manages the configuration and lifecycle of your containers. Both run in the cloud and tie into many other AWS services. Linux containers can launch on Fargate resources, on your own EC2 instances, or on both, while Windows containers are limited to EC2. ECS groups containers into tasks, and each task has a definition that sets its details, including the CPU and memory required, among other options. Keep in mind that you must specify CPU and memory in your task definition and that they have to align to one of the valid Fargate combinations, or you'll get an error back.
You can also define other options that give you control over how resources such as memory are shared. For instance, memoryReservation specifies the minimum memory for the container and memory specifies the maximum. Containers can grow beyond their minimum if additional memory is available, but should a container try to use more memory than the maximum allows, ECS will terminate it. Similar rules exist for CPU: a minimum reservation, plus the ability to use additional CPU resources if available. When setting allocations, it's worth keeping in mind that ECS allocates CPU as CPU units (shares of a virtual CPU), where each vCPU equals 1,024 CPU units.
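To make that concrete, here's a minimal sketch of registering a Fargate task definition with boto3. The family name, container image and execution role ARN are placeholders, and the 512-CPU-unit / 1 GB pairing is one of the valid Fargate combinations mentioned above.

```python
import boto3  # assumes AWS credentials and region are already configured

ecs = boto3.client("ecs")

# Register a Fargate-compatible task definition.
ecs.register_task_definition(
    family="web-demo",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",   # Fargate tasks must use awsvpc networking
    cpu="512",              # task-level CPU units (0.5 vCPU)
    memory="1024",          # task-level memory in MiB
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "web",
            "image": "public.ecr.aws/nginx/nginx:latest",  # placeholder image
            "essential": True,
            "memoryReservation": 512,  # soft minimum for the container (MiB)
            "memory": 1024,            # hard maximum; exceeding it gets the container killed
            "portMappings": [{"containerPort": 80}],
        }
    ],
)
```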
Similarly, Fargate works with Kubernetes (EKS in this discussion). Since EKS is Kubernetes-conformant, moving from any Kubernetes orchestration to EKS should be painless. However, EKS terminology differs from ECS: a cluster is the group of instances that share the same Kubernetes API server, a pod contains one or more containers, and a deployment defines a collection of pods. There are similar definition items (memory, CPU, IP address, etc.) that can enforce limits on the resources available to pods and containers, and thus affect your applications.
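On the EKS side, the same idea is expressed as resource requests and limits in the pod spec. Here's a minimal sketch using the official Kubernetes Python client; the deployment name, image and namespace are placeholders, and on Fargate the namespace (and any labels) must match one of your Fargate profiles for the pods to land there.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes your kubeconfig already points at the EKS cluster

container = client.V1Container(
    name="web",
    image="public.ecr.aws/nginx/nginx:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "250m", "memory": "512Mi"},  # minimum the container asks for
        limits={"cpu": "500m", "memory": "1Gi"},      # hard ceiling for the container
    ),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web", labels={"app": "web"}),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

# Create the deployment in a namespace covered by a Fargate profile.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```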
However, if you are used to using DaemonSets, you are out of luck. Fargate does not support DaemonSets.
So we have lots of options to get things running and they can cross from EC2 to Fargate. But what can we measure to make sure Fargate is delivering the goods for us?
Monitoring Fargate is relatively straightforward. We need to monitor the resources we've defined to make sure we aren't overprovisioned or being starved for resources. After all, nothing is quite as painful as watching a great app slow to a crawl under memory pressure, whether from underprovisioning or from noisy neighbors. That focuses us on memory and CPU for each task or pod. However, we also need to understand our costs, which are based on the number of tasks (or pods) run, the amount of time they run and the resources allocated to each. AWS publishes detailed pricing information to help you understand Fargate pricing, and our monitoring will help you determine whether you can optimize your spending.
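For a rough sense of how those cost levers interact, here's a back-of-the-envelope sketch. The per-hour rates are illustrative only (borrowed from a past us-east-1 price list); check the AWS Fargate pricing page for your region's current numbers.

```python
# Back-of-envelope Fargate cost estimate (illustrative rates only).
VCPU_PER_HOUR = 0.04048   # example us-east-1 rate, USD per vCPU-hour
GB_PER_HOUR = 0.004445    # example us-east-1 rate, USD per GB-hour

def monthly_cost(tasks: int, vcpu: float, memory_gb: float, hours: float = 730) -> float:
    """Cost of running `tasks` identical tasks for `hours` hours each."""
    return tasks * hours * (vcpu * VCPU_PER_HOUR + memory_gb * GB_PER_HOUR)

# e.g. 10 tasks, each 0.5 vCPU / 1 GB, running all month
print(f"${monthly_cost(10, 0.5, 1.0):.2f}")  # roughly $180 at the example rates
```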
Let’s start by considering modern monitoring practices, USE and RED. RED, Rate-Errors-Duration, is focused on applications, particularly on applications that are made up of multiple services. USE, Utilization-Saturation-Errors, is focused more on the infrastructure side. While it is strongly recommended to have both methods in hand, USE is our focus point for Fargate (ECS, Kubernetes) monitoring.
In general, our monitoring will focus on metrics from three major categories: CPU, memory, and task/pod counts, plus a catch-all for the services around them.
As you recall, we defined CPU resources for our tasks and pods. Since the initial instantiation is based on our reserved minimums and (hopefully) has room to grow, it's important to monitor and make sure we're getting what we expect. Our CPU metrics of interest are:
| Name | Description | Metric | Target |
| --- | --- | --- | --- |
| ECS CPU utilization | # of CPU units currently in use | Utilization | |
| ECS CPU reservation | # of CPU units reserved for the running tasks | Variable for saturation calculation | Splunk Infrastructure Monitoring |
| EKS CPU utilization | % of available Fargate compute resources in use | Utilization | |
| EKS CPU request | The requested CPU (units) for each container | Utilization | Splunk Kubernetes Navigator |
| EKS CPU limit | The maximum (limit) CPU (units) for each container | Variable for saturation calculation | Splunk Kubernetes Navigator |
| EKS CPU allocatable | The amount of allocatable CPU (cores) available | Variable for saturation calculation | Splunk Kubernetes Navigator |
Of these metrics, our focus should be firmly on CPU utilization, to help ensure that spikes don’t breach your hard CPU limits. Depending on your specific risk factors, set your high alert to inform you of static breaches of 80% (or higher). Missing a 100% breach can have consequences that last long beyond the immediate breach, so a full-fidelity monitoring system is your friend here. You might also want to set a sudden change alert, making use of AI/ML capabilities to help with those factors that aren’t known (yet) but can cause your container’s CPU use to deviate in unexpected ways.
Likewise, it's worth setting some form of monitoring and/or alerting on unusually low utilization. If your workloads sit under 50% consistently, or show seasonal behavior of under 50%, consider revisiting your definitions to better match your workload. After all, you really don't want to pay for unused resources, but be sure to take into account all the potential activity you might face.
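As one way to wire this up, here's a sketch of a high and a low CPU alert using CloudWatch alarms via boto3 (a Splunk Observability detector would serve the same purpose). The cluster and service names are placeholders, and the thresholds and periods are just the starting points discussed above.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# High-utilization alert: CPU above 80% for three consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="fargate-service-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},   # placeholder names
        {"Name": "ServiceName", "Value": "my-service"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmDescription="Fargate service CPU above 80%",
)

# Low-utilization alert: consistently under 50% may mean the task is oversized.
cloudwatch.put_metric_alarm(
    AlarmName="fargate-service-cpu-low",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},
        {"Name": "ServiceName", "Value": "my-service"},
    ],
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=24,
    Threshold=50.0,
    ComparisonOperator="LessThanThreshold",
    AlarmDescription="Fargate service CPU under 50% for a full day",
)
```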
Remember that we also had to specify memory requirements for our containers and pods. It's a fine line to walk between making sure we have enough memory (and headroom) to run our workloads efficiently, and overprovisioning and thereby raising our costs. Keeping an eye on memory usage (with a bit of alerting assistance) will help us understand usage and potential impact.
Our memory metrics of interest are:
| Name | Description | Metric | Target |
| --- | --- | --- | --- |
| ECS memory utilization | Memory in bytes currently in use | Utilization | |
| ECS memory reservation | Memory in bytes reserved for the running tasks | Variable for saturation calculation | Splunk Infrastructure Monitoring |
| EKS memory utilization | % of available Fargate memory in use | Utilization | |
| EKS memory request | The requested memory (bytes) for each container | Utilization | Splunk Kubernetes Navigator |
| EKS memory limit | The maximum (limit) memory (bytes) for each container | Variable for saturation calculation | Splunk Kubernetes Navigator |
| EKS memory allocatable | The amount of allocatable memory (bytes) available | Variable for saturation calculation | Splunk Kubernetes Navigator |
Similar to CPU, we are most focused on memory utilization. This is especially important because exceeding 100% memory utilization will lead to Fargate killing our container. By aggregating and analyzing our memory usage, we can determine if our memory requests are within an acceptable range. It can also help us determine if we have any cyclic behaviors which might impact our requested mins and maxes.
If you are regularly using less than your minimum request, reduce it. That should reduce costs, but keep in mind that Fargate may move your workload to a smaller compute resource, so you'll also want to watch for any performance impact.
Setting alerts on memory utilization should be approached similarly to the CPU metrics we set above. Set one for a static high value in alignment with your risk profile, set one for a sudden change or seasonal value, and consider setting an alert on low memory utilization. When choosing your high alerts, keep in mind that a container (or pod) that asks for more than its maximum allowed memory will be killed.
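To help with that right-sizing decision, here's a sketch that pulls a week of ECS memory utilization from CloudWatch so you can compare average and peak usage against what you've reserved. The cluster and service names are placeholders.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Fetch a week of hourly memory utilization for one service.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},   # placeholder names
        {"Name": "ServiceName", "Value": "my-service"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Average", "Maximum"],
)

points = stats["Datapoints"]
if points:
    avg = sum(p["Average"] for p in points) / len(points)
    peak = max(p["Maximum"] for p in points)
    # Low averages suggest the reservation can shrink; peaks near 100% argue the opposite.
    print(f"avg {avg:.1f}% / peak {peak:.1f}% of reserved memory over the last week")
```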
A small side note: memory can have an impact on application performance. Too little memory may cause resource starvation, unnecessarily slowing or stopping applications and leaving users unhappy. Applications, in turn, can suffer from memory leaks, which can eat memory until failure. Having APM (Application Performance Monitoring) aligned and in step with your Fargate monitoring can quickly reveal the true impact your memory choices have on your environment.
Even though you are tracking the individual elements, you also need to keep an eye on the Fargate cluster usage. Currently, you cannot exceed a hard limit per region:
#ECS tasks + #EKS pods ≤ 100 per region
(or in words, the number of ECS tasks plus the number of EKS pods cannot exceed 100 per region)
Fortunately, there are metrics we can watch to help us keep things under control.
| Name | Description | Metric | Target |
| --- | --- | --- | --- |
| ECS current task count | # of tasks in the cluster: desired, running, pending | Utilization | |
| ECS service count | # of services currently running | Utilization | Splunk Infrastructure Monitoring |
| EKS current pod count | # of available pods | Utilization | |
| EKS desired pod count | # of desired pods | Utilization | Splunk Kubernetes Navigator |
Per the earlier comment, we do have some hard limits. So setting an alert on the sum of ECS tasks and EKS pods makes good sense. Likewise, keeping an eye on a continual dashboard will help you understand your changes over time and potential seasonality issues.
Also, Fargate capacity is elastic, and it will attempt to provide resources for any new task or pod you launch. By comparing the desired-state metric against the current count, you get a heads-up on potential errors in creation.
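Here's a rough sketch of tallying running Fargate workloads against that regional quota: it counts ECS tasks launched on Fargate via boto3 and EKS pods carrying the eks.amazonaws.com/fargate-profile label via the Kubernetes client. The quota value and the 80% warning threshold are assumptions to adapt to your own limits, and pagination is omitted for brevity.

```python
import boto3
from kubernetes import client, config

REGION_QUOTA = 100  # default Fargate task/pod quota per region (adjustable via AWS support)

# Count running ECS tasks launched on Fargate across all clusters in the region.
ecs = boto3.client("ecs")
task_count = 0
for cluster_arn in ecs.list_clusters()["clusterArns"]:
    tasks = ecs.list_tasks(cluster=cluster_arn, launchType="FARGATE", desiredStatus="RUNNING")
    task_count += len(tasks["taskArns"])  # first page only; paginate for large clusters

# Count EKS pods scheduled on Fargate (they carry the Fargate profile label).
config.load_kube_config()  # assumes kubeconfig points at the EKS cluster
pods = client.CoreV1Api().list_pod_for_all_namespaces(
    label_selector="eks.amazonaws.com/fargate-profile"
)
pod_count = len(pods.items)

total = task_count + pod_count
print(f"{total}/{REGION_QUOTA} Fargate tasks + pods in use")
if total > 0.8 * REGION_QUOTA:
    print("Warning: approaching the regional Fargate quota")
```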
So far we’ve looked at the direct resources that make up our Fargate compute environment. However, AWS services don’t often stand alone. We need to keep an eye on other services, like databases, NGINX, load balancers and the like.
Both ECS and EKS support persistent storage via Amazon EFS. Keeping tabs on your storage is of major importance to the health of your environment and apps, as well as a principal part of USE monitoring. Similarly, networking plays a major role in modern applications, so keep an eye on the network metrics (dig into the VPC flow logs, for example).
You might choose to run your Kubernetes jobs in EKS on Fargate using AWS Step Functions. You might also want to send your Fargate logs to Splunk. Fargate is amazingly flexible and is designed to work with minimal or no changes to your current process.
You can use Amazon CloudWatch Container Insights to get visibility into ECS and Fargate. As AWS's native monitoring service, CloudWatch is your go-to for the primary metric data. And as a partner with AWS on the AWS EKS Distro, we've made monitoring Kubernetes on AWS a straightforward, turnkey solution, including the metrics that will give you the best insights.
We’ll be covering Fargate in more details in our upcoming webinar “Scaling Kubernetes with Splunk and AWS.” Join us live or drop in on the recording after the event.
But it doesn’t stop here. You can find out more about how Splunk Observability works with AWS to get you meaningful insights. Getting started with monitoring for ECS / EKS / Fargate is straightforward.
Get started with a free trial and start monitoring Fargate deployments with Splunk Infrastructure Monitoring today.
----------------------------------------------------
Thanks!
Dave McAllister
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date and availability in 21 regions around the world. It offers an open, extensible data platform that supports shared data across any environment, so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.