Amazon Elastic Load Balancing (ELB) lets websites and web services serve more requests by adding servers as demand requires. There are several challenges to operating load balancers, as discussed in a previous blog post: Microservices Load Balancing: Navigating the Waves of Modern Architecture. An unhealthy ELB can take your website offline or slow it to a crawl. The right dashboards and meaningful metrics provide insights to remediate issues faster, and a powerful analytics engine makes alerts smarter.
A load balancer distributes load across all your servers to ensure even usage of capacity, taking into account the type of services offered by each server, whether each server is healthy, and the demand on the server. One key benefit of load balancing is that it provides your website with fault tolerance. If any of the servers is unhealthy or encounters a critical error, the load balancer will stop routing to that server and deliver requests to healthy servers instead. This makes your application or website more reliable because it can adapt to failures while still delivering a good user experience.
The ability to grow by adding more servers is where ELB gets the name Elastic. You can automatically add more servers based on demand, also known as autoscaling. This can help prevent the “hug of death” when an app experiences a sudden spike in traffic that your existing servers cannot handle. Rather than getting an alert at odd hours of the day to manually add servers, you can rest assured your app will react automatically to deliver good service.
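As a concrete illustration, here's a minimal sketch of wiring up autoscaling with boto3. The Auto Scaling group name, policy name, and 60% CPU target are hypothetical placeholders, not values from this post:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking keeps average CPU near 60% by adding instances as
# traffic rises and removing them as it falls. "web-asg" is a
# hypothetical Auto Scaling group attached to your load balancer.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="keep-cpu-near-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```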
Amazon offers two types of load balancers, each with its own strengths.
1. Classic load balancers have been available for many years. They can balance HTTP/HTTPS as well as TCP traffic, and they have basic health check and routing abilities.
2. Application load balancers (ALBs) are a newer type recently released by Amazon. They can route requests based on their content, which is great for applications that comprise several containers or microservices per host. ALBs also add support for HTTP/2 and WebSockets, and offer enhanced metrics for monitoring; a sketch of content-based routing follows this list.
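To make content-based routing concrete, here's a minimal sketch using boto3's elbv2 API. The listener and target group ARNs are hypothetical placeholders; the rule forwards any /api/* request to a dedicated target group:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical ARNs; substitute the listener and target group from
# your own environment.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/abc/def"
API_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api-service/123"

# Requests whose path matches /api/* are forwarded to the api-service
# target group; everything else falls through to the default action.
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/api/*"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": API_TARGET_GROUP_ARN}],
)
```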
Problems with ELB that may cause an outage include configuration errors with the load balancer, network or security settings, and problems with your backend service. Your monitoring tools will give you the information needed to troubleshoot and fix these issues, but the type of data and the speed to insight will vary greatly depending on which tool you use.
Amazon CloudWatch, the standard monitoring tool for AWS, offers basic data about the number of healthy hosts, latency, number of requests, error rates, and more. These metrics can be plotted on dashboards, and you can create basic threshold-type alerts.
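If you want to pull those numbers programmatically rather than from the console, here's a minimal sketch using boto3, assuming a classic load balancer named lb-app-bb (the name used in the charts below):

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average latency for the lb-app-bb classic load balancer in
# one-minute periods over the last hour.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "lb-app-bb"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
# CloudWatch reports Latency in seconds; print it in milliseconds.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"] * 1000, 1), "ms")
```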
Amazon CloudTrail will give you an audit log of any API calls made on ELB, including creating or deleting a load balancer and changing the configuration settings. If you think that a user or a script may have made changes leading to a production problem, CloudTrail gives you the information you need. These logs are stored in an S3 bucket, allowing you to analyze them using the Amazon Elasticsearch Service or your favorite log management solution.
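For a quick look without standing up a log pipeline, the CloudTrail API can also be queried directly. A minimal sketch with boto3 that lists recent ELB API calls and who made them:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Recent API calls against the ELB service, e.g. creating or deleting
# a load balancer or changing its configuration settings.
events = cloudtrail.lookup_events(
    LookupAttributes=[{
        "AttributeKey": "EventSource",
        "AttributeValue": "elasticloadbalancing.amazonaws.com",
    }],
    MaxResults=20,
)
for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username", "unknown"))
```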
Access logs provide a record of every individual request made to the load balancer, including the status code for each one. Where CloudWatch metrics offer aggregate-level information, access logs provide the individual records for more detailed analysis. ELB stores these logs in an S3 bucket, just like CloudTrail logs.
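Access logging is off by default on classic load balancers. Here's a minimal sketch of enabling it with boto3; the bucket name and prefix are hypothetical, and the bucket policy must allow ELB to write to it:

```python
import boto3

elb = boto3.client("elb")  # classic ELB API

# Publish an access log file to S3 every 5 minutes for lb-app-bb.
elb.modify_load_balancer_attributes(
    LoadBalancerName="lb-app-bb",
    LoadBalancerAttributes={
        "AccessLog": {
            "Enabled": True,
            "S3BucketName": "my-elb-access-logs",  # hypothetical bucket
            "S3BucketPrefix": "prod/lb-app-bb",
            "EmitInterval": 5,
        }
    },
)
```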
Now that we’ve covered the basics of ELB, let’s take a look at the top metrics to monitor to maintain a healthy ELB.
1. Number of Load Balancers
The number of load balancers typically doesn’t change much over time, so it’s shown as an easy-to-read count instead of a timeseries. Additionally, it gives you context to understand the distributions and rankings of load balancers shown later.
2. Latency Over Last Minute
It’s important to keep a close eye on latency because it’s directly tied to user experience. If your load balancing latency is too high, your application or website could frustrate users, and you might lose opportunities and miss SLAs.
If you only looked at average latency over time, you wouldn’t see crucial details about user experience. We recommend visualizing latency as percentiles so you have more insight into the best—and worst—performers. In the chart above, you can see the maximum latency in dark pink, the 90th percentile in light pink, and the minimum in green. The max latency of 20 seconds spiked around 14:00, but the 90th percentile was flat, indicating that this was a temporary spike affecting less than 10% of requests.
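If you're pulling these numbers yourself, CloudWatch can return percentile statistics directly. A minimal sketch, again assuming the lb-app-bb load balancer; note that GetMetricStatistics accepts either Statistics or ExtendedStatistics in a single call, not both:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# p50/p90/p99 latency in five-minute periods, so a brief max spike can
# be distinguished from a broad regression affecting most requests.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "lb-app-bb"}],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
    Period=300,
    ExtendedStatistics=["p50", "p90", "p99"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    stats = point["ExtendedStatistics"]
    print(point["Timestamp"], stats["p50"], stats["p90"], stats["p99"])
```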
3. LBs with Worst Average Latency (ms)
When you’re troubleshooting a latency problem, the next thing you need to see is which load balancers are affected. In this case, you see that lb-app-bb has the greatest latency. To explore latency more deeply, you could look into the access logs to see which specific requests on that load balancer were slow.
4. Total Requests/min
This chart shows the total sum of requests across all load balancers. An increase in requests per minute might be correlated with an increase in latency. Also, you're probably used to operating your service within a certain range of requests per minute; if this number falls far outside that range, there is likely a problem with routing or with upstream clients.
5. Requests/min
Latency problems and errors can sometimes be explained by spikes or increases in traffic over time. This chart shows how requests/min is distributed among the different load balancers. Essentially, this helps answer the question of how balanced the system is at a specific load balancer level. Here, we can see the band from min to max is narrow, which is good. If the band were wide, that would mean different load balancers are fielding different numbers of requests/min.
6. Top LBs by Requests/min
When the number of requests is higher than you usually expect, or there's a spike in requests, the next step toward narrowing down the cause is to determine which load balancer is affected. Here you can see that lb-ingest is taking the brunt of the traffic. You actually expect lb-ingest to take far more traffic than lb-intern, so this chart is showing normal behavior.
Alone, any of these Requests/min charts can indicate a moment-in-time issue that could be cause for concern. But looking at them collectively as part of a dashboard and applying solid alerting logic provides a stronger case for identifying a problematic trend and quickly isolating the cause before it leads to a load issue that impacts performance.
7. Top Frontend Errors/min
Frontend errors are the errors the load balancer returns to the client. The load balancer will actually retry the call to the server if it encounters an error. Here you can see that there are no frontend errors, and the service is responding as expected.
8. Highest Backend Error %
Backend errors are defined as errors between the load balancer and the server. When an error code is returned from a backend server call, ELB may retry the call. Additionally, this count includes errors returned during health checks. Thus, the number of backend errors may be higher than the number of frontend errors. In this case we see that the load balancer with the highest percentage of errors is lb-app-bb.
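The backend error percentage can be approximated from raw CloudWatch sums. A minimal sketch, assuming the lb-app-bb load balancer and a one-hour window:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def metric_sum(metric_name, lb_name):
    """Sum of a per-load-balancer metric over the last hour."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric_name,
        Dimensions=[{"Name": "LoadBalancerName", "Value": lb_name}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(p["Sum"] for p in response["Datapoints"])

errors = metric_sum("HTTPCode_Backend_5XX", "lb-app-bb")
requests = metric_sum("RequestCount", "lb-app-bb")
if requests:
    print(f"backend 5XX error rate: {100 * errors / requests:.2f}%")
```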
9. Top Backend Connection Errors/min
This chart shows the aggregate count that corresponds to the percentage in the prior chart. The percentage can be misleading if the count is a low number, so it’s useful to make sure the aggregate has enough data to make the percentage meaningful.
10. LBs with Highest Unhealthy Host %
It’s important to keep track of the percentage of hosts that are unhealthy for each load balancer. If this number reaches 100%, you’ll likely have a complete service outage. Additionally, you can act proactively when you notice there are too many unhealthy hosts. This may give you time to fix the issue before end-users are impacted. In the example below, the intern load balancer has a slightly higher percentage of unhealthy hosts (but it’s a test system).
11. Requests/min 7d Change %
Looking at long-term patterns can help troubleshoot issues that repeat over time due to daily or weekly trends. For example, if you notice that latency is increasing in parallel with the request rate, it may be due to a server load issue.
12. Latency 7d Change %
Latency over the last seven days can be another helpful way to determine if there has been a pattern due to daily or weekly trends. You can compare this with the requests over the last seven days in the previous chart to see if they correlate. Additionally, you may have daily or weekly batch jobs such as deployments or cleanup jobs that affect latency. Deployments are especially important to watch because new versions of your service could cause slower or faster performance.
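A week-over-week comparison like this can also be scripted against CloudWatch. A minimal sketch, assuming lb-app-bb, that compares the current hour's average latency with the same hour seven days earlier:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def hourly_average_latency(lb_name, end):
    """Average ELB latency (seconds) for the hour ending at `end`."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName="Latency",
        Dimensions=[{"Name": "LoadBalancerName", "Value": lb_name}],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )
    points = response["Datapoints"]
    return points[0]["Average"] if points else None

now = datetime.utcnow()
current = hourly_average_latency("lb-app-bb", now)
week_ago = hourly_average_latency("lb-app-bb", now - timedelta(days=7))
if current is not None and week_ago:
    print(f"latency 7d change: {100 * (current - week_ago) / week_ago:+.1f}%")
```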
Splunk Infrastructure Monitoring ELB Dashboard
Splunk Infrastructure Monitoring offers a dashboard out of the box that shows you the most important ELB metrics at a glance. This removes some of the guesswork typically associated with new service adoption and will help you identify and fix problems faster. Going beyond basic CloudWatch metrics, Splunk Infrastructure Monitoring shows aggregate data from many load balancers and instances and provides rankings and distributions as well as comparisons over time.
One of the most powerful features Splunk Infrastructure Monitoring offers is the advanced SignalFlow analytics engine. SignalFlow analyzes raw metrics in real time as they stream from your environment to help you visualize, understand, and act on the conditions of your services when they matter most. Instead of alerting reactively after a problem has already happened, you can predict it based on insightful analytics like rate of change or variance and take steps to proactively avoid an issue before end-users are impacted.
One really good predictive measure for load balancers is the unhealthy host percentage. An alert on a simple threshold of unhealthy host count would say little about cluster health once the cluster autoscales over time. Losing one host would be a catastrophic outage if the group size were just one host; but in reality the group size is more likely 10 hosts, so losing one isn't an impactful loss. Still, most alerting tools prioritize health checks and host or node up/down. Anyone who has ever been on-call knows the pain of getting paged at 2am, only to learn the alert was practically useless.
A smarter alert would take into account the size of your cluster and calculate the percentage of hosts that are unhealthy. To set this up in Splunk Infrastructure Monitoring, we can open the chart titled LBs with Highest Unhealthy Host %. By changing it from a list to a timeseries graph showing trend over the past week, we can immediately see that we lost about half the hosts for the lb-lab-se load balancer. While it wasn’t a complete outage, an admin would want to be notified the next time this happens in order to take appropriate action.
You can see how we’re able to calculate the unhealthy host percentage in the section below the graph. Line A is the healthy host count summed by the name of the load balancer. Line B is the corresponding unhealthy host count. We can then calculate the percentage in line C as B/(A+B).
Next, we can create an alert detector based on this calculation and set a threshold so that an operations team would get an email if the derived metric ever goes above 50%.
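Here is how the whole A/B/C calculation and the 50% detector might look as a SignalFlow program. This is a sketch, assuming the signalfx-python client and a hypothetical access token; the same logic can be built directly in the Splunk Infrastructure Monitoring UI:

```python
import signalfx

# Lines A, B, and C mirror the chart: healthy and unhealthy host
# counts summed per load balancer, then the unhealthy percentage.
program = """
A = data('HealthyHostCount').sum(by=['LoadBalancerName'])
B = data('UnHealthyHostCount').sum(by=['LoadBalancerName'])
C = (B / (A + B)) * 100
detect(when(C > 50)).publish('Unhealthy host percentage above 50%')
"""

# Stream the computation to sanity-check it before saving a detector.
with signalfx.SignalFx().signalflow("MY_ACCESS_TOKEN") as flow:  # hypothetical token
    computation = flow.execute(program)
    for msg in computation.stream():
        print(msg)
```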
Taking this to the next level, many types of changes that affect the cluster take place gradually, one node at a time, so that the service stays available even as servers are cycled out. If you notice a trend of servers becoming unhealthy, you can fire an alert and stop the rollout before the service becomes unavailable. You could calculate the change over time for the unhealthy percentage and predict an eventual outage.
Amazon ELB can improve the reliability and performance of your services by adding servers elastically based on load and distributing traffic across them. Paired with Splunk Infrastructure Monitoring, you can trigger meaningful alerts to proactively address issues and troubleshoot your ELB before latency or an outage affects customers.