Systems going down because of an unforeseen incident? Got problems with your app or website? Is your audience missing out on products and services because your load times are too slow?
Then monitoring and observability (and telemetry) should be of interest to you!
In this long article, we’re covering everything! I’ll start with the concepts and how they work. Then I’ll move on to the real-world stuff that brings it all together: tools and examples so you can ensure the reliability of all the systems that power your business.
The quick summary:
Keep reading for more in-depth, expert information.
IT monitoring is a simple concept that’s sometimes harder in practice. It includes any activity that helps ensure digital equipment and services are working properly. Monitoring helps IT professionals detect issues and possibly resolve them. From a systems standpoint, monitoring can help with any question of the form, “Is this [system, app, network, etc.] working correctly?”
IT monitoring is a catch-all phrase, but the monitoring activity gets more specific depending on the use case (area) you need to monitor. Overall, IT monitoring can play a part in all areas of digital and IT services:
Though it will always depend on the area you’re monitoring, we can sum up monitoring as collecting and analyzing predefined data types (network bandwidth, CPU utilization rates, etc.) in order to detect abnormal behaviors that might indicate problems.
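To make that summary concrete, here is a minimal sketch of threshold-based monitoring in Python. The metric names, threshold values and sample readings are all hypothetical, chosen only for illustration; real tools typically use richer baselines than a fixed limit.

```python
from statistics import mean

# Hypothetical thresholds for two predefined metrics (illustrative values only).
THRESHOLDS = {
    "cpu_utilization_pct": 90.0,
    "network_bandwidth_mbps": 800.0,
}

def find_abnormal(samples, thresholds=THRESHOLDS):
    """Return {metric: average} for metrics whose window average exceeds its threshold."""
    alerts = {}
    for metric, values in samples.items():
        limit = thresholds.get(metric)
        if limit is None or not values:
            continue  # not a predefined metric, or nothing was collected
        average = mean(values)
        if average > limit:
            alerts[metric] = average
    return alerts

# One collection window of made-up readings.
window = {
    "cpu_utilization_pct": [88.0, 95.0, 97.0],
    "network_bandwidth_mbps": [120.0, 130.0, 110.0],
}
alerts = find_abnormal(window)
print(alerts)
```

Only the CPU metric is flagged here, because its average over the window is above the assumed 90% limit while bandwidth stays well under its limit.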
(Get all the details in our IT monitoring explainer.)
So, what sorts of IT areas can you monitor? Well, practically everything! See if your website is up, make sure your infrastructure has capacity for all its workloads, ensure APIs are responsive, identify security risks. You can monitor all these particular areas and probably a lot more:
Heck, you can even monitor your AWS environment and Kubernetes environments with these tips and metrics.
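As one example of the checks mentioned above ("is the website up?", "is the API responsive?"), here is a small sketch using only the Python standard library. The function names, the one-second latency budget and the URL are assumptions made for illustration; the fetch function is injectable so the example runs without network access.

```python
import time
import urllib.error
import urllib.request

def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, using only the standard library."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status

def check_endpoint(url, fetch=http_status, max_latency_s=1.0):
    """Classify an endpoint as up and responsive (assumed latency budget: 1 second)."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return {"url": url, "up": False, "responsive": False,
                "status": None, "latency_s": None}
    latency = time.monotonic() - start
    up = 200 <= status < 400
    return {"url": url, "up": up, "responsive": up and latency <= max_latency_s,
            "status": status, "latency_s": latency}

# Offline demonstration with a stand-in fetch function instead of a real request.
result = check_endpoint("https://example.invalid/health", fetch=lambda url: 200)
print(result["up"], result["responsive"])
```

In real use you would call `check_endpoint(url)` on a schedule and alert when `up` or `responsive` turns false.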
Within monitoring, we can group the tools into three main types:
(Yes, Splunk has a variety of monitoring tools. Explore them now or read on to the observability section for a fuller understanding.)
So, yes, monitoring has been around for decades and today it remains important. But with distributed systems (and distributed workers!), traditional monitoring does have clear limitations.
Today, most enterprises are using containers, microservices and Kubernetes in some capacity. These cloud-native technologies enable flexibility and agility and accelerate time-to-market. But, of course, they are too complicated for legacy monitoring approaches. There are a few reasons for this, as Spiros Xanthos describes:
Now, let’s turn to observability, which specifically aims to address these legacy challenges.
Where monitoring is an action you take, observability is an overall function or property of a system. The more you can observe a system, the more you can understand the complex ways it behaves. We no longer have to assume that various integrated services are a “black box” that we cannot see into.
But I encourage you to think more creatively about what this can mean. As Greg Leffler, head of observability practitioners, puts it:
“Observability is a mindset that enables you to answer any question about your entire business.”
Monitoring contributes to a system’s overall observability. So, with monitoring, you might be asking “is an individual piece (network, website, application or other service) up and running as expected?” With observability, you’re asking a bigger question: “How well is everything working?”
Previously, monitoring might alert you that your server’s CPU is spiking...but it can’t tell you which pod or container to go to, let alone if the spike is even something you need to worry about. So, no longer do we have to say “this system is too complicated to understand”. With observability, we can know so much more.
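The CPU example above can be sketched in a few lines of Python. A host-level alert only says "node-1 is busy"; if each sample also carries a pod label, you can drill down to the workload responsible. The sample data, field names and pod names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical CPU samples, each tagged with the pod that produced it.
samples = [
    {"host": "node-1", "pod": "checkout-7f9", "cpu_millicores": 1800},
    {"host": "node-1", "pod": "search-2b4", "cpu_millicores": 150},
    {"host": "node-1", "pod": "checkout-7f9", "cpu_millicores": 1900},
    {"host": "node-1", "pod": "cart-91c", "cpu_millicores": 200},
]

def cpu_by_pod(samples, host):
    """Average CPU per pod for one host."""
    readings = defaultdict(list)
    for sample in samples:
        if sample["host"] == host:
            readings[sample["pod"]].append(sample["cpu_millicores"])
    return {pod: sum(values) / len(values) for pod, values in readings.items()}

def top_consumer(samples, host):
    """Return the pod with the highest average CPU on the host, or None if no data."""
    averages = cpu_by_pod(samples, host)
    if not averages:
        return None
    return max(averages, key=averages.get)

averages = cpu_by_pod(samples, "node-1")
culprit = top_consumer(samples, "node-1")
print(culprit, averages[culprit])
```

The design point is the extra dimension on each data point: without the pod label, the same numbers could only be summed per host.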
One real-world example shows exactly this difference: PUMA uses Splunk to do a lot more than simply know whether its sites are up or down. After all, uptime is only the starting point: uptime alone doesn’t make a website or business succeed.
“Before using Splunk, PUMA’s basic monitoring capabilities could only indicate whether its e-commerce sites were up or down. This meant DevOps and business teams couldn’t detect critical issues that caused failed orders, such as unresponsive inventory systems or declined credit cards. The result was a significant number of missed sales opportunities.”
Observability relies on external outputs. Yes, observability does rely in part on your monitoring practices and chosen metrics. But it can also reveal “unknown unknowns” that monitoring cannot see. Let’s see how this came about.
Like monitoring, the concept of observability has been around a long time. Observability dates to academic research from the 1960s, but only much more recently has it entered the wide world of IT. We can point to two key drivers in the “sudden” interest in observability over the last decade or so:
Ultimately, observability can help system administrators understand unpredictable situations, which are most common in the distributed systems that enterprises support today.
For a system to be observable it requires two things: plenty (!!) of data as well as the tools necessary to aggregate and operate on that data.
Observability relies on three types of telemetry data: metrics, logs and traces. With this information, teams see deeply into complex systems, allowing them to investigate the root cause of many issues that monitoring alone wouldn’t point to. When a system is truly observable, teams can…
You’ll often hear the word “telemetry” associated with observability and monitoring. This is not a separate concept, but a supporting one: telemetry data is what enables a system to be truly observable. Telemetry data refers to the logs, metrics and traces in observability — what is sometimes called “the three pillars of observability”.
It’s important to understand that telemetry data enables a system to be observable — but these three items alone do not add up to observability. For that, we want to look at additional features we can layer in.
(Read the expert definition of MELT: metrics, events, logs & traces.)
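As a rough illustration of the three telemetry types and why they need to be connected to be useful, here is a minimal Python sketch. The class and field names are assumptions made for this example, not a standard schema; the key idea is that a shared trace ID lets you jump from a slow trace to the logs that explain it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Metric:
    name: str
    value: float
    timestamp: float

@dataclass
class LogRecord:
    message: str
    timestamp: float
    trace_id: Optional[str] = None  # set when the log was written during a traced request

@dataclass
class Span:
    trace_id: str
    operation: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start

def logs_for_slowest_span(spans, logs):
    """Find the slowest span and the log records that share its trace ID."""
    if not spans:
        return None, []
    slowest = max(spans, key=lambda span: span.duration)
    related = [log for log in logs if log.trace_id == slowest.trace_id]
    return slowest, related

# Made-up telemetry for two requests and one unrelated background job.
latency = Metric("checkout_latency_seconds", 2.4, timestamp=2.4)
spans = [
    Span("t-1", "GET /cart", start=0.0, end=0.12),
    Span("t-2", "POST /checkout", start=0.0, end=2.4),
]
logs = [
    LogRecord("cart loaded", 0.1, trace_id="t-1"),
    LogRecord("inventory service timed out", 2.3, trace_id="t-2"),
    LogRecord("nightly job started", 5.0),
]

slowest, related = logs_for_slowest_span(spans, logs)
print(latency.name, slowest.operation, [log.message for log in related])
```

The metric says that checkout is slow, the trace says which request was slow, and the correlated log suggests why.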
When moving from monitoring towards observability, you don’t have to tear down everything and start from scratch. You could decide to take what you already have and complement it with in-house or open-source software to bring it to an observable state. Of course, you can also look into an end-to-end observability solution (more on that later). So, what goes into a truly observable system?
Typically, four components are required to implement true observability:
If there’s one sentence to sum up all the benefits of observability, it’s this: Cloud complexity is easier to handle when you have true observability. Organizations today have hybrid architectures across the multicloud, plus hundreds of microservice-based apps. Complexity with little visibility — talk about burnout for every single one of your IT workers.
We conduct annual research into the global state of observability. Our most recent research from 2022 indicates that companies that lead at observability see benefits like:
Maturing in these areas also has knock-on effects like achieving true digital transformation, building resilience and attracting and retaining top talent.
Observability isn’t limited to one single improvement area—nor is it limited to helping a certain set of stakeholders. When you’ve matured to a truly observable IT organization, you can see benefits in all sorts of areas, including:
Observability products are designed to help developers, IT teams and other stakeholders monitor and manage complex systems, apps and infrastructure.
These companies today offer the most well-known observability solutions, all with their own features and capabilities — and inherent limitations. For example, some solutions focus solely on cloud-native environments, and others offer only distributed tracing or log analytics. Not all offer real-time streaming, either. Common observability tools on the market today include:
Depending on the specific needs and requirements of a particular organization, one or more of these observability products may be useful for improving visibility and managing software systems.
With Splunk Observability, you’ll solve problems in seconds. Our observability solution is the only solution available today that’s full-stack, analytics-powered and OpenTelemetry-native.
Splunk Observability has all the must-haves for observability: instrumentation, data correlation, root cause analysis, automation and machine learning. It also offers some features a lot of others do not have:
Real-time streaming. Today, the difference between minutes of latency and seconds can mean a lot. Splunk Observability is built on real-time streaming architectures, enabling you to detect and alert on critical patterns in mere seconds, no matter the data format or data structure.
Massively scalable. For large organizations and global enterprises, scalability is essential. Splunk Observability meets your needs no matter how large or how complex those needs are. How much scalability, you ask? Petabytes of daily log ingest and millions of metrics and traces per second, with no performance or response decreases.
With Splunk observability solutions, you can:
We’ve touched briefly on the OpenTelemetry framework. Because no commercial vendor has a single platform for collecting data from every one of your applications, OpenTelemetry was developed to solve this problem. This framework standardizes the way telemetry data is collected and moved to data platforms, like Splunk.
In addition to solving this major problem, OpenTelemetry has some knock-on benefits, too:
(BTW, we’re proud to have donated the OpenTelemetry eBPF collector.)
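The standardization idea described above can be sketched with the Python standard library alone. To be clear, this is not the OpenTelemetry API: every class and function name below is hypothetical. It only illustrates the design principle of instrumenting application code once against a vendor-neutral interface and swapping the destination behind it.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class TelemetryRecord:
    """Hypothetical vendor-neutral record (not an OpenTelemetry type)."""
    kind: str  # "metric", "log" or "trace"
    name: str
    attributes: dict = field(default_factory=dict)

class Exporter(Protocol):
    """Anything that can ship a record to some backend."""
    def export(self, record: TelemetryRecord) -> None: ...

class ConsoleExporter:
    def export(self, record: TelemetryRecord) -> None:
        print(f"[{record.kind}] {record.name} {record.attributes}")

class InMemoryExporter:
    """Keeps records in a list, which is handy for tests."""
    def __init__(self) -> None:
        self.records = []

    def export(self, record: TelemetryRecord) -> None:
        self.records.append(record)

class Pipeline:
    """Fans each record out to every configured exporter."""
    def __init__(self, exporters) -> None:
        self.exporters = list(exporters)

    def emit(self, kind: str, name: str, **attributes) -> None:
        record = TelemetryRecord(kind, name, attributes)
        for exporter in self.exporters:
            exporter.export(record)

def handle_order(pipeline: Pipeline, order_id: str) -> None:
    # Application code is instrumented once and never names a backend.
    pipeline.emit("trace", "handle_order", order_id=order_id)
    pipeline.emit("metric", "orders_processed", value=1)

memory = InMemoryExporter()
handle_order(Pipeline([ConsoleExporter(), memory]), order_id="A-1")
```

Changing where the data goes means changing the list passed to `Pipeline`, while `handle_order` stays untouched.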
To illustrate how observability moves far beyond monitoring, let’s look at Rappi, who successfully maximized observability. With the global pandemic, Rappi saw a 300% surge in on-demand orders across 250+ cities in Latin America. Today, they service 7.5 million active users per week.
So how do they ensure their mobile app, infrastructure and backend services stay available and reliable for their customers? They turned to Splunk Observability Suite to:
As you can see, moving to observability is a journey that results in overall business growth and resilience.
For self-service support with Splunk monitoring and observability, check out these resources: