Imagine this scenario: your platform appears to have an issue. Maybe it has gone down, maybe it has affected a large volume of users, or perhaps just a few of those important ones; either way, there is a significant problem. Users are complaining and are happy to shout about the platform not working on X (formerly Twitter). However, the traditional, siloed monitoring tools tell a different story: the various dashboards show everything as green, the platform components all appear to be working, and there is no issue to investigate.
So why does this all-too-common situation happen? Traditional monitoring approaches struggle in these environments. They are rooted in polling a standard set of platform metrics, say every minute, they sample transaction trace data - in some cases collecting only about 5% of the traces available - and they do not collect the rich data contained in the logs. This approach has major visibility gaps and doesn't capture the right data to identify whether the platform is having an issue, hence the 'everything is green on the dashboard' situation above. Nor does it provide the richness of log data needed to troubleshoot issues quickly.
Observability, with OpenTelemetry, is the key to managing these digital platforms, and it is based on the capture and analysis of three types of telemetry: metrics, traces and logs. Streaming metrics in real time - eliminating the polling approach - combined with the customised metrics that you choose, quickly identify whether there is a problem within the platform. Full-fidelity, no-sampling tracing, meaning that all traces are collected and analysed, tells us where that problem is, and logs tell us the root cause. That last type has become especially important in today's digital world; by harnessing log data we can quickly find the root cause of issues affecting these platforms. It is the contextual information contained within that data that allows us to understand the true root cause.
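To make the three signals concrete, here is a minimal sketch of emitting a metric, a trace and a correlated log with the OpenTelemetry Python SDK. The service name, metric name and console exporters are illustrative assumptions rather than a prescription for any particular backend, and the logs modules may still be marked experimental depending on your SDK version.

```python
import logging

from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader, ConsoleMetricExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler  # experimental in some SDK versions
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter

# Hypothetical service name used to tag all three signals.
resource = Resource.create({"service.name": "checkout-service"})

# Traces: every span is exported, no sampling configured here.
trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer(__name__)

# Metrics: exported periodically rather than polled by an external tool.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
meter = metrics.get_meter(__name__)
checkout_counter = meter.create_counter("checkouts", description="Completed checkouts")

# Logs: routed through the OpenTelemetry handler so they carry the active trace context.
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(BatchLogRecordProcessor(ConsoleLogExporter()))
logging.getLogger().addHandler(LoggingHandler(logger_provider=logger_provider))
log = logging.getLogger("checkout")

with tracer.start_as_current_span("process-order"):
    checkout_counter.add(1, {"payment.method": "card"})
    log.warning("payment gateway latency above threshold")
```

Because the log record is emitted inside the span, it can later be correlated back to the exact request that produced it, which is the contextual link described above.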
What is perhaps less well known is the ability to use that data to answer broader questions about the platform, ranging from why a system behaves the way it does to insights that would significantly help technical, marketing and business teams. By adding AI/ML into the mix, we can use that contextual data and its patterns for prediction - like the potential cause of the next outage - and to find anomalies or patterns in the data. At Splunk we call this the snowball effect: as our customers use the platform to answer questions across correlated data, more questions arise, yielding more answers from the data, and so on.
In this blog, I explore the typical challenges of harnessing that log data at enterprise scale, given how large these environments have become, and how Splunk's expertise and longevity in managing both unstructured and structured data can be applied to the world of observability to power that all-important logs pillar.
A log is an alphanumeric string of arbitrary length, generated by a computing or communication subsystem, intended to convey information about the state or state transitions taking place within that subsystem. It is this information that makes logs and log data vitally important in today's modern platforms. This data, though, is not without its challenges: there are no standards or common conventions either for writing it out - developers and vendors each have their own approaches - or for how it can be ingested, interpreted and used. An error code in one subsystem may mean something completely different in another. Furthermore, logs vary from subsystem to subsystem, from application to application, or in the case of today's modern platforms, from microservice to microservice, and they are often highly flexible and subject to different interpretations.
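As a simple illustration of that lack of standards, consider two hypothetical log lines describing the same failed request: one in a classic access-log format, one in JSON. Each needs its own parsing logic before the events can be correlated; the field names and values below are made up for illustration.

```python
import json
import re

# Two records for the same failure, written by different subsystems in different formats.
nginx_line = '10.0.0.7 - - [12/Mar/2024:10:14:32 +0000] "POST /api/pay HTTP/1.1" 502 157'
app_line = '{"ts": "2024-03-12T10:14:32Z", "level": "ERROR", "msg": "upstream timeout", "code": "E502"}'

# The access log needs a regular expression; the application log is structured JSON.
nginx_event = re.match(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+) (?P<bytes>\d+)',
    nginx_line,
).groupdict()
app_event = json.loads(app_line)

# Same failure, different conventions: "502" in one system, "E502" in the other.
print(nginx_event["status"], app_event["code"])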
Harnessing the power of log data has been difficult for a number of reasons:
These challenges all lead to the same typical outcome - this data is not used and therefore you miss out on the important and contextualised information that the data contains.
A popular misconception in the world of observability is that only certain types of logs are important, such as Kubernetes, Java or application logs. In fact, all the different data sets out there are hugely important, including the customised data sets that you might have in your environment. So when we talk about logs here at Splunk, we actually mean all the varied data sets in the environment. This matters because true observability is powered by context, and it is that contextual information that we derive from all this data. That is why the third pillar of O11y is so key: it provides the contextual visibility into the platform needed to get to the root cause quickly.
Splunk has over two decades of experience in harnessing the power of logs and other key data, from both an O11y and a security perspective, with its award-winning data platform. This is what powers the logs pillar of observability, and it is fully integrated with metrics and traces - so you can easily drill down from either to aid that rapid troubleshooting process and get to root cause. The platform is built on the following unique principles:
Once the data has been ingested into Splunk, you can do some great stuff with the data:
Whilst it is great to simply ingest the data, what we have found is the need to control, filter and process the incoming data before it is ingested into Splunk, and this is where Splunk's Edge Processor comes in. Splunk uses the Splunk Processing Language (SPL) as its syntax, allowing users to apply one language across all data inputs rather than learning multiple technology-specific languages. Edge Processor takes this a step further by using SPL2, a fork of standard SPL designed for big streaming data - for more info on this exciting new feature, please click here. This allows users to easily filter out data that is not required, deduplicate data so that they are not doubling up on storage, and perform processing prior to ingestion - essentially performing ETL during a stream - giving both faster visibility and more focused data in your O11y environment. Furthermore, Splunk provides an equivalent capability for metrics, using the Metric Pipeline Manager (MPM) technology to control and manage the ingestion of metrics. The sketch below illustrates the pattern conceptually.
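To be clear, the following is not SPL2; it is a conceptual Python illustration of the same pre-ingest pattern - filter out low-value events, deduplicate, and transform in-stream before anything is indexed. The field names, the masking rule and the routing metadata are assumptions made for the example.

```python
import hashlib
import json

seen_digests = set()

def process_stream(raw_events):
    """Yield only the events worth indexing: drop noise, deduplicate, and enrich."""
    for raw in raw_events:
        event = json.loads(raw)

        # Filter: drop low-value events before they consume ingest and storage.
        if event.get("level") == "DEBUG":
            continue

        # Deduplicate: skip events whose payload has already been forwarded.
        digest = hashlib.sha256(raw.encode()).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)

        # Transform (the "T" in ETL): mask a sensitive field and add routing metadata.
        if "card_number" in event:
            event["card_number"] = "****"
        event["index"] = "payments"

        yield event
```

The point is simply that the filtering, deduplication and transformation happen in the stream, before the data lands, rather than as a clean-up job afterwards.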
The data is fully integrated with both metrics and traces, providing a complete and granular view of the observed platform, and is accessed within a single user interface. For example, once an issue has been identified through a metric, you can click and drill down into the relevant log data to obtain the contextual information and the root cause of why the issue occurred. Likewise for traces - automatic filtering of the relevant log data enables an easy drill-down so that the root cause can be found quickly, accelerating mean time to resolution. We can equally go the other way: from a selected log back to the corresponding trace or metric, to provide additional context about what was happening during the execution of a request - which microservice and code version the request hit - or how the system was performing from a metrics perspective, such as a noisy-neighbour incident with the pods. Splunk can also generate metrics from those logs, which allows you to track key information from the data and report on it. Splunk's ML/AI engine can also analyse that data to identify anomalies, baseline performance, assist with troubleshooting by identifying the root cause, and predict performance based on what has happened before.
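As a simple illustration of generating a metric from log data - not a Splunk API, just a hedged sketch over made-up, already-parsed events - here is how an error-count-per-minute series could be derived and then tracked or alerted on.

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsed log events; only the timestamp and level are used here.
events = [
    {"ts": "2024-03-12T10:14:32Z", "level": "ERROR"},
    {"ts": "2024-03-12T10:14:45Z", "level": "INFO"},
    {"ts": "2024-03-12T10:15:03Z", "level": "ERROR"},
]

# Bucket error events by minute to turn raw log lines into a time series.
errors_per_minute = Counter(
    datetime.fromisoformat(e["ts"].replace("Z", "+00:00")).strftime("%Y-%m-%d %H:%M")
    for e in events
    if e["level"] == "ERROR"
)

# Counter({'2024-03-12 10:14': 1, '2024-03-12 10:15': 1})
print(errors_per_minute)
```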
Harnessing the power of this data is key to visibility into modern platforms and will accelerate the ability to troubleshoot and resolve issues much faster, ensuring that the platform meets three or four nines of availability and allowing the development team to focus on innovating rather than troubleshooting and fixing code.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company - with over 7,500 employees, more than 1,020 patents to date and availability in 21 regions around the world - and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.