Imagine this scenario: your platform appears to have an issue. Maybe it has gone down, maybe it has affected a large volume of users, or perhaps just a few of those important ones; either way, there is a significant problem. Users are complaining and are happy to shout about the platform not working on X (formerly Twitter). However, the traditional, siloed monitoring tools tell a different story: the various dashboards show everything as green, the platform components all appear to be working, and there is no issue to investigate.
So why does this all-too-common situation happen? Traditional monitoring approaches struggle in these environments. They are rooted in polling a standard set of platform metrics, say every minute, they sample transaction trace data - in some cases collecting only about 5% of the traces available - and they do not collect the rich data contained in the logs. This approach has major visibility gaps and doesn't capture the right data to identify whether the platform is having an issue, hence the 'everything is green on the dashboard' situation above. Nor does it provide the richness of log data needed to troubleshoot issues quickly.
Observability, with OpenTelemetry, is the key to managing these digital platforms, and it is based on the capture and analysis of three types of telemetry: metrics, traces and logs. Streaming metrics in real time - eliminating the polling approach - combined with the customised metrics that you choose, quickly identify whether there is a problem within the platform. Full-fidelity, no-sampling tracing, meaning that all traces are collected and analysed, tells us where that problem is, and logs tell us the root cause. That last type has become especially important in today's digital world; by harnessing log data we can quickly find the root cause of issues affecting these platforms. It is the contextual information contained within that data that allows us to understand the true root cause.
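To make the three signals concrete, here is a minimal sketch of emitting a metric, a trace and a correlated log with the OpenTelemetry Python SDK. The service name, metric name and console exporters are illustrative assumptions rather than a prescription for any particular backend, and the logs modules may still be marked experimental depending on your SDK version.

```python
import logging

from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader, ConsoleMetricExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler  # experimental in some SDK versions
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter

# Hypothetical service name used to tag all three signals.
resource = Resource.create({"service.name": "checkout-service"})

# Traces: every span is exported, no sampling configured here.
trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer(__name__)

# Metrics: exported periodically rather than polled by an external tool.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
meter = metrics.get_meter(__name__)
checkout_counter = meter.create_counter("checkouts", description="Completed checkouts")

# Logs: routed through the OpenTelemetry handler so they carry the active trace context.
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(BatchLogRecordProcessor(ConsoleLogExporter()))
logging.getLogger().addHandler(LoggingHandler(logger_provider=logger_provider))
log = logging.getLogger("checkout")

with tracer.start_as_current_span("process-order"):
    checkout_counter.add(1, {"payment.method": "card"})
    log.warning("payment gateway latency above threshold")
```

Because the log record is emitted inside the span, it can later be correlated back to the exact request that produced it, which is the contextual link described above.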
What is perhaps less well known is the ability to use that data to answer broader questions about the platform, ranging from why a system behaves the way it does to insights that would significantly help technical, marketing and business teams. By adding AI/ML into the mix, we can use that contextual data and its patterns for prediction - like the potential cause of the next outage - and to find anomalies or patterns in the data. At Splunk we call this the snowball effect: as our customers use the platform to answer questions across correlated data, more questions arise, yielding more answers from the data, and so on.
In this blog, I explore the typical challenges of harnessing that log data at enterprise scale, given how large these environments have become, and how Splunk's expertise and longevity in managing both unstructured and structured data can be applied to the world of observability to power that all-important logs pillar.
A log is an alphanumeric string of arbitrary length, generated by a computing or communication subsystem, intended to convey information about the state or state transitions taking place within that subsystem. It is this information that makes logs and log data vitally important in today's modern platforms. This data, though, is not without its challenges: there are no standards or common conventions either for writing it out - developers and vendors each have their own approaches - or for how it can be ingested, interpreted and used. An error code in one subsystem may mean something completely different in another. Furthermore, logs vary from subsystem to subsystem, from application to application, or in the case of today's modern platforms, from microservice to microservice, and they are often highly flexible and subject to different interpretations.
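As a simple illustration of that lack of standards, consider two hypothetical log lines describing the same failed request: one in a classic access-log format, one in JSON. Each needs its own parsing logic before the events can be correlated; the field names and values below are made up for illustration.

```python
import json
import re

# Two records for the same failure, written by different subsystems in different formats.
nginx_line = '10.0.0.7 - - [12/Mar/2024:10:14:32 +0000] "POST /api/pay HTTP/1.1" 502 157'
app_line = '{"ts": "2024-03-12T10:14:32Z", "level": "ERROR", "msg": "upstream timeout", "code": "E502"}'

# The access log needs a regular expression; the application log is structured JSON.
nginx_event = re.match(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+) (?P<bytes>\d+)',
    nginx_line,
).groupdict()
app_event = json.loads(app_line)

# Same failure, different conventions: "502" in one system, "E502" in the other.
print(nginx_event["status"], app_event["code"])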
Harnessing the power of log data has been difficult for a number of reasons:
These challenges all lead to the same typical outcome - this data is not used and therefore you miss out on the important and contextualised information that the data contains.
A popular misconception in the world of observability is that only certain types of logs are important, such as Kubernetes, Java or application logs. In fact, all the different data sets out there are hugely important, including the customised data sets that you might have in your environment. So when we talk about logs here at Splunk, we actually mean all the varied data sets in the environment. This matters because true observability is powered by context, and it is that contextual information that we derive from all this data. That is why the third pillar of O11y is so key: it provides the contextual visibility into the platform needed to get to the root cause quickly.
Splunk has over two decades of experience in harnessing the power of logs and other key data, from both an O11y and a security perspective, with its award-winning data platform. This is what powers the logs pillar of observability, and it is fully integrated with metrics and traces - so you can easily drill down from either to aid that rapid troubleshooting process and get to root cause. The platform is built on the following unique principles:
Once the data has been ingested into Splunk, you can do some great stuff with the data:
Whilst it is great to simply ingest the data, what we have found is the need to control, filter and process the incoming data before it is ingested into Splunk, and this is where Splunk's Edge Processor comes in. Splunk uses the Splunk Processing Language (SPL) as its syntax, allowing users to apply one language across all data inputs rather than learning multiple technology-specific languages. Edge Processor takes this a step further by using SPL2, a fork of standard SPL designed for big streaming data - for more info on this exciting new feature, please click here. This allows users to easily filter out data that is not required, deduplicate data so that they are not doubling up on storage, and perform processing prior to ingestion - essentially performing ETL during a stream - giving both faster visibility and more focused data in your O11y environment. Furthermore, Splunk provides an equivalent capability for metrics, using the Metric Pipeline Manager (MPM) technology to control and manage the ingestion of metrics. The sketch below illustrates the pattern conceptually.
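To be clear, the following is not SPL2; it is a conceptual Python illustration of the same pre-ingest pattern - filter out low-value events, deduplicate, and transform in-stream before anything is indexed. The field names, the masking rule and the routing metadata are assumptions made for the example.

```python
import hashlib
import json

seen_digests = set()

def process_stream(raw_events):
    """Yield only the events worth indexing: drop noise, deduplicate, and enrich."""
    for raw in raw_events:
        event = json.loads(raw)

        # Filter: drop low-value events before they consume ingest and storage.
        if event.get("level") == "DEBUG":
            continue

        # Deduplicate: skip events whose payload has already been forwarded.
        digest = hashlib.sha256(raw.encode()).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)

        # Transform (the "T" in ETL): mask a sensitive field and add routing metadata.
        if "card_number" in event:
            event["card_number"] = "****"
        event["index"] = "payments"

        yield event
```

The point is simply that the filtering, deduplication and transformation happen in the stream, before the data lands, rather than as a clean-up job afterwards.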
The data is fully integrated with both metrics and traces, providing a complete and granular view of the observed platform, and is accessed within a single user interface. For example, once an issue has been identified through a metric, you can click and drill down into the relevant log data to obtain the contextual information and the root cause of why the issue occurred. Likewise for traces - automatic filtering of the relevant log data enables an easy drill-down so that the root cause can be found quickly, accelerating mean time to resolution. We can equally go the other way: from a selected log back to the corresponding trace or metric, to provide additional context about what was happening during the execution of a request - which microservice and code version the request hit - or how the system was performing from a metrics perspective, such as a noisy-neighbour incident with the pods. Splunk can also generate metrics from those logs, which allows you to track key information from the data and report on it. Splunk's ML/AI engine can also analyse that data to identify anomalies, baseline performance, assist with troubleshooting by identifying the root cause, and predict performance based on what has happened before.
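As a simple illustration of generating a metric from log data - not a Splunk API, just a hedged sketch over made-up, already-parsed events - here is how an error-count-per-minute series could be derived and then tracked or alerted on.

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsed log events; only the timestamp and level are used here.
events = [
    {"ts": "2024-03-12T10:14:32Z", "level": "ERROR"},
    {"ts": "2024-03-12T10:14:45Z", "level": "INFO"},
    {"ts": "2024-03-12T10:15:03Z", "level": "ERROR"},
]

# Bucket error events by minute to turn raw log lines into a time series.
errors_per_minute = Counter(
    datetime.fromisoformat(e["ts"].replace("Z", "+00:00")).strftime("%Y-%m-%d %H:%M")
    for e in events
    if e["level"] == "ERROR"
)

# Counter({'2024-03-12 10:14': 1, '2024-03-12 10:15': 1})
print(errors_per_minute)
```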
Harnessing the power of this data is key to visibility into modern platforms and will accelerate the ability to troubleshoot and resolve issues much faster, ensuring that the platform meets three or four nines of availability and allowing the development team to focus on innovating rather than troubleshooting and fixing code.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company - with over 7,500 employees, more than 1,020 patents to date and availability in 21 regions around the world - and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.