As a principal engineer on the Splunk Real User Monitoring (RUM) team responsible for measuring and monitoring our service-level agreements (SLAs) and service-level objectives (SLOs), I depend on observability to measure, visualize and troubleshoot our services. Our key SLA is to guarantee that our services are available and accessible 99.9% of the time. Our application is as complex as any modern application: multiple micro-frontends backed by a shared GraphQL server that orchestrates requests across a broad range of microservices. Our user experience depends on roughly 850 million spans per minute from customer apps being ingested into our pipelines and processed downstream quickly enough to surface insights to customers in our UI. We are committed to our SLAs and SLOs, and we need to be alerted promptly when we miss them so we can take swift remedial action.
Here is how we used Splunk Observability Cloud to detect a critical incident in production and analyze the root cause in a matter of minutes.
The incident was discovered when our on-call engineer received alerts from 2 sources:
1. The RUM detector for GraphQL errors fired an alert indicating that the number of failed GraphQL requests was above our acceptable threshold (a sketch of this kind of detector appears after this list):
Alert fired by the RUM detector when RUM GraphQL errors went above the defined threshold
2. The real browser checks configured in Splunk Synthetics, which periodically test the RUM UI, triggered an alert after 3 consecutive failures:
Alert received from Splunk Synthetics after 3 consecutive real browser check failures
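For readers who want to build something similar, a detector like the one in item 1 boils down to a short SignalFlow program created through the Splunk Observability Cloud API. The sketch below is a minimal illustration only; the metric name, dimension, threshold, realm and token are hypothetical placeholders, not our production configuration.

```python
import requests

# Minimal sketch of creating a detector through the Splunk Observability Cloud
# REST API (POST /v2/detector). The metric name, dimension, threshold, realm
# and token below are hypothetical placeholders, not our production setup.
REALM = "us1"
API_TOKEN = "YOUR_ORG_API_TOKEN"

# SignalFlow program: count failed GraphQL requests per environment and fire
# when the error count stays above the threshold for five minutes.
program_text = (
    "errors = data('rum.graphql.error.count').sum(by=['sf_environment'])\n"
    "detect(when(errors > 100, lasting='5m')).publish('rum-graphql-errors')"
)

detector = {
    "name": "RUM GraphQL errors above threshold",
    "programText": program_text,
    "rules": [
        {
            "detectLabel": "rum-graphql-errors",
            "severity": "Critical",
            "notifications": [],  # e.g. PagerDuty or Slack integrations
        }
    ],
}

resp = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-TOKEN": API_TOKEN},
    json=detector,
)
resp.raise_for_status()
print("Created detector:", resp.json()["id"])
```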
The alerts themselves provided the information we needed to understand the actual problem. Drilling into the failed test run in Splunk Synthetics also let us visualize the browser requests triggered during that specific run, and it became evident that every GraphQL request in the run had failed with a 403 response code.
Browser requests for the failed run viewed in Splunk Synthetics
The next step was to identify the scale of the issue so we could update our customer-facing status page. We pivoted immediately to Splunk RUM’s Tag Spotlight experience, where we could view aggregate error counts on our GraphQL endpoint without scouring raw data or crafting complex queries. Tag Spotlight broke the errors down across several dimensions such as environment, HTTP status code and application, and it confirmed that our other production environments were stable and that the problem was specific to the us1 environment.
Splunk RUM Tag Spotlight experience showing 403 errors for GraphQL API calls in the “us1” environment
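The breakdown Tag Spotlight renders can also be approximated programmatically by running a SignalFlow computation that groups error counts by the same dimensions. The sketch below uses the open-source signalfx Python client; the metric and dimension names are hypothetical placeholders rather than the exact names Splunk RUM uses, and the realm and token are examples only.

```python
import signalfx
from signalfx.signalflow import messages

# Sketch: approximate the Tag Spotlight breakdown with a SignalFlow
# computation that groups RUM error counts by environment and status code.
program = """
errors = data('rum.graphql.error.count').sum(by=['sf_environment', 'http_status_code'])
errors.publish('errors_by_env_and_status')
"""

with signalfx.SignalFx(
    stream_endpoint="https://stream.us1.signalfx.com"
).signalflow("YOUR_ORG_API_TOKEN") as flow:
    # With no start/stop, this streams realtime results until interrupted.
    computation = flow.execute(program)
    for msg in computation.stream():
        # Each DataMessage carries one value per (environment, status code) group.
        if isinstance(msg, messages.DataMessage):
            print(msg.logical_timestamp_ms, msg.data)
```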
The immediate question was what was going on with our GraphQL server: was the server itself failing, or was something upstream of it to blame?
To get a big-picture view of our back-end services, we opened Splunk APM’s dependency map and explored all services upstream and downstream of our shared GraphQL server. It quickly became clear that the problem was with our shared internal gateway service and not with the GraphQL server itself.
Splunk APM’s dependency map showing errors in the shared gateway service (upstream of GraphQL server)
Additionally, Splunk APM surfaced example traces that pointed to the specific components within the gateway that were returning 403 Forbidden errors. Those traces made it easy to escalate the issue to the team that owned the internal gateway service, eliminating the need to search for a needle in a haystack.
Splunk APM’s trace view displaying a trace corresponding to the 403 error
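The reason the failing component was so easy to pinpoint is that every hop in the request path is instrumented, so a 403 is recorded on a specific span and rolls up into the dependency map and error traces. The snippet below is a rough sketch of what that looks like with OpenTelemetry in Python; the handler, helper and attribute values are illustrative, not the gateway’s actual code.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("internal-gateway")  # illustrative tracer name


def is_authorized(request: dict) -> bool:
    # Hypothetical stand-in for the gateway's real authorization check.
    return bool(request.get("token"))


def handle_proxy_request(request: dict) -> dict:
    # Each hop gets its own span, so a failure is attributed to a specific
    # service and operation in the APM dependency map and example traces.
    with tracer.start_as_current_span("gateway.authorize") as span:
        if not is_authorized(request):
            span.set_attribute("http.status_code", 403)
            # Marking the span as an error is what makes it count toward the
            # gateway's error rate and appear in example error traces.
            span.set_status(Status(StatusCode.ERROR, "403 Forbidden at gateway"))
            return {"status": 403, "body": "Forbidden"}
        span.set_attribute("http.status_code", 200)
        return {"status": 200, "body": "proxied upstream"}  # simplified
```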
We partnered with the owning team to prioritize a stopgap solution as quickly as possible. Once it was rolled out, Splunk Observability Cloud also let us validate that the incident was fully resolved.
Splunk Synthetics showed successful test runs in us1 after the fix was rolled out.
Splunk Synthetics displaying successful test runs after a fix was rolled out
Splunk RUM’s Tag Spotlight page began to report only 200 HTTP status codes for all /rum GraphQL requests, which was a huge relief!
Splunk RUM Tag Spotlight displaying 200 HTTP status code for all GraphQL requests after the issue was resolved
As an engineering team that owns products used by real customers and frequently ships features and enhancements to production, we’ve invested in monitoring our application with Splunk Observability Cloud. Those investments have had a large return and continue to help us meet our SLAs and SLOs.
The nice thing about starting our observability journey was that we could start small, focusing on the critical aspects of our application and refining our methods as we observed the system in production. With incremental effort, we built up a shared, comprehensive understanding of our system’s architecture, health and performance over time.
Observability with Splunk Observability Cloud also made our post-incident reviews more accurate, because everyone involved could examine documented records of real-time system behavior instead of piecing events together from siloed, individual sources. This data-driven view helped our teams understand why incidents occurred so we could better prevent and handle them in the future.
If you’re interested in empowering your teams to be data-driven and optimizing performance and productivity, try Splunk Observability Cloud today.