Announcing the General Availability of Splunk Real User Monitoring (RUM)

By Mat Ball

COVID-19 has made website performance and user experience more critical than ever. Even in ordinary times, slow-loading content, poor interactivity, visual instability, and errors hurt user engagement and conversion. Today, we’re proud to announce the general availability of Splunk Real User Monitoring (RUM) to help you troubleshoot customer-facing issues faster and improve your user experience. Along with Splunk Synthetic Monitoring, Splunk APM, Splunk Infrastructure Monitoring, Splunk Log Observer, and Splunk On-Call, Splunk RUM helps measure and improve the impact of code on your customers to deliver better digital experiences.

Quickly Troubleshoot Customer-Facing Issues

Depending on your company’s incident response protocol and the severity of an issue, there are probably different approaches to production incidents versus general support tickets from frustrated users. Regardless of severity, Splunk RUM helps on-call engineers and service owners quickly find and fix latency, errors, and anomalies that negatively impact user experience.

End-To-End Visibility To Find Dependencies Impacting End User Experience

Since the end user experience is a by-product of time spent in both web browsers and backend services, Splunk RUM gives you end-to-end visibility of every service, component, resource, and third-party dependency. Start at the RUM overview page to get a general summary of user experience. Page performance details show the 90th percentile of your page loads, errors across your frontend and backend services, Core Web Vitals, and other metrics critical to understanding how end users perceive application performance.

From the overview page, click into any URL to begin troubleshooting. In this case we’ll select /cart/*. A high latency value for /cart/* indicates that an end user might have a poor experience because of slow endpoints of the type /cart/* (in this instance, /cart/checkout).

This takes us to Tag Spotlight, where we can see which tags and dimensions are correlated with high latency values for /cart/*. Percentiles (p50, p90, p99) help quantify how many user sessions are affected by the slow endpoint, and are shown alongside dimensional information about your user sessions including cities, countries, web browsers, device types, and information about HTTP calls sent or received. You can click any of the tag values to apply filters on the data and understand what kinds of users are affected.
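To make the percentile readout concrete, here is a minimal sketch of how p50, p90, and p99 can be derived from a set of latency samples. It uses the nearest-rank method on made-up values; the function name and the sample data are illustrative assumptions, not a description of how Splunk RUM computes its percentiles internally.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical /cart/* latencies in milliseconds (illustrative data only).
latencies_ms = [120, 135, 150, 160, 180, 210, 240, 900, 2500, 7200]

latency_summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(latency_summary)
```

In this sample the median looks healthy while p90 and p99 are far higher, which is the pattern to look for when only a subset of sessions hits the slow endpoint.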

By clicking into “Sessions” you can find individual sessions where users are affected by the slow endpoint, based on the filters you have applied. Click into an individual sessionID to show the waterfall view of full page loads, including all page resources and third-party dependencies.

Clicking into a session ID gives you the waterfall view of the entire user session, with the specific call to /cart/checkout highlighted and expanded. Hovering over latency shows you the end-to-end breakdown of that span. In this case TTFB (time to first byte), a measurement of time spent in the web server (backend services), makes up nearly all of the session. This strongly suggests that the problem lies in a backend service.
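The reasoning in this step can be sketched as a simple ratio check: if waiting for the first byte accounts for most of a request's duration, the backend is the likely culprit. The class, the field names, and the 80% cutoff below are all hypothetical choices for illustration; they are not part of Splunk RUM.

```python
from dataclasses import dataclass


@dataclass
class SpanTiming:
    """Hypothetical timing breakdown for one request span, in milliseconds."""
    ttfb_ms: float      # time spent waiting for the first byte from the server
    download_ms: float  # time spent receiving the rest of the response

    @property
    def total_ms(self) -> float:
        return self.ttfb_ms + self.download_ms


def likely_bottleneck(span, threshold=0.8):
    """Return 'backend' when TTFB dominates the span, otherwise 'frontend/network'.

    The 0.8 threshold is an arbitrary illustrative cutoff, not a standard value.
    """
    if span.total_ms <= 0:
        raise ValueError("span has no duration")
    share = span.ttfb_ms / span.total_ms
    return "backend" if share >= threshold else "frontend/network"


# Made-up numbers resembling the /cart/checkout case described above.
checkout = SpanTiming(ttfb_ms=4800.0, download_ms=200.0)
verdict = likely_bottleneck(checkout)
print(verdict)
```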

To further answer the question, “Is this problem a frontend or a backend issue?”, Splunk RUM stitches together frontend and backend traces with an end-to-end view of user activity. In this instance, since this POST request produces some activity within the backend, RUM provides an “APM” hyperlink next to this span. Hovering over “APM” provides a performance summary for the critical backend components.

Clicking into “Workflow Name” provides a service map, where we can visualize latency across different components and services in our backend architecture. Very clearly, we can see that the slowness is correlated with errors on the ‘paymentservice’.

Hovering over any component of a service map will provide additional details to help you isolate latency, errors, and anomalies. Here we’re viewing the payment service’s errors, requests, and latency across the entire workflow, complete with entity tags.

Finally, you can also click on ‘traceid’ in the modal that appears when you hover on the ‘APM’ hyperlink in the session waterfall. This takes you to the exact backend trace that was generated when /cart/checkout was called. It is critical to see the exact backend trace (and not a sampled trace), so that you can understand the exact reason behind slow performance.

Going through the trace, you can clearly see that the checkoutservice calls the paymentservice several times, and each time the paymentservice returns a 401 status code (due to invalid credentials), until the request finally times out with an ‘Invalid request’ response.

As shown, from a general understanding of page performance to viewing the health of backend services, Splunk RUM helps you quickly understand which components and dependencies across your entire distributed system impact your end user experience.

High Cardinality to Pinpoint Problems Faster

For general troubleshooting, Splunk’s near-infinite cardinality helps you quickly correlate issues to find the root cause faster, even across complex distributed systems with hundreds of dependencies. Easily view general end user experience (RUM’s overview page), dig deeper into errors and latency in end-to-end traces (APM and RUM), and sort through individual logs in Splunk Log Observer.

Benchmark and Improve Page Performance

For measuring and improving frontend performance, Splunk RUM is unique among RUM solutions. To help you understand your user journey and the performance of all your dynamic components, Splunk RUM measures individual browser-resource interactions across entire user sessions. This helps quantify performance in frontend frameworks where a single page may interact with tens or hundreds of resources (internal or third-party) to dynamically load content.

Splunk RUM is open-source friendly, framework agnostic, and supports OpenTelemetry, so on-call engineers can access unbroken, standardized traces for transactions spanning from web browsers through database calls and even third-party dependencies.

Core Web Vitals And Long Tasks

Splunk RUM’s overview page puts Google’s Core Web Vitals, key W3C timings, and long tasks front and center.

Here is a brief overview of the three Core Web Vitals measurements:

Largest Contentful Paint (LCP), a measurement of when your largest content displays, helps you understand when end users first see marquee images or content (e.g., nav bar, video player).
First Input Delay (FID), a measurement of page interactivity, helps you understand how much delay exists between end-user interactions (clicks) and the web browser’s response.
Cumulative Layout Shift (CLS), a score representing your page’s visual stability, helps you understand how end users experience sudden, unexpected shifts in your page’s layout (typically resulting from content that resizes dynamically).
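As a rough illustration of how these three scores are usually interpreted, the sketch below rates a page against the "good" and "poor" boundaries that Google publishes for each Core Web Vital. The function name and the sample page values are hypothetical, and this is not how Splunk RUM itself labels pages.

```python
# Google's published boundaries for each Core Web Vital: (good up to, poor above).
THRESHOLDS = {
    "LCP": (2500, 4000),  # milliseconds
    "FID": (100, 300),    # milliseconds
    "CLS": (0.1, 0.25),   # unitless layout-shift score
}


def rate(metric, value):
    """Classify one measurement as 'good', 'needs improvement', or 'poor'."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"


# Hypothetical measurements for a single page (illustrative data only).
page = {"LCP": 3100, "FID": 80, "CLS": 0.31}
ratings = {metric: rate(metric, value) for metric, value in page.items()}
print(ratings)
```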

Starting this summer, Google will include a page’s Core Web Vitals scores as a component of how it ranks pages. You can also find Core Web Vitals measurements within each waterfall chart breakdown, and marked within each user sessionID.

Long tasks are JavaScript tasks that block the web browser's main thread, causing the browser’s UI to freeze, thereby making the web page unresponsive to user inputs. Splunk RUM shows you long task length (how long the interactivity was blocked), and long task count (how many long tasks per page).
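The two long task measurements can be sketched as follows, using the W3C Long Tasks API definition of a long task as one that runs longer than 50 ms on the main thread. The function name and the task durations are made up for illustration.

```python
LONG_TASK_THRESHOLD_MS = 50  # W3C Long Tasks API: tasks exceeding 50 ms count as long


def summarize_long_tasks(task_durations_ms):
    """Return the count, total length, and longest length of long tasks on a page."""
    long_tasks = [d for d in task_durations_ms if d > LONG_TASK_THRESHOLD_MS]
    return {
        "count": len(long_tasks),
        "total_ms": sum(long_tasks),
        "longest_ms": max(long_tasks, default=0),
    }


# Hypothetical main-thread task durations for one page load, in milliseconds.
durations_ms = [12, 48, 75, 230, 50, 110]
long_task_summary = summarize_long_tasks(durations_ms)
print(long_task_summary)
```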

Time to First Byte (TTFB), a measurement of server-side responsiveness, is how long the web server took to deliver the first byte of a page requested by the client’s web browser. TTFB helps you quickly compare latency for your backend services alongside client-side performance.

Splunk Digital Experience Monitoring (DEM) And Synthetic Monitoring

Improving page performance requires both lab and field data. Splunk RUM is your field data from actual users on any given combination of web browser, Internet Service Provider, and device. Splunk Synthetic Monitoring, formerly Rigor, provides you with a controlled lab environment to proactively test, measure, alert, and improve your web experience, even before you go live.

Splunk Synthetic Monitoring offers the deepest web optimization capabilities of any APM or Observability suite. Our best-in-class solution helps you optimize page performance with the following abilities:

Connect endpoint uptime with Splunk APM, Infrastructure Monitoring, On-Call, Core and Enterprise
Benchmark performance with Lighthouse scores and 50+ modern performance metrics (including Web Vitals)
View your actual user-experience with filmstrips and screen recordings on web and mobile
Automatically pass/fail builds in Jenkins pipelines per your performance budgets
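To illustrate the idea behind the last item, here is a minimal sketch of a performance budget check that a build step could run: compare measured values against agreed limits and fail when any limit is exceeded. The metric names, values, and function are hypothetical and do not represent the Splunk Synthetic Monitoring or Jenkins APIs.

```python
def check_budgets(measured, budgets):
    """Return a list of human-readable violations; an empty list means the build passes."""
    violations = []
    for metric, limit in budgets.items():
        value = measured.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement available")
        elif value > limit:
            violations.append(f"{metric}: {value} exceeds budget {limit}")
    return violations


# Hypothetical budget and measurements for one page (illustrative data only).
budgets = {"lcp_ms": 2500, "total_bytes": 1_500_000, "requests": 80}
measured = {"lcp_ms": 2300, "total_bytes": 1_850_000, "requests": 64}

violations = check_budgets(measured, budgets)
exit_code = 1 if violations else 0  # a CI step would exit with this code
print(violations, exit_code)
```

A pipeline step would typically exit with a non-zero code when the list is non-empty, which is what marks the build as failed.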

Deliver Better End User Experience

Splunk RUM quickly shows you latency across your entire stack, from the end user experience of dynamically loading content to backend services and errors or latency from database calls. Connected to the larger Splunk platform, it helps you rapidly measure user experience and understand the impact your entire architecture has on your end users.

Sign up for a free trial, or go to the Splunk Real User Monitoring product page for more!
