COVID-19 has made website performance and user experience more critical than ever. Slow-loading content, poor interactivity, visual instability, and errors all hurt user engagement and conversion. Today, we’re proud to announce the general availability of Splunk Real User Monitoring (RUM) to help you troubleshoot customer-facing issues faster and improve your user experience. Along with Splunk Synthetic Monitoring, Splunk APM, Splunk Infrastructure Monitoring, Splunk Log Observer, and Splunk On-Call, Splunk RUM helps you measure and improve the impact of code on your customers to deliver better digital experiences.
Depending on your company’s incident response protocol and the severity of an issue, there are probably different approaches to production incidents versus general support tickets from frustrated users. Regardless of severity, Splunk RUM helps on-call engineers and service owners quickly find and fix latency, errors, and anomalies that negatively impact user experience.
Since the end user experience is a by-product of time spent in both web browsers and backend services, Splunk RUM gives you end-to-end visibility of every service, component, resource, and third-party dependency. Start at the RUM overview page to get a general summary of user experience. Page performance details show the 90th percentile of your page loads, errors across your frontend and backend services, Core Web Vitals, and other metrics critical to understanding how end users perceive application performance.
From the overview page, click into any URL to begin troubleshooting. In this case, we’ll select /cart/*. A high latency value for /cart/* indicates that end users might have a poor experience because of slow endpoints matching /cart/* (in this instance, /cart/checkout).
This takes us to Tag Spotlight, where we can see which tags and dimensions are correlated with high latency values for /cart/*. Percentiles (p50, p90, p99) help quantify how many user sessions are affected by the slow endpoint, and are shown alongside dimensional information about your user sessions, including cities, countries, web browsers, device types, and information about HTTP calls sent or received. You can click any of the tag values to apply filters to the data and understand what kinds of users are affected.
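To make the percentile breakdown concrete, here is a minimal sketch of grouping latency samples by one tag and summarizing each group at p50, p90, and p99. It uses the nearest-rank percentile method, which is an assumption for illustration and not necessarily how Splunk RUM computes its values. The session records, the `browser` tag, and the `latency_ms` field are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_tag(sessions, tag):
    """Group latency samples by the value of one tag and summarize each group."""
    groups = defaultdict(list)
    for session in sessions:
        groups[session[tag]].append(session["latency_ms"])
    return {
        value: {p: percentile(samples, p) for p in (50, 90, 99)}
        for value, samples in groups.items()
    }


# Hypothetical sample data, made up for illustration.
sessions = [
    {"browser": "Chrome", "latency_ms": 120},
    {"browser": "Chrome", "latency_ms": 150},
    {"browser": "Chrome", "latency_ms": 4200},
    {"browser": "Safari", "latency_ms": 110},
    {"browser": "Safari", "latency_ms": 130},
]
summary = latency_by_tag(sessions, "browser")
print(summary)
```

In this made-up data set, the Chrome group stands out at p90, which is the kind of signal that would prompt you to filter on that tag value.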
By clicking into “Sessions” you can find individual sessions where users are affected by the slow endpoint, narrowed by whichever filters you have applied. Click into an individual session ID to show the waterfall view of full page loads, including all page resources and third-party dependencies.
The waterfall covers the entire user session, and it highlights and expands the specific call to /cart/checkout. Hovering over the latency value shows you the end-to-end breakdown of that span. In this case, TTFB (time to first byte), a measurement of time spent in the web server (backend services), makes up nearly all of that time. This strongly suggests that the problem lies in a backend service.
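The reasoning above can be sketched as a small calculation: what share of a request's total time was spent waiting for the first byte? The parameter names are modeled on the browser's Resource Timing attributes (`requestStart`, `responseStart`, `responseEnd`). The timing values and the 0.8 cutoff are hypothetical, chosen only for illustration.

```python
def ttfb_share(request_start_ms, response_start_ms, response_end_ms):
    """Return the fraction of a request's total time spent waiting for the first byte."""
    total = response_end_ms - request_start_ms
    if total <= 0:
        raise ValueError("response_end_ms must be later than request_start_ms")
    ttfb = response_start_ms - request_start_ms
    return ttfb / total


# Hypothetical timings for a slow request: 4750 ms waiting, 250 ms downloading.
share = ttfb_share(request_start_ms=0, response_start_ms=4750, response_end_ms=5000)

# Hypothetical cutoff: if most of the time is spent waiting, look at the backend first.
likely_backend = share > 0.8
print(f"TTFB share: {share:.0%}, investigate backend first: {likely_backend}")
```

Keep in mind that the wait for the first byte also includes network round trips, so a high share is a strong hint to check the backend, not proof.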
To help answer the question “Is this problem a frontend or a backend issue?”, Splunk RUM stitches together frontend and backend traces into an end-to-end view of user activity. In this instance, since this POST request produces activity within the backend, RUM provides an “APM” hyperlink next to this span. Hovering over “APM” provides a performance summary for the critical backend components.
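This post does not describe how a browser span gets linked to its backend trace. One standard way to carry trace identity between systems is the W3C Trace Context `traceparent` value, which packs a version, a trace ID, a parent span ID, and flags into one string. The sketch below assumes such a value is available to the frontend and shows only the parsing step; the function name and the returned dictionary are hypothetical.

```python
import re

# W3C Trace Context layout: version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex)
TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Parse a traceparent string; return None if it is malformed."""
    match = TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


context = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(context)
```

With the trace ID in hand, a frontend span and the backend trace it triggered can be looked up under the same identifier.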
Clicking into “Workflow Name” provides a service map, where we can visualize latency across different components and services in our backend architecture. Very clearly, we can see that the slowness is correlated with errors on the ‘paymentservice’.
Hovering over any component of a service map will provide additional details to help you isolate latency, errors, and anomalies. Here we’re viewing the payment service’s errors, requests, and latency across the entire workflow, complete with entity tags.
Finally, you can also click on ‘traceid’ in the modal that appears when you hover over the ‘APM’ hyperlink in the session waterfall. This takes you to the exact backend trace that was generated when /cart/checkout was called. Seeing the exact backend trace (and not a sampled trace) is critical for understanding the exact reason behind slow performance.
Going through the trace, you can clearly see that the checkoutservice calls the paymentservice several times. Each call returns a 401 status code (due to invalid credentials), until the request finally times out with an ‘Invalid request’ response.
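The pattern in this trace, one service calling another repeatedly and getting the same error back, is easy to spot programmatically too. Here is a minimal sketch that counts error responses per caller, callee, and status code. The span records and their field names (`caller`, `callee`, `status`) are hypothetical and do not reflect Splunk APM's actual trace format.

```python
from collections import Counter


def repeated_failures(spans, min_count=2):
    """Count error responses per (caller, callee, status) and keep only the repeats."""
    counts = Counter(
        (span["caller"], span["callee"], span["status"])
        for span in spans
        if span["status"] >= 400
    )
    return {edge: n for edge, n in counts.items() if n >= min_count}


# Hypothetical spans, loosely modeled on the trace described above.
trace = [
    {"caller": "frontend", "callee": "checkoutservice", "status": 200},
    {"caller": "checkoutservice", "callee": "paymentservice", "status": 401},
    {"caller": "checkoutservice", "callee": "paymentservice", "status": 401},
    {"caller": "checkoutservice", "callee": "paymentservice", "status": 401},
]
repeats = repeated_failures(trace)
print(repeats)
```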
As this walkthrough shows, from a general understanding of page performance to viewing the health of backend services, Splunk RUM helps you quickly understand which components and dependencies across your entire distributed system impact your end user experience.
For general troubleshooting, Splunk’s near-infinite cardinality helps you quickly correlate issues to find the root cause faster, even across complex distributed systems with hundreds of dependencies. Easily view general end user experience (RUM’s overview page), dig deeper into errors and latency in end-to-end traces (APM and RUM), and sort through individual logs in Splunk Log Observer.
For measuring and improving frontend performance, Splunk RUM differs from other RUM solutions. To help you understand your user journey and the performance of all your dynamic components, Splunk RUM measures individual browser-resource interactions across entire user sessions. This helps quantify performance in frontend frameworks where a single page may interact with tens or hundreds of resources (internal or third-party) to dynamically load content.
Splunk RUM is open-source friendly, framework agnostic, and supports OpenTelemetry, so on-call engineers can access unbroken, standardized traces for transactions spanning from web browsers through database calls and even third-party dependencies.
Splunk RUM’s overview page gives a central place to Google’s Core Web Vitals, key W3C timings, and long tasks.
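The three Core Web Vitals measurements are Largest Contentful Paint (LCP), which measures loading performance; First Input Delay (FID), which measures interactivity; and Cumulative Layout Shift (CLS), which measures visual stability.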
Starting this summer, Google will include a page’s Core Web Vitals scores as a factor in how it ranks pages. You can also find Core Web Vitals measurements within each waterfall chart breakdown, and marked within each user session.
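To give a feel for how Core Web Vitals scores are read, here is a minimal sketch that rates a measurement as good, needs improvement, or poor. The thresholds are the ones Google has published for Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS); they are not stated in this post, so treat them as assumptions to verify against Google's current guidance. The function name and sample values are hypothetical.

```python
# (good_up_to, needs_improvement_up_to) per metric; anything above the second value is "poor".
THRESHOLDS = {
    "LCP": (2500, 4000),  # milliseconds
    "FID": (100, 300),    # milliseconds
    "CLS": (0.1, 0.25),   # unitless layout-shift score
}


def rate(metric, value):
    """Classify one Core Web Vitals measurement against the assumed thresholds."""
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"


# Hypothetical measurements for one page.
page = {"LCP": 2100, "FID": 180, "CLS": 0.31}
ratings = {metric: rate(metric, value) for metric, value in page.items()}
print(ratings)
```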
Long tasks are JavaScript tasks that block the web browser's main thread, causing the browser’s UI to freeze, thereby making the web page unresponsive to user inputs. Splunk RUM shows you long task length (how long the interactivity was blocked), and long task count (how many long tasks per page).
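The two long task figures described above can be sketched in a few lines. The browser's Long Tasks API treats any task that runs longer than 50 ms as a long task, and that cutoff is used here. The task durations are hypothetical, and the summary fields are illustrative names, not Splunk RUM's own.

```python
LONG_TASK_THRESHOLD_MS = 50  # cutoff used by the browser's Long Tasks API


def summarize_long_tasks(task_durations_ms):
    """Return the count and total blocking time of tasks longer than the threshold."""
    long_tasks = [d for d in task_durations_ms if d > LONG_TASK_THRESHOLD_MS]
    return {
        "count": len(long_tasks),
        "total_ms": sum(long_tasks),
        "longest_ms": max(long_tasks, default=0),
    }


# Hypothetical main-thread task durations for one page, in milliseconds.
summary = summarize_long_tasks([12, 48, 75, 230, 51])
print(summary)
```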
Time to First Byte (TTFB) is a measurement of server-side responsiveness: how long the web server took to deliver the first byte of a page requested by the client’s web browser. TTFB helps you quickly compare the latency of your backend services with client-side performance.
Improving page performance requires both lab and field data. Splunk RUM provides your field data from actual users on any given combination of web browser, Internet Service Provider, and device. Splunk Synthetic Monitoring, formerly Rigor, provides you with a controlled lab environment to proactively test, measure, alert, and improve your web experience, even before you go live.
Splunk Synthetic Monitoring offers the deepest web optimization capabilities of any APM or Observability suite. Our best-in-class solution helps you optimize page performance with the following abilities:
Splunk RUM quickly shows you latency across your entire stack, from the end user experience of dynamically loading content to backend services and the errors or latency of database calls. Connected to the larger Splunk platform, it helps you rapidly measure user experience and understand the impact your entire architecture has on your end users.
Sign up for a free trial, or go to the Splunk Real User Monitoring product page for more!