Most front-end developers and practitioners are familiar with real user monitoring (RUM) tools as a means to understand how end-users perceive the performance of applications. Few people, however, are aware of the history of the RUM market, which goes back more than two decades. Over the years, as the internet has evolved with new technologies, RUM tools have evolved in lockstep to cater to the ever-changing needs and use cases of engineering teams. In this post, we aim to trace this history, and argue that we are again on the precipice of change, where legacy RUM tools are no longer good enough for new users and their needs.
The nascent days of the web were characterized by a lack of standards in front-end architectures. Netscape and Microsoft were engaged in fierce browser wars, and each pushed its preferred way of building dynamic experiences, interactions, and animations in web content.
In 1995, Netscape created a scripting language, initially code-named Mocha, to make the web more approachable for web designers. After partnering with Sun Microsystems, Netscape renamed the language JavaScript and shipped it in Navigator. Microsoft, meanwhile, built its own variant, called JScript, for Internet Explorer, and the two implementations diverged in ways that frustrated developers.
This lack of standardization had a couple of implications. Firstly, JavaScript did not gain mass traction amongst web developers. Secondly, developers were further incentivized to build apps that rendered HTML on the server side, with little interactivity or dynamic content on the client side.
These enterprise apps were monoliths hosted in corporate data centers. They were designed as a collection of discrete pages, in an architecture known as an MPA (multi-page application).
Netscape Navigator, from the earliest stages of web monitoring
Since the pages were rendered server-side, user experience critically depended on server processing time and network latency (from server to client). RUM tools of this era measured document loads and page views, as well as server and network times for each page. The main RUM users in this era were ITOps and helpdesk teams.
Several technological shifts during roughly 2005-2010 led to the emergence of RUM 2.0 tools.
1. Client-side rendering began to gain popularity, for several reasons:
a. JavaScript rapidly gained traction with the developer community in the mid-2000s. The introduction of Ajax in 2005, and the standardization of JSON soon after, increased JavaScript's utility, especially for creating interactive and dynamic content.
b. Google Chrome's rapid growth (and its fast, fully featured JavaScript engine) made it viable for developers to reach a mass audience with interactive applications.
c. Hardware improvements (especially on mobile) allowed apps with rich interactive content and heavy JavaScript modules to run smoothly in the browser.
d. Emergence of third-party services, such as Google Maps, that were frequently invoked client-side.
RUM 1.0 tools, however, did not monitor client-side interactions at all.
2. Emergence of public cloud platforms: developers increasingly became responsible for deployment and maintenance, in addition to writing code. They needed tools that gave visibility into all parts of the stack (e.g., APM for the backend, RUM for the frontend, NPM for the network) instead of siloed RUM 1.0 tools.
With the advent of public clouds, customers started shifting components of their applications from the data center to the cloud. The application itself was still predominantly a monolith, even though microservices were slowly becoming popular on the backend. On the front end, this period saw the emergence of a variety of JavaScript frameworks, such as Angular and React. Many websites of this period, even when rendered mostly server-side, had interactive components that were rendered client-side with these frameworks.
The main user of RUM was still the ITOps team, and increasingly, the SRE.
1. Visibility into JavaScript: Client-side rendering using JavaScript meant that front-end developers needed to monitor the performance of the new code that ran in browsers. Use cases included aggregating JavaScript errors and identifying the slowest AJAX calls (a minimal sketch follows this list).
2. Single Pane of Glass: SREs wanted a single pane of glass for answering the question “If a user feels that the site is not working, who do I blame? Browser/end user? Network? Monolithic application server? Databases?”
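To make the first use case concrete, here is a minimal sketch of RUM 2.0-style client-side visibility, using today's browser APIs (global error events and the Resource Timing API) as stand-ins for the vendor-specific hooks of the era. The collector endpoint and the slowness threshold are hypothetical.

```typescript
// Hypothetical collector endpoint and slowness threshold, for illustration only.
const RUM_ENDPOINT = "https://rum.example.com/beacon";
const SLOW_XHR_MS = 1000;

function beacon(payload: object): void {
  // sendBeacon survives page unloads, unlike a plain XHR
  navigator.sendBeacon(RUM_ENDPOINT, JSON.stringify(payload));
}

// Use case 1: aggregate uncaught JavaScript errors
window.addEventListener("error", (e: ErrorEvent) => {
  beacon({ type: "js-error", message: e.message, source: e.filename, line: e.lineno });
});

// Use case 2: identify the slowest AJAX calls via the Resource Timing API
new PerformanceObserver((list) => {
  for (const entry of list.getEntries() as PerformanceResourceTiming[]) {
    if (entry.initiatorType === "xmlhttprequest" && entry.duration > SLOW_XHR_MS) {
      beacon({ type: "slow-xhr", url: entry.name, durationMs: entry.duration });
    }
  }
}).observe({ type: "resource", buffered: true });
```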
Since apps were mostly MPAs consisting of discrete pages, RUM tools continued to measure individual page views and document loads. Each document load generated some activity on the browser, some network activity, some transactions on the monolith, and finally some database queries. These transactions happened linearly, in sequence. The complexity of the system was relatively low, as each point in this chain made requests to a single (or a few) downstream components.
The sequential nature of these activities meant that RUM tools that measured the overall document-load time, and split it into server time, network time, DOM processing time, and page rendering time, were generally good enough.
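A minimal sketch of that breakdown, assuming the modern Navigation Timing API (a standardized successor to the measurement tricks RUM 2.0 tools relied on):

```typescript
// Split the overall document load into the four buckets a RUM 2.0
// tool would report, using the Navigation Timing API.
const [nav] = performance.getEntriesByType("navigation") as PerformanceNavigationTiming[];

const breakdown = {
  // DNS resolution plus connection establishment
  networkMs: nav.connectEnd - nav.domainLookupStart,
  // Time the server took to start responding after the request was sent
  serverMs: nav.responseStart - nav.requestStart,
  // Parsing the document and building the DOM
  domProcessingMs: nav.domContentLoadedEventEnd - nav.responseEnd,
  // From DOM ready until the load event completed
  renderMs: nav.loadEventEnd - nav.domContentLoadedEventEnd,
};
console.table(breakdown);
```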
A few trends started disrupting RUM 2.0 tools.
1. Single Page Applications: With the increasing popularity of JavaScript frameworks, SPAs became very popular. An SPA features an initial document load, followed by a series of API requests in the form of XHRs/fetches. On a route change, the document does not reload; the page is refreshed via XHRs.
a. Since the document load only happens once, page views/document loads are less relevant as units of measurement. Instead, RUM solutions need to measure the performance of the many API requests and interactions between the browser and its resources (i.e., XHRs/fetches), as in the sketch after this list.
b. Client-side code became much more complex, and more prone to unforeseen bugs, errors and performance issues.
2. Cloud-native application development and the rise of highly distributed applications: These caused a major shift in how applications are built, deployed, and operated. A modern application is a distributed system of services (or microservices), some built in-house and some consumed as third-party cloud services.
a. This explosion of complexity on the backend meant that if a transaction’s server processing time was too high, it was not trivial to say which sequence of operations on the backend led to the high response time.
b. APM 2.0 tools were not designed to capture inter-service delays at full fidelity. If RUM 2.0 tools indicated high server processing time in a cloud-native SPA, there was no way to identify the root cause; thus, the need for an end-to-end trace arose.
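As a concrete illustration of SPA-era measurement, here is a minimal sketch that wraps window.fetch so that every API request, not just the initial document load, is timed and reported. The collector endpoint is hypothetical, and a production agent would also patch XMLHttpRequest.

```typescript
// Hypothetical collector endpoint, for illustration only.
const RUM_ENDPOINT = "https://rum.example.com/beacon";
const originalFetch = window.fetch.bind(window);

// Wrap fetch so every SPA API request is individually timed.
window.fetch = async (input: RequestInfo | URL, init?: RequestInit): Promise<Response> => {
  const url = typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
  const start = performance.now();
  try {
    const response = await originalFetch(input, init);
    report(url, performance.now() - start, response.status);
    return response;
  } catch (err) {
    // In an SPA, failed requests matter as much as slow ones
    report(url, performance.now() - start, 0);
    throw err;
  }
};

function report(url: string, durationMs: number, status: number): void {
  navigator.sendBeacon(RUM_ENDPOINT, JSON.stringify({ url, durationMs, status }));
}
```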
In recent years, application development has been fast becoming fully cloud native. On the front end, apps are increasingly written in JavaScript frameworks such as React and Angular. The web page loads as a single document load, followed by multiple XHRs/fetches to a variety of resources. Very little rendering occurs on the server side.
Front ends are typically much more complex than before. They may use multiple JavaScript frameworks, depend on multiple third parties to work and perform correctly, and may bridge or touch multiple parts of a customer's business. The backend is composed of several loosely coupled microservices and serverless functions.
Splunk RUM’s overview page links modern user-experience metrics with backend system performance
The main users of RUM are now SREs and, increasingly, front-end developers.
1. Unit of Measurement: They need a tool that can measure individual browser-resource interactions, and not just document loads. As an example, consider the infinite-scrolling experience in Twitter: five minutes of scrolling content could generate hundreds of XHRs, any of which could be slow, but this would count as a single document load in a RUM 2.0 tool.
2. End-to-end Tracing: Cloud-native customers need to find the smoking gun when a problem occurs. The only way an SRE can claim with certainty that a problem originated in the backend is by looking at the exact backend trace as it propagates through a distributed backend. And they can only get this information if the backend tracing is done at full fidelity, without any sampling, and tied to the front-end activity, as sketched below.
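One standard mechanism for tying a browser request to its backend trace is the W3C Trace Context traceparent header, which OpenTelemetry-based agents propagate automatically. The hand-rolled sketch below is illustrative only, not how any particular vendor implements it.

```typescript
// Generate a random hex string of the given byte length using the Web Crypto API.
function randomHex(bytes: number): string {
  const buf = new Uint8Array(bytes);
  crypto.getRandomValues(buf);
  return Array.from(buf, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Attach a W3C Trace Context header so the backend trace shares this ID.
async function tracedFetch(url: string, init: RequestInit = {}): Promise<Response> {
  const traceId = randomHex(16); // 128-bit trace ID, the front-end/backend join key
  const spanId = randomHex(8);   // 64-bit ID of the browser-side span
  const headers = new Headers(init.headers);
  // The trailing "01" marks the trace as sampled; unsampled-by-policy tracing
  // would send it on every request so no anomaly is ever missed.
  headers.set("traceparent", `00-${traceId}-${spanId}-01`);

  const response = await fetch(url, { ...init, headers });
  // A RUM backend can now look up the unique backend trace with this traceId.
  console.log(`request ${url} -> backend trace ${traceId}`);
  return response;
}
```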
Splunk Real User Monitoring has been engineered to provide visibility into cloud-native applications whose front ends are complex, feature dozens of API calls to a variety of providers, and are typically single-page apps (or hybrid apps) written in a framework such as React or Angular.
Splunk is bringing the philosophy of unsampled distributed tracing to front-end monitoring. This ensures that SREs and front-end developers will never miss an anomaly, and will have visibility into every user interaction, every resource, and every XHR made by any end-user. If a customer deploys both Splunk RUM and Splunk APM, they will have complete visibility end-to-end: for any request made by a browser, they will be able to identify the unique backend trace that the browser request initiated. In other words, engineers will always have that smoking gun to answer the question: “If an end-user has a problem, is it a front-end issue, a backend issue, or something else?”
Splunk RUM seamlessly connects transactions from the front end to backend services
Splunk RUM is part of the Splunk Observability Cloud, which provides a single pane of glass for customers to gain unprecedented visibility into their infrastructure, applications, and logs. To find out more about Splunk Real User Monitoring, refer to this whitepaper and start a free trial today.
----------------------------------------------------
Thanks!
Shashwat Sehgal