Today we announced the latest major updates to our real-time monitoring and observability platform. Over the last few weeks, we have delivered many new features and integrations and, above all, major enhancements to Service Bureau, a unique set of capabilities that enables central Observability Teams to efficiently provide the entire organization with an easy-to-consume observability service.
The DevOps paradigm of “you build it, you own it” boosts agility in part by decentralizing operational responsibility to individual teams. As a result, more people across the organization now need access to monitoring, and this decentralization can easily lead to fragmentation of tools and data, potentially resulting in higher costs and, even worse, highly inefficient operations. For example, how do you coordinate incident response when each individual team in a company has different dashboard views, or even different names for the same metric?
Customers are therefore seeing the need for a centralized monitoring solution that provides a single source of truth for the company without sacrificing speed. Most monitoring solutions, however, were designed for a world in which operations were the responsibility of a small group of experts, and they lack the capabilities needed to enable monitoring-as-a-service. At SignalFx, on the other hand, we have made enabling monitoring-as-a-service a key strategic priority.
The following capabilities are critical to success when teams are operating a central observability service for an entire organization:
Centralized management of teams and permissions
Clearly defined and reusable best practices via standardized dashboards and alerts
Enablement of self-service consumption (empowering users to search for data they need, or to easily create and customize dashboards and alerts)
Programmability (the ability to make sophisticated ad-hoc queries or create charts and alerts at scale via API)
Detailed usage control and reporting
SignalFx has Service Bureau features to help teams succeed in each of these areas, which we’ll cover in more detail below.
Each team in an organization has a specific focus when it comes to observability data and content. This might be specific metrics related to the service that they monitor or manage, or a subset of data from standard metrics provided to the whole organization.
SignalFx users can be grouped into teams in order to connect specific monitoring content (e.g., dashboard groups and alert detectors) to specific users. The team landing page enables everyone on a team to view and quickly navigate to content that is marked as relevant to them.
This also makes it straightforward to define permissions on dashboard groups, dashboards, and detectors. It’s possible for teams to create dashboards that can only be edited by members of that team, or for administrators to protect dashboards and detectors that are monitoring critical or sensitive information.
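As a rough illustration of team-scoped edit permissions, the Python sketch below models a dashboard that only members of designated teams may edit. The class and field names (`Team`, `ProtectedDashboard`, `writer_teams`) are hypothetical and are not the SignalFx API; administrator overrides are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Team:
    """Hypothetical team: a name plus the set of member user names."""
    name: str
    members: set = field(default_factory=set)


@dataclass
class ProtectedDashboard:
    """Hypothetical dashboard whose edits can be restricted to certain teams."""
    name: str
    # Assumption for this sketch: an empty list means the dashboard is unrestricted.
    writer_teams: list = field(default_factory=list)

    def can_edit(self, user: str) -> bool:
        """Return True if the user belongs to any team allowed to edit."""
        if not self.writer_teams:
            return True
        return any(user in team.members for team in self.writer_teams)


payments = Team("payments", {"ana"})
slo_dashboard = ProtectedDashboard("Payments SLOs", writer_teams=[payments])
print(slo_dashboard.can_edit("ana"))  # member of the writer team
print(slo_dashboard.can_edit("bo"))   # not a member
```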
SignalFx uses tokens to authenticate and track usage. These access tokens not only provide users with access to the SignalFx API and the ability to send data to SignalFx – they also enable administrators to configure integrations, manage users and teams, and control the volume of data that can be sent by specifying limits for each access token. This ability to efficiently set up permissions and usage controls across the organization is essential for teams delivering monitoring as a service.
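The idea of a per-token volume limit can be sketched in a few lines of Python. This is an illustrative model only: `AccessToken`, `datapoints_per_minute_limit`, and `usage_report` are hypothetical names, and dropping data beyond the limit is an assumption made for this sketch rather than a description of how SignalFx behaves.

```python
from dataclasses import dataclass


@dataclass
class AccessToken:
    """Hypothetical token carrying a data-volume limit for one time window."""
    name: str
    datapoints_per_minute_limit: int
    used_this_minute: int = 0

    def accept(self, datapoints: int) -> int:
        """Accept as many datapoints as the remaining budget allows.

        Returns the number accepted; anything beyond the limit is dropped
        (an assumption of this sketch).
        """
        remaining = self.datapoints_per_minute_limit - self.used_this_minute
        accepted = max(0, min(datapoints, remaining))
        self.used_this_minute += accepted
        return accepted

    def reset_window(self) -> None:
        """Start a new one-minute window."""
        self.used_this_minute = 0


def usage_report(tokens):
    """Map each token name to the fraction of its limit used in this window."""
    return {t.name: t.used_this_minute / t.datapoints_per_minute_limit for t in tokens}
```

Issuing one token per team in this style gives a central team both a control point (the limit) and a reporting key (the token name).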
We’ve also made enhancements to the way that we show usage in SignalFx to aid in management reporting and transparency in our customers’ organizations. In some cases, this increased understanding of how users are leveraging SignalFx and the specific signals they are monitoring has led to cost savings relative to other monitoring solutions, with teams consolidating their metric usage by as much as 50%.
Mirrored Dashboards enable the broad distribution of dashboards that implement best practices by letting users create “mirrors” of them. Unlike an outright copy, a mirror automatically receives updates made to the dashboard, which maintains consistency and prevents the buildup of redundant or outdated content. These updates take place without removing the local filters or other customizations a user has specified for their mirror, so users can continue tailoring mirrors to their individual needs without affecting other users of the same dashboard.
This makes it easy for individuals to borrow dashboards created by subject matter experts, or for teams that work on interdependent services to achieve shared visibility without the friction of maintaining multiple individual dashboard copies.
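The difference between a mirror and a copy can be shown with a small Python model. The `Dashboard` and `Mirror` classes are hypothetical and only illustrate the behavior described above: a mirror reads from the shared dashboard and layers local filters on top, while a copy is frozen at the moment it was made.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Dashboard:
    """Hypothetical shared dashboard maintained by a central team."""
    name: str
    charts: list = field(default_factory=list)
    filters: dict = field(default_factory=dict)


@dataclass
class Mirror:
    """Hypothetical mirror: a reference to the source plus local overrides."""
    source: Dashboard
    local_filters: dict = field(default_factory=dict)

    def view(self) -> dict:
        """Render the mirror: current source content with local filters applied on top."""
        return {
            "name": self.source.name,
            "charts": list(self.source.charts),
            "filters": {**self.source.filters, **self.local_filters},
        }


standard = Dashboard("Service health", charts=["latency"], filters={"env": "prod"})
frozen_copy = copy.deepcopy(standard)                    # an outright copy
team_mirror = Mirror(standard, {"service": "checkout"})  # a mirror with a local filter

standard.charts.append("error rate")  # the central team updates the dashboard

print(team_mirror.view()["charts"])   # the mirror sees the new chart
print(frozen_copy.charts)             # the copy does not
```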
To support teams with a monitoring-as-code approach, we have a Terraform provider that codifies SignalFx charts, dashboards, and detectors. This makes it possible to programmatically create, manage, and version control them. This provider was originally created and maintained by our friends at Yelp for years before being taken over as an official SignalFx project.
The example below shows how you would create an alert detector in SignalFx using the Terraform provider:
resource "signalfx_detector" "application_delay" {
  count       = "${length(var.clusters)}"
  name        = " max average delay - ${var.clusters[count.index]}"
  description = "your application is slow - ${var.clusters[count.index]}"
  max_delay   = 30

  program_text = <<-EOF
    signal = data('app.delay', filter('cluster','${var.clusters[count.index]}'), extrapolation='last_value', maxExtrapolations=5).max()
    detect(when(signal > 60, '5m')).publish('Processing old messages 5m')
    detect(when(signal > 60, '30m')).publish('Processing old messages 30m')
  EOF

  rule {
    description   = "maximum > 60 for 5m"
    severity      = "Warning"
    detect_label  = "Processing old messages 5m"
    notifications = ["Email,foo-alerts@bar.com"]
  }

  rule {
    description   = "maximum > 60 for 30m"
    severity      = "Critical"
    detect_label  = "Processing old messages 30m"
    notifications = ["Email,foo-alerts@bar.com"]
  }
}

provider "signalfx" {}

variable "clusters" {
  default = ["clusterA", "clusterB"]
}
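To show what a rule such as "maximum > 60 for 5m" expresses, the Python sketch below evaluates a duration condition over a list of samples. It assumes that `when(signal > 60, '5m')` means the condition must hold continuously for five minutes; the function `held_for` is a hypothetical helper written for this illustration, not part of SignalFlow or the Terraform provider.

```python
def held_for(samples, threshold, duration_s):
    """Return timestamps at which value > threshold has held for duration_s.

    samples: iterable of (timestamp_seconds, value) pairs, sorted by timestamp.
    A single sample at or below the threshold resets the clock.
    """
    fired = []
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts          # the breach begins here
            if ts - breach_start >= duration_s:
                fired.append(ts)           # condition has held long enough
        else:
            breach_start = None            # condition broken, start over
    return fired


# One sample per minute: six breaching samples, a dip, then a new breach.
samples = [(0, 70), (60, 70), (120, 70), (180, 70), (240, 70), (300, 70), (360, 50), (420, 70)]
print(held_for(samples, threshold=60, duration_s=300))
```

Under this reading, the two rules in the detector differ only in how long the breach must last before the Warning or Critical notification is sent.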
For most monitoring solutions, searching for metrics typically requires the user to know the specific name of a metric, or to search across a particular hierarchy. This makes it difficult to perform ad-hoc queries or build new charts and dashboards, and creates a worse experience for users who are less familiar with metric names, or were not originally involved in instrumenting those metrics.
Metric searches in SignalFx take into account the metadata that users submit alongside their data (e.g., hostname, service, integration, customer ID) and common naming patterns to automatically suggest filters that can help narrow a search. Simply search by whatever you know (a metric’s name, the names of its attributes, or values like the service it comes from), and when you’ve found your metric, you can start building a new chart with one click. This is an improvement over our existing metrics search capabilities, and a step above the standard set by other data products, whose search can make finding the data you want even more confusing.
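A simplified Python model of metadata-aware search is sketched below. The catalog layout and the functions `search_metrics` and `suggest_filters` are hypothetical, chosen only to illustrate matching a search term against metric names, attribute names, and attribute values, then proposing common attribute values as filters.

```python
from collections import Counter


def search_metrics(catalog, term):
    """Return metrics whose name, dimension keys, or dimension values contain term."""
    term = term.lower()
    hits = []
    for metric in catalog:
        dims = metric["dimensions"]
        haystack = [metric["name"], *dims.keys(), *dims.values()]
        if any(term in str(item).lower() for item in haystack):
            hits.append(metric)
    return hits


def suggest_filters(hits, limit=3):
    """Suggest the most common key:value pairs among the hits as filters."""
    counts = Counter((k, v) for m in hits for k, v in m["dimensions"].items())
    return [f"{k}:{v}" for (k, v), _ in counts.most_common(limit)]


catalog = [
    {"name": "app.delay", "dimensions": {"service": "checkout", "cluster": "clusterA"}},
    {"name": "cpu.utilization", "dimensions": {"service": "checkout", "host": "web-1"}},
    {"name": "cpu.utilization", "dimensions": {"service": "search", "host": "web-2"}},
]

hits = search_metrics(catalog, "checkout")   # matched by attribute value, not metric name
print([m["name"] for m in hits])
print(suggest_filters(hits, limit=1))
```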
With all of this functionality combined, a central observability team can cater to a mix of skills and monitoring needs across teams in their organization. For example, one team may want to create custom monitoring content starting from scratch – the powerful metadata search and browse capabilities of the metric finder will be much appreciated by them. Another team may be sufficiently served by adding some mirror customizations on top of a standard dashboard provided by the central team. This balance between flexibility and control is a requirement for complex software organizations.
DevOps was born from the need for software organizations to respond to customer needs more quickly than they had ever done before, but this increased agility has created new challenges around how to effectively coordinate and manage observability. A common tool for monitoring and observability allows teams to decrease training burden, enable reuse, and improve quality, while also tracking usage to ensure that teams are seeing appropriate return on investment for their observability data.
Thanks,
Alberto Farronato and Aaron Sun