Chasing false alerts is annoying. Worse is the nightmare of your systems going down with no alerts or telemetry to give you a heads-up.
If you’ve experienced this, you’re not alone. Before joining Splunk, I spent 14 years as an observability practitioner and leader for several Fortune 500 companies. In my 2.5 years with Splunk, I have had the opportunity to work with customers of all shapes and sizes.
Whether you’re in a massive enterprise or a nimble startup, a consistent desire has arisen: a comprehensive and agile approach to observability.
Let’s dive in and talk about the Observability Center of Excellence (CoE). If you’re tired of the same old fragmented observability approach, the CoE is the answer you’ve been looking for. Why? Because this CoE approach will:
Let’s call it like it is: observability challenges in most organizations often extend beyond just the tools. Typically the problem lies in...:
The Observability Center of Excellence (CoE) offers a solution. It simplifies and unifies your observability efforts while providing a framework to continuously evolve your practice.
In my experience, these are the most common observability-related problems seen in organizations:
Unfortunately, many organizations struggle with a high volume of inconsistent, low-confidence alert noise. The resulting “boy who cried wolf” effect increases the likelihood that genuine alerts are ignored, lengthens response times (MTTR), and ultimately increases downtime for mission-critical IT offerings.
In addition, it’s not just about low trust in the alerts themselves. There’s often a lack of confidence in the systems generating them. If your telemetry is off or your systems aren’t set up to provide meaningful data, how can you trust anything?
For many engineers, managing observability tools is more of a side hustle than a main gig. They’ve got primary responsibilities to manage. Unfortunately, this often means that observability ends up as an afterthought, and observability instrumentation is added and tuned in the admin’s “free time.” This results in:
Ever had that realization as you prepare to join the war room call: “How are we (or are we even) monitoring this thing?” This is the reality a lot of the time: observability is treated as an afterthought. Post-deployment observability instrumentation also increases implementation complexity and risk.
Mature observability organizations shift observability left in the SDLC, with the goal being comprehensive observability at deployment/creation time. Using observability as code is one way to make this part of the normal rhythm of the business.
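To make "observability as code" a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `AlertRule` and `ServiceSpec` names, their fields, and the example threshold are assumptions for illustration and are not tied to any particular tool's API. The idea is simply that alert definitions live in version control next to the service definition, so a pre-deployment check can flag a service that would otherwise ship with no monitoring.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical alert definition kept in version control."""
    name: str
    metric: str
    threshold: float
    owner: str  # team that gets notified when the rule fires


@dataclass
class ServiceSpec:
    """A hypothetical service definition that carries its own alert rules."""
    name: str
    alerts: list[AlertRule] = field(default_factory=list)


def deployment_gaps(service: ServiceSpec) -> list[str]:
    """Return the reasons a service is not ready to deploy (empty list = ready)."""
    problems = []
    if not service.alerts:
        problems.append(f"{service.name}: no alert rules defined")
    for rule in service.alerts:
        if not rule.owner:
            problems.append(f"{service.name}: alert '{rule.name}' has no owner")
    return problems


# Illustrative services: one ships with an owned alert, one ships with nothing.
checkout = ServiceSpec(
    name="checkout",
    alerts=[AlertRule("high_error_rate", "http_5xx_ratio", 0.05, owner="payments")],
)
legacy = ServiceSpec(name="legacy-batch")

print(deployment_gaps(checkout))
print(deployment_gaps(legacy))
```

A check like this could run in a delivery pipeline, so the “How are we monitoring this thing?” question gets asked before go-live instead of during the war room call.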
It’s not uncommon to find multiple observability/monitoring tools providing overlapping visibility. For example, how many tools does your organization have that monitor servers? This fragmentation can lead to:
Increased costs related to downtime. Fragmented tools often lead to blind spots and increase complexity when IT is restoring service. Imagine trying to solve a puzzle only to find that half the pieces are missing. Historically, “best-of-breed” or tech-niche monitoring tools have provided deep insights into specific areas of IT services.
Today’s applications, however, are built on a tightly interwoven mesh of infrastructure, applications, and code, necessitating an observability approach that offers a comprehensive view of all telemetry data in context. Without this, teams will struggle to connect the dots during incidents, leading to prolonged downtime and higher costs. The time spent looking in different tools adopted throughout different pockets of the organization also has a cost.
Missed cost optimization opportunities. In addition to the impact on operational efficiency, operating fragmented observability tools hinders the organization's ability to effectively manage the associated costs (direct and indirect). Examples include:
So, how do you go from this chaotic state to something clean, comprehensive, and constantly evolving? Enter the Observability Center of Excellence (CoE). This may not yet exist at your organization, but it needs to. I’ll discuss what the team is in the rest of this post. Once you understand the need, look for future posts explaining how to get started.
The Observability CoE isn’t just a team that tinkers with tools; it’s the nerve center of your observability practice. It’s a hands-on group focused on delivering business value (such as enabling smooth operations and faster development cycles) through practical and impactful observability efforts. This isn’t about reactive firefighting. It’s about laying down the foundation for a constantly maturing observability framework that works for your organization today and scales for tomorrow. Let’s clarify what the CoE is.
The CoE plays a key role in defining the rules and standards for observability across the organization. It creates frameworks, best practices, and processes that ensure everyone is aligned. A primary objective is to ensure the organization understands:
By embedding observability early in the software development process, the CoE ensures observability becomes a proactive effort rather than an afterthought, making it a core part of your organization’s culture.
A big misconception I hear more than I’d like to admit is that “observability is all about having a bunch of monitoring/observability tools”. Observability is about having complete, unified visibility into your infrastructure, applications, and business.
Instead of blindly building a toolbox, the CoE ensures you’re creating a cohesive framework in which tools work in tandem, providing comprehensive, objective-based visibility. The CoE helps you choose the right tools and rationalize some away if they’re redundant or no longer adding value. It is also responsible for selecting the correct tools for the needs of the business and making sure that they actually get used.
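As a rough sketch of what a first pass at tool rationalization could look like, the Python below maps each tool to the capabilities it is used for and reports where coverage overlaps. The tool names, capability labels, and both helper functions are made up for illustration; a real inventory would come from your own environment.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical inventory: tool name -> capabilities it is used for today.
inventory = {
    "tool_a": {"server_monitoring", "log_search"},
    "tool_b": {"server_monitoring", "apm"},
    "tool_c": {"server_monitoring"},
    "tool_d": {"synthetics"},
}


def overlapping_capabilities(tools: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each capability covered by more than one tool to those tools."""
    by_capability = defaultdict(list)
    for tool, capabilities in tools.items():
        for capability in capabilities:
            by_capability[capability].append(tool)
    return {
        capability: sorted(names)
        for capability, names in by_capability.items()
        if len(names) > 1
    }


def fully_redundant(tools: dict[str, set[str]]) -> list[str]:
    """List tools whose every capability is also covered by some other tool."""
    redundant = []
    for tool, capabilities in tools.items():
        covered_elsewhere = set().union(
            *(caps for name, caps in tools.items() if name != tool)
        )
        if capabilities <= covered_elsewhere:
            redundant.append(tool)
    return sorted(redundant)


print(overlapping_capabilities(inventory))
print(fully_redundant(inventory))
```

A report like this only starts the conversation. Whether an overlapping tool is truly redundant still depends on depth of coverage, cost, and who actually uses it, which is exactly the judgment the CoE is there to make.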
The CoE isn’t just about strategy; it also guides you in choosing the right tools for the job and cutting out those that aren’t delivering. Here’s what your observability capabilities should focus on. To ensure we are speaking the same lingo, let’s break down some critical (not all) observability capabilities:
A key strength of the CoE is that it operates without the confines of organizational silos. The cross-functional nature of the CoE unites expertise from across your organization. Properly implemented CoEs include representation from IT, operations, business teams, and even developers (yes, you too!).
This collaboration doesn’t only improve observability — it builds education and awareness across teams. CoE members function as observability ambassadors and evangelists, spreading knowledge and helping other teams see how observability impacts their work and business outcomes.
A solid observability practice doesn’t run only on good vibes. You need metrics. The CoE ensures that your observability framework is constantly measured, both by leveraging existing KPIs and by creating new ones specific to your observability practice. These KPIs help fine-tune and evolve the observability practice, keeping it aligned with your organization’s goals and growth.
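For a sense of what such KPIs might look like in practice, here is a small Python sketch that computes two candidates from a list of alert records: the share of alerts that were actionable (a rough proxy for alert noise) and the mean time to resolve for the actionable ones. The `Alert` record, its fields, and the choice of these two KPIs are assumptions for illustration; your CoE would define its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """A hypothetical alert record exported from an alerting tool."""
    opened: datetime
    resolved: datetime
    actionable: bool  # did a human actually need to do something?


def actionable_ratio(alerts):
    """Share of alerts that required action (higher means less noise)."""
    if not alerts:
        return 0.0
    return sum(a.actionable for a in alerts) / len(alerts)


def mean_time_to_resolve(alerts):
    """Average open-to-resolved duration across actionable alerts only."""
    durations = [a.resolved - a.opened for a in alerts if a.actionable]
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)


# Made-up sample data: two real incidents and two noisy alerts.
t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    Alert(t0, t0 + timedelta(minutes=30), actionable=True),
    Alert(t0, t0 + timedelta(minutes=5), actionable=False),
    Alert(t0, t0 + timedelta(minutes=90), actionable=True),
    Alert(t0, t0 + timedelta(minutes=1), actionable=False),
]

print(actionable_ratio(alerts))
print(mean_time_to_resolve(alerts))
```

Tracked over time, numbers like these show whether alert tuning is actually rebuilding trust in alerts or whether the “boy who cried wolf” effect is getting worse.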
The CoE is your secret weapon in creating a truly comprehensive observability practice. It’s not just about simplifying observability; it’s about turning it into a competitive advantage. By leveraging the CoE, you’ll move from reactive problem solving to proactive strategy development, driving governance and fostering collaboration.
It’s easy to say “we want observability,” but if everybody is in charge, nobody is in charge. Building an empowered team that deliberately breaks free of silos can have tangible benefits, such as the ones listed below.
With a CoE in place, you’ll be positioned to:
This is just the start. In future posts, we’ll explore how to build out your Observability CoE, outline some specific tasks you might consider implementing, measure its success, and optimize your observability practice over time. From integrations to tuning and automations, there’s plenty more to cover. Let’s build that CoE and take your observability game to the next level.
If you’re passionate about learning about observability, I’d encourage you to: