Chasing false alerts, or worse, having your system go down with no alerts or telemetry to give you a heads-up, is the nightmare we all want to avoid. If you’ve experienced this, you’re not alone. Before joining Splunk, I spent 14 years as an observability practitioner and leader at several Fortune 500 companies, and in my 2.5 years with Splunk I have had the opportunity to work with customers of all shapes and sizes. Whether you’re in a massive enterprise or a nimble startup, one desire comes up consistently: a comprehensive and agile approach to observability.
Let’s dive in and talk about the Observability Center of Excellence (CoE). If you’re tired of the same old fragmented observability approach, the CoE is the answer you’ve been looking for. Not only does it help simplify and streamline your observability strategy, but it also provides a framework to maintain and mature a leading observability practice over time.
Let’s call it like it is: observability challenges in most organizations extend beyond just the tools. Typically the problem lies in the lack of a unified strategy, fragmented tooling, and a reactive observability posture. In my experience, these are the most common observability-related problems seen in organizations:
Unfortunately, many organizations struggle with a high volume of inconsistent, low-confidence alert noise. The resulting “boy who cried wolf” effect increases the likelihood that genuine alerts are ignored, increases response times (MTTR), and ultimately increases downtime for mission-critical IT offerings. And it’s not just about low trust in the alerts themselves; there’s often a lack of confidence in the systems generating them. If your telemetry is off or your systems aren’t set up to provide meaningful data, how can you trust anything?
For many engineers, managing observability tools is more of a side hustle than a main gig. They’ve got primary responsibilities to manage. Unfortunately, this often means that observability ends up as an afterthought, with instrumentation added and tuned in the admin’s “free time.” The result is incomplete visibility, low confidence in alerting, and an overall decrease in the (perceived or actual) value of the organization’s observability tools.
Ever had that realization as you prepare to join the war room call: “How are we (or are we even) monitoring this thing?” This is the reality a lot of the time: observability is treated as an afterthought. Beyond the lack of visibility, post-deployment instrumentation adds implementation complexity and risk. Mature observability organizations shift observability left in the SDLC, with the goal of comprehensive observability at deployment/creation time. Using observability as code is one way to make this part of the normal rhythm of the business, as in the sketch below.
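As a simplified illustration of what shifting observability left can look like, here is a minimal Python sketch using the OpenTelemetry SDK, where the instrumentation ships with the application code and is reviewed like any other change. The service and span names are illustrative assumptions, not a prescribed pattern.

```python
# A minimal sketch of "shifting observability left": the service ships with its own
# instrumentation instead of having monitoring bolted on after deployment.
# Assumes the opentelemetry-sdk package is installed; names below are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Declare who we are; the service name travels with every span.
resource = Resource.create({"service.name": "checkout-service"})

# Wire up the tracer once, at application start-up.
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def place_order(order_id: str) -> None:
    # Every order is traced from day one, so "how are we monitoring this?"
    # is answered before the first war-room call.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...

if __name__ == "__main__":
    place_order("demo-123")
```

Because the tracer setup lives in the repository alongside the business logic, a code review can catch missing instrumentation long before the first incident does.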
It’s not uncommon to find multiple observability/monitoring tools providing overlapping visibility. For example, how many tools does your organization have that monitor servers? This fragmentation can lead to:
Increased Costs Related to Downtime
Fragmented tools often lead to blind spots and increase complexity when IT is restoring service. Imagine trying to solve a puzzle only to find that half the pieces are missing. Historically, “best-of-breed” or niche monitoring tools have provided deep insights into specific areas of IT services. However, today’s applications are built on a tightly interwoven mesh of infrastructure, applications, and code, necessitating an observability approach that offers a comprehensive view of all telemetry data in context. Without this, teams will struggle to connect the dots during incidents, leading to prolonged downtime and higher costs. The time spent hunting across the different tools adopted in different pockets of the organization also carries a cost.
Missed Cost Optimization Opportunities
In addition to impacting operational efficiency, operating fragmented observability tools hinders the organization’s ability to effectively manage the associated direct and indirect costs, such as overlapping license spend, duplicated infrastructure, and the administrative overhead of running multiple tools.
So, how do you go from this chaotic state to something clean, comprehensive, and constantly evolving? Enter the Observability Center of Excellence (CoE). This team may not yet exist at your organization, but it needs to. I’ll discuss what the team is in the rest of this post. Once you understand the need, look for future posts explaining how to get started.
The Observability CoE isn’t just a team that tinkers with tools; it’s the nerve center of your observability practice. It’s a hands-on group focused on delivering business value (such as enabling smooth operations and faster development cycles) through practical and impactful observability efforts. This isn’t about reactive firefighting; it’s about laying down the foundation for a constantly maturing observability framework that works for your organization today and scales for tomorrow. Let’s add some clarity about what the CoE actually is.
The CoE plays a key role in defining the rules and standards for observability across the organization. It creates frameworks, best practices, and processes that ensure everyone is aligned. A primary objective is to ensure the organization understands what to observe, how to observe it, and why it’s important. By embedding observability early in the software development process, the CoE ensures observability becomes a proactive effort rather than an afterthought, making it a core part of your organization’s culture.
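To make “standards” concrete, here is a small, hypothetical sketch of how a CoE-defined standard could be enforced early in the development process: a CI step that refuses to ship a service whose manifest is missing basic observability metadata. The manifest format, field names, and the idea of a JSON manifest are illustrative assumptions, not part of any specific Splunk workflow.

```python
# A hypothetical CI gate enforcing a CoE observability standard.
# Field names and the JSON manifest format are illustrative assumptions.
import json
import sys

REQUIRED_FIELDS = ["service_name", "owning_team", "alert_runbook_url", "slo_target"]

def check_manifest(path: str) -> list[str]:
    """Return the observability fields missing from a service manifest."""
    with open(path) as fh:
        manifest = json.load(fh)
    return [field for field in REQUIRED_FIELDS if not manifest.get(field)]

if __name__ == "__main__":
    missing = check_manifest(sys.argv[1])
    if missing:
        print(f"Observability standard not met; missing: {', '.join(missing)}")
        sys.exit(1)
    print("Observability standard met.")
```

A lightweight check like this is one way the “what, how, and why” of observability becomes part of everyday delivery rather than a post-deployment scramble.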
A big misconception I hear more than I’d like to admit is that “observability is all about having a bunch of monitoring/observability tools.” Observability is about having complete, unified visibility into your infrastructure, applications, and business. The CoE makes sure you’re not just blindly building a toolbox. It ensures you’re creating a cohesive framework in which tools work in tandem, providing comprehensive, objective-based visibility. The CoE helps you choose the right tools for the needs of the business, rationalize away those that are redundant or no longer adding value, and make sure the tools you keep actually get used.
Observability Tools and Capabilities: What Does Your Business Need?
The CoE isn’t just about strategy; it also guides you in choosing the right tools for the job and cutting out those that aren’t delivering. To ensure we’re speaking the same lingo, let’s break down some critical (though not all) observability capabilities:
A key strength of the CoE is that it operates without the confines of organizational silos. The cross-functional nature of the CoE unites expertise from across your organization. Properly implemented CoEs include representation from IT, operations, business teams, and even developers (yes, you too!). This collaboration doesn’t just improve observability; it builds education and awareness across teams. CoE members function as observability ambassadors or evangelists, spreading knowledge and helping other teams see how observability impacts their work and business outcomes.
A solid observability practice doesn’t just run on good vibes – you need metrics. The CoE ensures that your observability framework is constantly measured, leveraging (and creating) KPIs specific to your observability practice. These KPIs help fine-tune and evolve the observability practice, keeping it aligned with your organization’s goals and growth.
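To make that measurable, here is a minimal Python sketch of two KPIs an observability practice might track: the share of alerts that were actually actionable, and the mean time to restore. The record format, KPI names, and sample numbers are illustrative assumptions rather than a prescribed set of metrics.

```python
# A minimal sketch of tracking observability-practice KPIs.
# The record shape, KPI choices, and sample values are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Incident:
    detected_at: datetime   # when the alert fired
    resolved_at: datetime   # when service was restored
    actionable: bool        # did the alert require real action?

def alert_precision(incidents: list[Incident]) -> float:
    """Share of alerts that were actionable (quantifies 'cried wolf' noise)."""
    return sum(i.actionable for i in incidents) / len(incidents)

def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Average time from detection to restoration (MTTR)."""
    return timedelta(seconds=mean(
        (i.resolved_at - i.detected_at).total_seconds() for i in incidents
    ))

if __name__ == "__main__":
    now = datetime.now()
    sample = [
        Incident(now - timedelta(hours=3), now - timedelta(hours=2), actionable=True),
        Incident(now - timedelta(hours=6), now - timedelta(hours=5, minutes=30), actionable=False),
    ]
    print(f"alert precision: {alert_precision(sample):.0%}")
    print(f"MTTR: {mean_time_to_restore(sample)}")
```

Tracked over time, even a couple of simple numbers like these give the CoE an objective way to show whether the practice is maturing or drifting.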
The CoE is your secret weapon in creating a truly comprehensive observability practice. It’s not just about simplifying observability; it’s about turning it into a competitive advantage. By leveraging the CoE, you'll shift from reactive problem-solving to proactive strategy development, driving governance and fostering collaboration.
It's easy to say "we want observability," but if everybody is in charge, nobody is in charge. Building out an empowered team that deliberately breaks free of silos can deliver tangible benefits, such as the ones listed below.
With a CoE in place, you’ll be positioned to:
This is just the start. In future posts (dropping every 2 weeks), we’ll explore how to build out your Observability CoE, outline some specific tasks you might consider implementing, measure its success, and optimize your observability practice over time. From integrations to tuning and automations, there’s plenty more to cover. Let’s build that CoE and take your observability game to the next level.
If you’re passionate about learning more about observability, I’d encourage you to check out my teammates’ observability content on Splunk’s community blog and watch some of our latest videos on YouTube (Splunk Observability for Engineers).