Observability

October 08, 2024

Introducing the Observability Center of Excellence: Taking Your Observability Game to the Next Level

By Mike Simon

Chasing false alerts is annoying. Worse is the nightmare of your systems going down with no alerts or telemetry to give you a heads-up.

If you’ve experienced this, you’re not alone. Before joining Splunk, I spent 14 years as an observability practitioner and leader for several Fortune 500 companies. In my 2.5 years with Splunk, I have had the opportunity to work with customers of all shapes and sizes.

Whether you’re in a massive enterprise or a nimble startup, a consistent desire has arisen: a comprehensive and agile approach to observability.

Let’s dive in and talk about the Observability Center of Excellence (CoE). If you’re tired of the same old fragmented observability approach, the CoE is the answer you’ve been looking for. Why? Because this CoE approach will:

Help simplify and streamline your observability strategy.
Provide a framework to maintain and mature a leading observability practice over time.

The problem with current observability practices: A monitoring mess

Let’s call it like it is: observability challenges in most organizations often extend beyond just the tools. Typically, the problem lies in:

The lack of a unified strategy
Fragmented tools
A reactive observability posture

The Observability Center of Excellence (CoE) offers a solution. It simplifies and unifies your observability efforts while providing a framework to continuously evolve your practice.

In my experience, these are the most common observability-related problems seen in organizations:

Low confidence in alerts and systems

Unfortunately, many organizations struggle with a high volume of inconsistent, low-confidence alerts. The resulting “boy who cried wolf” effect makes it more likely that genuine alerts are ignored, which lengthens response times (MTTR) and ultimately increases downtime for mission-critical IT offerings.

In addition, it’s not just about low trust in the alerts themselves. There’s often a lack of confidence in the systems generating them, too. If your telemetry is off or your systems aren’t set up to provide meaningful data, how can you trust anything?

Tools administration: A third job for engineers

For many engineers, managing observability tools is more of a side hustle than a main gig. They’ve got primary responsibilities to manage. Unfortunately, this often means that observability ends up as an afterthought, and observability instrumentation is added and tuned in the admin’s “free time”. This results in:

Incomplete visibility
Low confidence in alerting
Decreased value (perceived or actual) of their observability tools

The “How are we monitoring this?” moment

Ever had that realization as you prepare to join the war room call: “How are we (or are we even) monitoring this thing?” This is the reality a lot of the time: observability is treated as an afterthought. Adding instrumentation after deployment also increases implementation complexity and risk.

Mature observability organizations shift observability left in the SDLC, with the goal being comprehensive observability at deployment/creation time. Using observability as code is one way to make this part of the normal rhythm of the business.
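To make observability as code a little more concrete, here is a minimal Python sketch of a check that could run in a delivery pipeline before a service ships. Everything in it is an assumption for illustration: the `ServiceSpec` and `AlertRule` shapes, the `observability_gaps` function, and the specific rules (an owner, at least one critical alert, at least one dashboard) are hypothetical, not a Splunk API or a prescribed standard.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlertRule:
    """One alert definition kept in version control next to the service code."""
    name: str
    metric: str
    threshold: float
    severity: str  # hypothetical levels, e.g. "critical" or "warning"


@dataclass
class ServiceSpec:
    """Hypothetical description of a service and its observability config."""
    name: str
    owner: str
    alerts: list[AlertRule] = field(default_factory=list)
    dashboards: list[str] = field(default_factory=list)


def observability_gaps(spec: ServiceSpec) -> list[str]:
    """Return human-readable gaps; an empty list means the spec passes."""
    gaps = []
    if not spec.owner:
        gaps.append(f"{spec.name}: no owning team declared")
    if not spec.alerts:
        gaps.append(f"{spec.name}: no alert rules defined")
    if not any(rule.severity == "critical" for rule in spec.alerts):
        gaps.append(f"{spec.name}: no critical-severity alert")
    if not spec.dashboards:
        gaps.append(f"{spec.name}: no dashboard defined")
    return gaps


# Illustrative service names and metrics only.
checkout = ServiceSpec(
    name="checkout",
    owner="payments-team",
    alerts=[AlertRule("high_error_rate", "http.5xx.rate", 0.05, "critical")],
    dashboards=["checkout-overview"],
)
new_service = ServiceSpec(name="recommendations", owner="")

for service in (checkout, new_service):
    for gap in observability_gaps(service):
        print(gap)
```

A pipeline step could fail the build whenever the returned list is non-empty, which is one way to stop the “How are we monitoring this?” question from first coming up during an incident.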

Fragmented tools: The true cost of disconnected observability

It’s not uncommon to find multiple observability/monitoring tools providing overlapping visibility. For example, how many tools does your organization have that monitor servers? This fragmentation can lead to:

Increased costs related to downtime. Fragmented tools often lead to blind spots and increase complexity when IT is restoring service. Imagine trying to solve a puzzle only to find that half the pieces are missing. Historically, “best-of-breed” or niche monitoring tools have provided deep insights into specific areas of IT services.

Today’s applications, however, are built on a tightly interwoven mesh of infrastructure, applications, and code, necessitating an observability approach that offers a comprehensive view of all telemetry data in context. Without this, teams will struggle to connect the dots during incidents, leading to prolonged downtime and higher costs. The time spent looking in different tools adopted throughout different pockets of the organization also has a cost.

Missed cost optimization opportunities. In addition to hurting operational efficiency, operating fragmented observability tools hinders the organization's ability to effectively manage the associated costs (direct and indirect). Examples include:

Increased licensing costs: More tools = more licenses = more cost. Fragmented tools make it difficult for organizations to optimize spend, leading to budget constraints that limit the ability to invest in closing other critical observability gaps.
Infrastructure overhead of self-hosted monitoring: If you’re running tools on-premises, you’re no stranger to the complexities and costs of maintaining the underlying infrastructure. Managing servers, storage, updates, and security patches not only consumes valuable resources — it also distracts from your focus on observability outcomes.
Training and knowledge gaps: Fragmented tools result in fragmented expertise. Each tool requires its own skills, spanning configuration, day-to-day use, and integrations.
Increased stress and workload: During troubleshooting, working out which systems to access and where the data you need lives raises stress and means that issues take longer to triage and resolve. This impacts customer satisfaction and ultimately the business. If left unchecked, engineering teams may become burned out, which may lead to attrition.

Introducing the Observability Center of Excellence: The answer to the madness

So, how do you go from this chaotic state to something clean, comprehensive, and constantly evolving? Enter the Observability Center of Excellence (CoE). This may not yet exist at your organization, but it needs to. I’ll discuss what the team is in the rest of this post. Once you understand the need, look for future posts explaining how to get started.

The Observability CoE isn’t just a team that tinkers with tools; it’s the nerve center of your observability practice. It’s a hands-on group focused on delivering business value (such as enabling smooth operations and faster development cycles) through practical and impactful observability efforts. This isn’t about reactive firefighting. It’s about laying down the foundation for a constantly maturing observability framework that works for your organization today and scales for tomorrow. Let’s look more closely at what the CoE is.

1. Governance, standards, and best practices

The CoE plays a key role in defining the rules and standards for observability across the organization. It creates frameworks, best practices, and processes that ensure everyone is aligned. A primary objective is to ensure the organization understands:

What to observe
How to observe it
Why it’s important

By embedding observability early in the software development process, the CoE ensures observability becomes a proactive effort rather than an afterthought, making it a core part of your organization’s culture.

2. Not just a collection of tools

A big misconception I hear more than I’d like to admit is that “observability is all about having a bunch of monitoring/observability tools”. In reality, observability is about having complete, unified visibility into your infrastructure, applications, and business.

Instead of blindly building a toolbox, the CoE ensures you’re creating a cohesive framework in which tools work in tandem, providing comprehensive, objective-based visibility. The CoE helps you choose the right tools for the needs of the business, rationalize away those that are redundant or no longer adding value, and make sure the tools you keep actually get used.
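A first pass at tool rationalization can be as simple as an inventory of which tools cover which capabilities. The sketch below is a hedged illustration in Python; the tool names and capability labels are placeholders I made up, and `overlapping_capabilities` is a hypothetical helper rather than part of any product.

```python
from __future__ import annotations

from collections import defaultdict


def overlapping_capabilities(
    tool_capabilities: dict[str, set[str]],
) -> dict[str, list[str]]:
    """Map each capability covered by more than one tool to those tools."""
    coverage: dict[str, list[str]] = defaultdict(list)
    for tool, capabilities in tool_capabilities.items():
        for capability in capabilities:
            coverage[capability].append(tool)
    return {
        capability: sorted(tools)
        for capability, tools in coverage.items()
        if len(tools) > 1
    }


# Placeholder inventory; a real one would come from your own tool audit.
inventory = {
    "tool_a": {"infrastructure", "logs"},
    "tool_b": {"infrastructure", "apm"},
    "tool_c": {"infrastructure"},
}

for capability, tools in overlapping_capabilities(inventory).items():
    print(f"{capability} is covered by: {', '.join(tools)}")
```

Overlap on its own doesn’t mean a tool should go, but it gives the CoE a short list of places to ask whether each tool is still earning its keep.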

Observability tools and capabilities: What does your business need?

The CoE isn’t just about strategy; it also guides you in choosing the right tools for the job and cutting out those that aren’t delivering. To make sure we’re speaking the same lingo, let’s break down some (not all) critical observability capabilities:

Infrastructure monitoring: Track the health of your core infrastructure.
Digital experience monitoring: Use real user monitoring (RUM) and synthetic testing to see what your users are experiencing, and quickly assess issues/impact.
Application performance monitoring (APM): Real-time insights into how your apps are performing and interacting with each other.
Centralized log management: Instead of scattered logs, centralize them to create a unified source of truth. Leverage your investment in log ingestion to solve observability use cases.
AIOps and event management: Let’s face it, IT breaks. When things hit the fan, the alerts start to fly. When they do, you need to find the signal in the alert noise. AIOps can correlate events and give you actionable insights. It’s also a centralized integration point, providing the ability to enrich IT component alerts with business context, so you can understand the true impact of incidents.
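To illustrate what “finding the signal in the alert noise” can mean in practice, here is a deliberately simple Python sketch of event correlation: alerts for the same service that arrive close together are grouped into one incident, and each group is enriched with business context. This is an assumption-heavy toy, not how any particular AIOps product works; the `Alert` shape, the five-minute window, and the `correlate` and `enrich` helpers are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    message: str
    timestamp: float  # seconds since epoch


def correlate(alerts: list[Alert], window_seconds: float = 300.0) -> list[list[Alert]]:
    """Group alerts per service; a gap longer than the window starts a new group."""
    groups: list[list[Alert]] = []
    latest: dict[str, list[Alert]] = {}  # most recent group seen for each service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if current is not None and alert.timestamp - current[-1].timestamp <= window_seconds:
            current.append(alert)
        else:
            current = [alert]
            latest[alert.service] = current
            groups.append(current)
    return groups


def enrich(group: list[Alert], business_context: dict[str, str]) -> dict[str, object]:
    """Attach a business-facing label to a correlated group of alerts."""
    service = group[0].service
    return {
        "service": service,
        "alert_count": len(group),
        "business_impact": business_context.get(service, "unknown"),
    }


# Made-up alerts and context for illustration.
alerts = [
    Alert("db", "cpu high", 0.0),
    Alert("db", "query latency high", 60.0),
    Alert("web", "5xx rate high", 90.0),
    Alert("db", "disk nearly full", 1000.0),
]
context = {"db": "online checkout", "web": "storefront"}

for group in correlate(alerts):
    print(enrich(group, context))
```

Real correlation usually also considers topology and dependencies between services, but even this simple grouping shows how several alerts can collapse into fewer incidents with a business label attached.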

3. Cross-functional collaboration and education

A key strength of the CoE is that it operates without the confines of organizational silos. The cross-functional nature of the CoE unites expertise from across your organization. Properly implemented CoEs include representation from IT, operations, business teams, and even developers (yes, you too!).

This collaboration doesn’t only improve observability — it builds education and awareness across teams. CoE members function as observability ambassadors and evangelists, spreading knowledge and helping other teams see how observability impacts their work and business outcomes.

4. Measurable success

A solid observability practice doesn’t run only on good vibes; you need metrics. The CoE ensures that your observability framework is constantly measured, leveraging existing KPIs and creating new ones specific to your observability practice. These KPIs help fine-tune and evolve the observability practice, keeping it aligned with your organization’s goals and growth.
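As a rough idea of what such KPIs might look like, the Python sketch below computes three candidate measures from incident and alert counts. These particular KPIs are my own illustrative picks, not a definitive list: mean time to resolve (one common reading of MTTR), the share of incidents first detected by an alert rather than reported by a person, and the share of alerts that were actionable. The `Incident` record and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    detected_at: float       # seconds since epoch
    resolved_at: float       # seconds since epoch
    detected_by_alert: bool  # False if a person reported it before any alert fired


def mean_time_to_resolve(incidents: list[Incident]) -> float:
    """Average seconds from detection to resolution (0.0 if there are no incidents)."""
    if not incidents:
        return 0.0
    return sum(i.resolved_at - i.detected_at for i in incidents) / len(incidents)


def alert_detection_rate(incidents: list[Incident]) -> float:
    """Fraction of incidents that monitoring caught first."""
    if not incidents:
        return 0.0
    return sum(1 for i in incidents if i.detected_by_alert) / len(incidents)


def actionable_alert_ratio(total_alerts: int, actionable_alerts: int) -> float:
    """Fraction of alerts that led to real action; a rough proxy for alert confidence."""
    if total_alerts == 0:
        return 0.0
    return actionable_alerts / total_alerts


# Made-up numbers for illustration.
incidents = [
    Incident(detected_at=0.0, resolved_at=600.0, detected_by_alert=True),
    Incident(detected_at=100.0, resolved_at=1900.0, detected_by_alert=False),
]

print(mean_time_to_resolve(incidents))
print(alert_detection_rate(incidents))
print(actionable_alert_ratio(total_alerts=200, actionable_alerts=50))
```

Tracking numbers like these over time, rather than as one-off snapshots, is what lets the CoE show whether alert confidence and response times are actually improving.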

Why the CoE is the secret sauce to comprehensive observability

The CoE is your secret weapon in creating a truly comprehensive observability practice. It’s not just about simplifying observability; it’s about turning it into a competitive advantage. By leveraging the CoE, you'll move from reactive problem solving to proactive strategy development, driving governance and fostering collaboration.

It's easy to say "we want observability," but if everybody is in charge, nobody is in charge. Building out an empowered team that deliberately breaks free of silos can deliver tangible benefits.

With a CoE in place, you’ll be positioned to:

Eliminate redundant tools and reduce costs.
Build a consistent, reliable framework for observability.
Empower teams to collaborate, educate, and innovate.
Tie observability to business value, ensuring you’re creating actionable insights that move the needle for your organization.

What’s next? The journey to maturity

This is just the start. In future posts, we’ll explore how to build out your Observability CoE, outline some specific tasks you might consider implementing, measure its success, and optimize your observability practice over time. From integrations to tuning and automations, there’s plenty more to cover. Let’s build that CoE and take your observability game to the next level.

If you’re passionate about learning about observability, I’d encourage you to:

Check out my teammates' observability articles and tutorials on Splunk Community
Watch our latest videos: Splunk Observability for Engineers

Mike Simon

Mike Simon is a seasoned observability leader and Developer Evangelist at Splunk, with over 16 years of experience in IT operations. Passionate about driving best practices in observability, he has a track record of optimizing monitoring frameworks for several Fortune 500 companies. With expertise spanning AIOps, cloud-native technologies, and digital experience monitoring, Mike is dedicated to empowering organizations to achieve comprehensive observability.
