In this final post of our Disconnecting Nagios series, we share the lessons we learned along the way and the unexpected benefits of moving to a single, consolidated monitoring system.
When we started evaluating Nagios, we knew we had to address a few key issues with the tool: inaccurate or late alerts, manual intervention for maintenance, and integration with our daily development and operational practices. We described the process of gaining confidence in our alerts and consolidating into one monitoring system in our previous post. However, migrating away from Nagios brought a handful of additional benefits we were not expecting when we started.
Relying on one system also allowed us to view our environment as a whole. We could view different types of alerts together and draw conclusions about the overall health of our system. Previously, relying on Nagios as an alerting system to restart something, while relying on SignalFx to track other metrics, made us miss how these two measures of health fit together. Once we switched to one monitoring system, we could see a bigger story by correlating across all of these measures.
For example, we began to notice patterns in the downed-host and downed-service alerts. In one instance, we uncovered an OOM problem in one of our components by combining service-down alerts with other metrics. Furthermore, we could draw conclusions about the robustness of our overall system and of individual services based on how a single host or service being down was handled.
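To make that kind of correlation concrete, here is a minimal, self-contained Python sketch of the idea: checking whether service-down events were preceded by steadily rising memory usage. Memory is a natural metric to pair with an OOM, but the data, metric shape, and thresholds below are illustrative assumptions, not our production detectors.

```python
# Hypothetical sketch: correlate service-down events with a memory time series.
# The data, window size, and growth threshold are illustrative, not production values.

def rising_before(event_ts, memory_series, window=5, min_growth=0.10):
    """Return True if memory grew by at least `min_growth` (as a fraction of its
    starting value) over the `window` samples preceding `event_ts`."""
    # memory_series: list of (timestamp, used_fraction) tuples, sorted by time
    before = [v for t, v in memory_series if t < event_ts][-window:]
    if len(before) < window:
        return False
    return (before[-1] - before[0]) >= min_growth * before[0]

def suspected_oom_events(service_down_events, memory_series):
    """Pick out the down events that look like they followed a memory climb."""
    return [ts for ts in service_down_events if rising_before(ts, memory_series)]

# Example with made-up data: one sample per minute, service drops at t=9.
memory = [(t, 0.5 + 0.05 * t) for t in range(10)]   # steadily climbing memory
downs = [9]
print(suspected_oom_events(downs, memory))          # -> [9]
```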
The ease of setting up new alerts in the SignalFx system, combined with the reliability of those alerts, allowed us to identify services not previously covered in Nagios. Instead of mistrusting downed-service and downed-host alerts, we now wanted to set up even more of them. For example, when we first set up replacements for the ping checks and service checks, we only replaced what Nagios was covering. Once that replacement was done, we wanted coverage for all of our services and quickly put it in place.
We also realized that the difficulty and inconvenience of configuring alerts in Nagios meant that our engineers avoided setting up necessary alerts, and also avoided enhancing and expanding existing ones. Because of SignalFx’s dimension-driven model, it only takes a few minutes to set up each alert, which means it takes an engineer very little effort to create an alert at the moment she realizes the importance of monitoring that specific service.
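As a rough illustration of the dimension-driven idea, here is a minimal sketch assuming the open-source `signalfx` Python client; the metric and dimension names are hypothetical, not our production schema. Every service reports the same heartbeat metric with its own dimensions, so a single detector on that metric can cover any service that shows up.

```python
# Minimal sketch using the open-source `signalfx` Python client (pip install signalfx).
# The metric and dimension names here are hypothetical examples.
import signalfx

sfx = signalfx.SignalFx()
ingest = sfx.ingest('YOUR_ACCESS_TOKEN')   # org access token
try:
    # Each service emits the same metric; its identity lives in the dimensions.
    # A single "heartbeat missing" detector on service.heartbeat, grouped by the
    # service dimension, then covers every service and host that reports it.
    ingest.send(gauges=[{
        'metric': 'service.heartbeat',
        'value': 1,
        'dimensions': {'service': 'cassandra', 'host': 'cass-01'},
    }])
finally:
    ingest.stop()   # flush the background sender
```

Because the alert rule keys off dimensions rather than per-host configuration files, adding coverage for a new service is a matter of emitting the metric, not editing central alert configuration.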
The same is true for expanding the features of an alert. For example, as described in the previous post, we integrated the process of putting alerts into downtime into our orchestration tool MaestroNG, which meant services that were intentionally stopped did not trigger unnecessary alerts. When we discovered that decommissioned Cassandra services also triggered alerts, we were able to quickly and easily add ‘service downtime’ to our decommissioning tools the same way we integrated downtime into MaestroNG. Looking back, our engineers would likely have accepted (and ignored!) the unnecessary alerts if we had continued using Nagios.
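The downtime integration itself boils down to a small hook that the orchestration and decommissioning tools call before they intentionally stop something. The sketch below shows the shape of that hook in Python; the endpoint, payload, and token handling are assumptions for illustration rather than the exact SignalFx muting API, so consult the API documentation for the real schema.

```python
# Hypothetical sketch of a "mute before you stop it" hook for decommission tooling.
# The endpoint and payload below are illustrative assumptions, not the exact
# SignalFx muting API; check the API docs for the real schema.
import time
import requests

API_BASE = 'https://api.signalfx.com/v2'      # assumed base URL
API_TOKEN = 'YOUR_ORG_TOKEN'                  # assumed auth via X-SF-TOKEN header

def mute_alerts_for_host(host, minutes, reason):
    """Create a temporary muting window for one host before it is stopped."""
    now_ms = int(time.time() * 1000)
    resp = requests.post(
        f'{API_BASE}/alertmuting',
        headers={'X-SF-TOKEN': API_TOKEN},
        json={
            'description': reason,
            'startTime': now_ms,
            'stopTime': now_ms + minutes * 60 * 1000,
            'filters': [{'property': 'host', 'propertyValue': host}],
        },
    )
    resp.raise_for_status()

def decommission(host):
    """Called by a (hypothetical) decommission tool for a Cassandra node."""
    mute_alerts_for_host(host, minutes=60, reason=f'decommissioning {host}')
    # ... actual decommission steps: drain, remove from the ring, power off ...
```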
Because our alert setup is so tightly connected to our environment, it became imperative that the environment itself be configured correctly. When we began setting up the SignalFx replacement alerts, we found a handful of issues, such as SaltStack and collectd misconfigurations for various services. We wouldn’t have caught and fixed these if we hadn’t tried to integrate monitoring so closely with our environment.
This process of disconnecting Nagios made us reflect not only on how we relied on certain alerts, but also more generally how we should manage our production environment.
One of the biggest lessons we learned through this process is the importance of deciding what you want to accomplish and then figuring out how a tool can help you accomplish that goal. For months after the launch of SignalFx, our engineering team kept Nagios around because we thought we had to: there was certain functionality we believed we needed, and no equivalent existed in the monitoring platform we were building. However, this only created more tasks and complicated the process.
We initially thought that our two systems were doing different, mutually exclusive checks — that one could not replace the other. However, when we pushed ourselves to take a step back and really lay out what our goals were, we realized that we could accomplish all our goals in just one system. And carefully maintaining one system and adapting all checks to its methods and capabilities is far better than maintaining two systems.
This is often the case with a variety of tools: you assume a tool is still doing its job even though you have put no effort into keeping its configuration up to date. We ended up not only neglecting Nagios but ignoring it altogether, which hurt us in the long run. Why put even a minimal amount of time into maintaining something you don’t want to bother with, especially when you know you’ve been neglecting it and therefore doubt its accuracy?
It’s easy to make excuses when you’re ramping up quickly and running in a million different directions; everyone assumes there will always be time later to learn and correctly integrate new tools. But in reality, very few teams (SignalFx included!) ever get around to fixing the issues in that new tool, even when they know it is hindering them.
Reflecting on the process, we learned that it’s imperative that any tool used for any purpose be fully integrated into the development team’s process (and preferably via the accepted tools used to perform various tasks). By leaving Nagios downtime configuration as a separate, manual step, we frequently forgot to do it. Furthermore, by leaving this step external to the development of a particular component, many of our engineers got annoyed at having to take extra steps to work with it. The result was incomplete monitoring for some components.
If we had implemented a monitoring system as part of the development process from the beginning, we would not have needed to go through this entire process.
When all aspects of maintaining a service and its host are in the same source code and directory, it’s much easier to spend your time focusing on what’s really important: building product.
By disconnecting Nagios and consolidating to one system, we felt confident relying on a system fully integrated with the way we thought about our overall setup. Rather than spending time retrofitting an alert into an external monitoring system, we simply collected new metrics and added alerts for new indicators of the health of our components. Creating alerts became part of how we evaluated overall component health. Furthermore, we configured every service as it entered our environment, not after the fact.
This process of looking at all types of metrics together made us reflect on the alerts we were relying on to manage our production environment. The alerts we had in place with Nagios tended to illustrate individual pieces of a broader system. Consequently, we could be misled into viewing the environment as individual pieces of hardware or software, which kept us from seeing how a service or host being up or down acted as one symptom in conjunction with many others, and distracted us from the system overall and from what was actually important to it.
Replacing Nagios caused us to step back and think about the types of issues we really needed to be alerted for. In other words, trying to replicate a check on whether a host or service is up or down caused us to reconsider why this sort of check was important and how it needed to be put into context with other symptoms.
After consolidation, we began to use these host-down and service-down alerts as part of our analysis of the system as a whole. We charted and tracked service-level patterns and trends over time, rather than looking at just one signal to restart something when hardware or software failed.
This process made us evaluate the capabilities required of a monitoring system and what it means to structure alerts that are accurate, timely, and actionable. While we’ve taken this first step in auditing Nagios and consolidating to one monitoring system, we’re constantly evaluating improvements more broadly in how we manage and operate our evolving environment. We look forward to sharing what we do next!
----------------------------------------------------
Thanks!
Anne Ustach