Once you’ve reached the point where you want to deploy your machine learning models to production, you will eventually need to monitor their operations and performance. You might also want to receive alerts in case of unexpected behavior or inconsistencies in your model or your data quality. This is where you most likely start learning about the various aspects of Machine Learning Operations (MLOps). The term “MLOps” is a compound of machine learning (ML), development (DEV) and operations (OPS) and was coined to describe a “practice for collaboration and communication between data scientists and operations professionals to help manage production ML (or deep learning) lifecycle”, according to Wikipedia.
The real challenge in production-grade machine learning systems is often not the ML code itself, but the surrounding components that make up the whole system. The NIPS paper “Hidden Technical Debt in Machine Learning Systems” summarizes this concept in the figure below:
As a market leader in IT operations, Splunk is widely used to collect logs and metrics from all kinds of IT components and systems such as networks, servers, middleware, applications and generally any IT service stack. When you build and run a machine learning system in production, you most likely also rely on (cloud) infrastructure, components, services, and application code that produce various logs and metrics. Those data sources enable you to perform root cause analysis during development and in production, so you can analyze exactly what happens when things go wrong or break. You can also continuously monitor everything that’s happening and proactively get alerted on deviations from the expected behavior of your ML system, e.g. when you face severe model degradation or when the data quality for (re)training or inference changes dramatically. You may also want to lower your service costs or optimize your infrastructure as well as your actual model code, making it more robust or more efficient. This is where Splunk’s Data-to-Everything Platform can play to its full strength, providing you with answers to all sorts of questions you might have about your machine learning operations.
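For instance, a simple scheduled search can already watch the quality of the incoming inference data. The feature field names and thresholds in the following sketch are placeholders you would adapt to your own data:
index=my_latest_data earliest=-24h
| eval missing_feature=if(isnull(feature_a) OR isnull(feature_b), 1, 0)
| bin _time span=1h
| stats count as events sum(missing_feature) as missing by _time
| eval missing_ratio=round(missing/events, 3)
| where missing_ratio > 0.05 OR events < 1000
Saved as an alert, this triggers whenever the share of events with missing feature values exceeds 5% or the hourly event volume drops below the expected baseline.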
For a more complete example, let’s assume you have defined a machine learning model that is continuously applied to new data to provide an analyst with results. The model is automatically retrained on a schedule so it stays up to date with the latest data. In the following example, we use Splunk’s Machine Learning Toolkit to retrain a classifier model and generate simple scoring metrics for the training data. This helps keep track of the model’s key performance indicators (KPIs), which we want to stay within expected limits. It’s fairly easy to wrap this into a Splunk dashboard that you can check at any time to quickly inspect your model’s status, and to set up alerts that proactively notify you of any model issues.
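A scheduled retraining search in the Machine Learning Toolkit could look roughly like the following sketch. The RandomForestClassifier algorithm, the feature selection and the partition filter are assumptions for illustration; the model name usermodel and the target field user match the scoring search further below:
index=my_latest_data partition="training"
| fit RandomForestClassifier user from * into usermodel
Saved as a scheduled report, this refits the model on the latest training partition, so that the scoring search below always evaluates the current state of the model.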
For this example you can easily collect all those KPIs into a Splunk metrics store with a bit of SPL:
...
index=my_latest_data partition="test"
| apply usermodel as predicted_user
| multireport
[ | score confusion_matrix user against predicted_user | untable Label Operation Score | eval Label="confusion_matrix.".Label ]
[ | score precision_recall_fscore_support user against predicted_user | untable Metric Operation Score | rename Metric as Label | eval Label="score.".Label ]
[ | summary usermodel | eval Label = "feature_importance" | rename feature as Operation importance as Score | table Label Operation Score | eval Operation=replace(Operation,"=","#")]
| rename Label as metric_name Operation as operation Score as _value
| eval metric_name = "mlops."."username.".metric_name
| eval _time=now()
| table _time metric_name operation _value
| mcollect index=my_model_metrics
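This search applies the stored model to the test partition, computes the confusion matrix, the precision/recall/f-score metrics and the model’s feature importances, and writes everything into the my_model_metrics metrics index with mcollect. From there you can chart or alert on any of these KPIs over time with mstats. The metric name and threshold in the following sketch are assumptions (the exact metric names depend on the output of the score command):
| mstats latest(_value) as latest_value WHERE index=my_model_metrics AND metric_name="mlops.username.score.precision" BY operation
| where latest_value < 0.8
Scheduled as an alert, a search like this fires as soon as one of the collected model scores drops below the limit you consider acceptable.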
Of course, you can take many more data sources into account and correlate them with one another to best meet the requirements and SLAs of your machine learning solution.
If you want even more observability for your ML pipelines and systems, including deep instrumentation and code tracing, you can use SignalFx. There are many existing integrations for common components of ML architectures such as Spark, Kafka, databases, and cloud-specific services, which makes it easier to track down and resolve bottlenecks or performance issues in your overall system. When it comes to complex ML workloads and code running on distributed container infrastructures, traceability is key: it helps you identify the real hot spots and root causes, so you can optimize your code and system performance accordingly.
The SignalFx platform is the only real-time cloud monitoring and observability platform that fully embraces open standards. Splunk is a main contributor to OpenTelemetry, supporting vendor-agnostic instrumentation and data collection, including typical MLOps stacks such as Java and Python. And last but not least, when you run intense ML workloads, cost matters. The good news is that SignalFx offers out-of-the-box capabilities to optimize and control your cloud spend.
Happy Splunking,
Philipp