In this blog post, we discuss using Telegraf as your core metrics collection platform with the Splunk App for Infrastructure (SAI) version 2.0, the latest version of Splunk’s infrastructure monitoring app that was recently announced at Splunk .conf19.
This blog post assumes you already have some familiarity with Telegraf and Splunk. We provide steps and examples along the way to make sense of everything, along with links to resources for more advanced workflows and considerations.
Telegraf is a metrics collection engine that runs on virtually any platform. It can collect metrics from virtually any source, and more inputs are being added pretty regularly. Most importantly, as of version 1.8.0, Telegraf can send metrics directly to your Splunk platform deployment.
Telegraf is a modular system that allows you to define inputs, processors, aggregators, serializers, and outputs. Inputs, as you would expect, are the sources of metrics. Processors and aggregators are internal methods that allow you to rename things, build internal aggregations, and define almost as many other user-defined customizations as you want. Serializers and outputs are where the magic happens: they define the format of the output data, and where and how to send it.
Version 1.8.0 includes a splunkmetric serializer, which takes metrics data from Telegraf's internal structure and formats it to be compatible with Splunk's metrics format. You define the serializer inside each output stanza (for example, [[outputs.file]] or [[outputs.http]]), which lets you format your metrics in different ways for different destinations.
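For example, here's a minimal sketch (the file paths are placeholders) of two file outputs, each with its own serializer, writing the same metrics in two different formats:

[[outputs.file]]
  ## Splunk-formatted copy of the metrics
  files = ["/tmp/metrics.splunk"]
  data_format = "splunkmetric"

[[outputs.file]]
  ## The same metrics in InfluxDB line protocol
  files = ["/tmp/metrics.influx"]
  data_format = "influx"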
There are two ways to send metrics data from Telegraf to Splunk: write the metrics to a file that a Splunk Universal Forwarder (or Heavy Forwarder) monitors, or send the metrics directly to the HTTP Event Collector (HEC).
Before we talk about configuring outputs, we need to configure the splunkmetric serializer so the metrics data is properly formatted before anything is sent to Splunk.
To enable the splunkmetric serializer on a supported output configuration, set the following: data_format = "splunkmetric"
This configuration tells Telegraf that all metrics data the output sends will be in a Splunk-compatible format. The data format looks like this:
{
  "_value": 0.6,
  "cpu": "cpu0",
  "dc": "mobile",
  "metric_name": "cpu.usage_user",
  "user": "ronnocol",
  "time": 1529708430
}
Specifying the data_format works great for sending everything to either a Splunk Universal Forwarder or Heavy Forwarder. If you decide to send data to Splunk by writing to the HEC, you need to wrap the event in a bit of metadata. To tell Telegraf you want to output a format that's compatible with the HEC, set the following: splunkmetric_hec_routing = true
This setting modifies the JSON so that important fields such as time and host are in a wrapper around the event itself. The resulting data looks like this:
{
  "time": 1529708430,
  "event": "metric",
  "host": "patas-mbp",
  "fields": {
    "_value": 0.6,
    "cpu": "cpu0",
    "dc": "mobile",
    "metric_name": "cpu.usage_user",
    "user": "ronnocol"
  }
}
Now that we know how to enable the splunkmetric serializer for either output, and what the outputs look like for each configuration, let’s configure the output.
The file output is the output Telegraf uses to write metrics data to a file. Configure your Splunk Universal Forwarder to monitor that file, and it works just like monitoring any system log file.
The output stanza looks something like this:
[[outputs.file]]
## Files to write to, "stdout" is a specially handled file.
files = ["/tmp/metrics.out"]
## Data format to output.
data_format = "splunkmetric"
You'll need to associate the output file with a metrics source type. Create a props.conf stanza on your Splunk indexer or Splunk Universal Forwarder, whichever receives the metrics data first:
[telegraf]
category = Metrics
description = Telegraf Metrics
pulldown_type = 1
DATETIME_CONFIG =
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = true
disabled = false
INDEXED_EXTRACTIONS = json
KV_MODE = none
TIMESTAMP_FIELDS = time
Next, set the source type to telegraf in the inputs.conf stanza on the Splunk Universal Forwarder. If you use this type of configuration, you should also set up an appropriate log rotation policy to prevent your disks from filling up (see the sketch after the inputs.conf stanza below).
You could have an inputs.conf stanza that looks like this to process the metrics data file from Telegraf:
[monitor:///tmp/metrics.out]
index = telegraf_metrics
sourcetype = telegraf
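As a starting point for that rotation policy, here's a minimal logrotate sketch; the file path matches the example above, and the thresholds are assumptions you'd tune for your own environment:

/tmp/metrics.out {
    # rotate once the file reaches 100 MB (example threshold)
    size 100M
    rotate 4
    # copytruncate keeps Telegraf's open file handle valid across rotations
    copytruncate
    missingok
    notifempty
    compress
}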
Another way to use this output is to write to stdout and launch Telegraf as a scripted input, which is what we do at TiVo. We'll describe this in more detail in a future blog post.
The HTTP output is the output Telegraf uses to write metrics data to the HEC. Configuring Telegraf to output directly to the HEC is not quite as straightforward as the file-based configuration because you have to deal with authentication using HEC tokens. Fortunately, the Telegraf HTTP output gives us the tools we need to make this work.
Before starting down this road, you're going to need a couple pieces of information from your Splunk administrator: the URL of your HEC endpoint and a valid HEC token.
This is what an [[outputs.http]] stanza should look like:
[[outputs.http]]
url = "https://localhost:8088/services/collector"
# insecure_skip_verify = false
## Data format to output.
data_format = "splunkmetric"
## Provides time, index, source overrides for the HEC
splunkmetric_hec_routing = true
## Additional HTTP headers
[outputs.http.headers]
Content-Type = "application/json"
Authorization = "Splunk xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
X-Splunk-Request-Channel = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
We removed most of the comments from the stanza so we could focus on the important parts, but the Telegraf HTTP output also has options for HTTP basic authentication. The comments in the default Telegraf configuration file are exhaustive and should help you configure settings for any security requirement you may have. Configure the HEC endpoint in the url setting. Because we're sending metrics events, make sure you're not using the raw endpoint.
You'll need to set the following: data_format = "splunkmetric"
And then enable the HEC format by setting this: splunkmetric_hec_routing = true
The other important piece of information you got from your Splunk administrator is the HEC token. In the example above, you would replace the strings of x's with that token. Set it in the [outputs.http.headers] stanza so Telegraf attaches it as a header on every request it sends to the HEC.
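Before pointing Telegraf at the HEC, it can be worth verifying the URL and token with a quick test event. Here's a hedged sketch, assuming the same localhost endpoint from the example above and using the HEC-wrapped metric format shown earlier; replace the x's with your real token:

curl -k https://localhost:8088/services/collector \
  -H "Authorization: Splunk xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
  -d '{"event": "metric", "host": "test-host", "fields": {"metric_name": "cpu.usage_user", "_value": 0.6}}'

A {"text":"Success","code":0} response means the endpoint and token are good. The -k flag skips certificate verification, the same trade-off as setting insecure_skip_verify = true in the Telegraf config.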
One of the nice things about outputting directly to the HEC is that you can now collect metrics on systems where you don't already run, or can't run, a Splunk Universal Forwarder, such as platforms Splunk doesn't support or small form factor computers like a Raspberry Pi.
Now that you have your data flowing into Splunk with either the HEC or a Splunk Universal Forwarder, you’ll want to be able to turn those metrics into usable eye candy.
When Splunk introduced the metrics store, they also added two SPL commands to help you access the metrics data: mstats and mcatalog. While I don't plan on making this post an exhaustive lesson on these commands, this example shows that drawing the CPU graph above is as simple as a bit of SPL:
| mstats sum(cpu.usage_idle) as usage_idle, sum(cpu.usage_iowait) as usage_iowait, sum(cpu.usage_irq) as usage_irq, sum(cpu.usage_nice) as usage_nice, sum(cpu.usage_softirq) as usage_softirq, sum(cpu.usage_steal) as usage_steal, sum(cpu.usage_system) as usage_system, sum(cpu.usage_user) as usage_user WHERE cpu!="cpu-total" AND (index="telegraf" OR index="metrics") host=ronnocol.tivo.com span=30s
| timechart minspan=30s bins=2000 partial=f avg(usage_idle) as Idle, avg(usage_nice) as Nice, avg(usage_user) as User, avg(usage_irq) as Irq, avg(usage_softirq) as SoftIrq, avg(usage_iowait) as IoWait, avg(usage_steal) as Steal, avg(usage_system) as System
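mcatalog is handy for discovering what's actually in your metrics index. A quick sketch, assuming the telegraf_metrics index from the earlier inputs.conf example:

| mcatalog values(metric_name) WHERE index="telegraf_metrics"

This lists every metric name Telegraf has sent, which makes it easier to build mstats searches like the one above.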
The newest version of SAI, version 2.0.0, which Splunk announced at .conf19, includes Telegraf-specific entity discovery and dashboards. Telegraf is treated the same as other metrics collectors (e.g. collectd) in SAI. Entities are auto-discovered, appropriate graphs are drawn in the Entity Overview, and potentially interesting graphs are pre-populated in the Analysis Workspace. You can set alerts, groups, etc. with your Telegraf-based nodes just like you can with any of the other SAI-supported collection engines.
Only one modification is required to monitor Telegraf metrics with SAI: prepend all of the Telegraf metric names with "telegraf."
Set the following in telegraf.conf:
[[processors.override]]
name_prefix = "telegraf."
This lets SAI know that the source of the metrics is Telegraf and configure entity discovery and out-of-the-box dashboards accordingly.
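Keep in mind that the prefix also changes the metric names you search for. Here's a sketch of an mstats search adjusted for the prefix (the index name is an assumption):

| mstats avg(telegraf.cpu.usage_user) WHERE index="telegraf_metrics" AND cpu!="cpu-total" span=30s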
That’s it, that’s all there is to it. Now that Telegraf is prefixing all of the metric names with telegraf., your devices will show up in the SAI entities list. Those two lines provide you with wonderful prebuilt charts like these:
Here’s a sample config in use at TiVo to collect machine metrics with Telegraf and send them to Splunk for monitoring in SAI:
[global_tags]
telegraf-profile = "sai-default"
[agent]
interval = "30s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = true
logfile = ""
hostname = ""
omit_hostname = false
[[outputs.file]]
files = ["stdout"]
data_format = "splunkmetric"
[[processors.override]]
name_prefix = "telegraf."
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
report_active = false
fieldpass = ["usage_idle","usage_iowait","usage_irq","usage_nice","usage_softirq","usage_steal","usage_system","usage_user"]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
fielddrop = ["inodes*"]
[[inputs.diskio]]
[[inputs.kernel]]
fielddrop = ["boot_time"]
[[inputs.mem]]
fielddrop = ["high*","low*","huge_page*","commit*","dirty","inactive","wired"]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.kernel_vmstat]]
fieldpass = ["pgpgin", "pgpgout", "pswpin", "pswpout", "pgfault"]
[[inputs.net]]
ignore_protocol_stats = true
[[inputs.netstat]]
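To try out the config, run Telegraf against it. A minimal sketch, assuming the file is saved at /etc/telegraf/telegraf.conf (the path is an assumption):

telegraf --config /etc/telegraf/telegraf.conf

Because the file output in this config writes to stdout, you would normally run this under a wrapper, such as the scripted input approach mentioned earlier, rather than interactively.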
Telegraf is a highly configurable metrics collector that runs on a variety of platforms, collects metrics from a variety of sources, and allows you to use that data in Splunk. With the release of SAI 2.0.0, you can even get all of the great functionality that SAI provides for every other integration on your Telegraf nodes.
For further information, check out the following resources:
The Telegraf integration with the Splunk App for Infrastructure is supported as part of the open source Splunk metrics serializer project. For questions about setting up and managing Telegraf to send data to Splunk, please see the metrics serializer section of the Telegraf project.
You can also ask questions in the splunk-usergroups Slack workspace. Information about signing up can be found here. Look for the #it-infra-monitoring channel.
This post was written primarily by Lance O'Connor, Principal Architect at TiVo, and Nick Tankersley, Principal Product Manager at Splunk, tagged along for the ride.
----------------------------------------------------
Thanks!
Nick Tankersley