The Splunk Metrics Store offers users a highly scalable, blazingly fast way to ingest and search metrics across their environments. There are many ways of generating metrics and sending them to Splunk, including both the collectd and statsd agents, but this post will focus on Telegraf as a means to achieve this. For more information on the Splunk Metrics Store and why you should be using it, check out "Metrics to the Max! Dramatic Performance Improvements for Monitoring and Alerting on Metrics Data."
Telegraf is a widely used agent, written in Go, for collecting, processing, aggregating, and writing metrics. It's awesome and supports inputs for everything from SQL Server to Minecraft. It's an entirely plugin-driven platform, and it's platform agnostic, running on most common operating systems. This post will focus on running Telegraf on *nix variants; a follow-up blog will focus on running Telegraf on Windows.
The following is based on the amazing work from the team at TiVo, especially Lance O'Connor. See this page on GitHub for much more information.
The design goals for Telegraf are to have a minimal memory footprint with a plugin system so that developers in the community can easily add support for collecting metrics.
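Telegraf is plugin-driven and has four distinct plugin types:
Input plugins collect metrics from the system, services, or third-party APIs.
Processor plugins transform, decorate, and filter metrics.
Aggregator plugins create aggregate metrics (for example mean, min, and max).
Output plugins write metrics to various destinations.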
There are many benefits of Telegraf, including the fact that plugins are integrated into the core (this means no competing plugins for the same tech), native support for dimensions/tags, good Docker support and an active support community.
To install Telegraf on a *nix host, follow these simple steps:
yum install go
yum install dep
yum install git
go get -d github.com/influxdata/telegraf
cd "$HOME/go/src/github.com/influxdata/telegraf"
make
Then, from "$HOME/go/src/github.com/influxdata/telegraf":
# Generate a config
./telegraf config > telegraf.conf
# Test it out
./telegraf --config telegraf.conf --test
# Run it
./telegraf --config telegraf.conf
# Run with a Splunk-specific config, limited to the cpu and mem inputs and the http output
./telegraf --config splunk.conf --input-filter cpu:mem --output-filter http
# Run with every configured input, sending only to the http output
./telegraf --config splunk.conf --output-filter http
The splunkmetric serializer formats and outputs metric data in a form that a Splunk metrics index can consume. It can write to a file using the file output, or send metrics to an HTTP Event Collector (HEC) using the standard Telegraf HTTP output.
If you're using the HTTP output, this serializer knows how to batch the metrics so you don't end up with an HTTP POST per metric.
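To make the batching idea concrete, here is a minimal Python sketch of how several metric events can share one request body. It is not the serializer's actual code: the function name `build_hec_batch`, the sample events, and the newline separator are assumptions made for illustration, relying only on the fact that HEC accepts multiple JSON event objects stacked in a single POST body.

```python
import json


def build_hec_batch(events):
    """Join several HEC event dicts into a single request body.

    Each event is serialized as its own JSON object; the objects are
    stacked one per line (the separator is an assumption of this sketch).
    """
    return "\n".join(json.dumps(event, sort_keys=True) for event in events)


# Two hypothetical metric events in the HEC metric format
events = [
    {"time": 1529708430, "event": "metric", "host": "host-a",
     "fields": {"metric_name": "cpu.usage_user", "_value": 0.6}},
    {"time": 1529708430, "event": "metric", "host": "host-a",
     "fields": {"metric_name": "cpu.usage_system", "_value": 0.2}},
]

body = build_hec_batch(events)
print(body)  # one body carrying both events
```

The point is simply that one POST can carry a whole flush interval's worth of metrics rather than one request per metric.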
An example config to shoot metrics to the HTTP Event Collector would look like this:
[[outputs.http]]
  ## URL is the address to send metrics to
  url = "https://x.x.x.x:8088/services/collector"

  ## Timeout for HTTP message
  # timeout = "5s"

  ## HTTP method, one of: "POST" or "PUT"
  # method = "POST"

  ## HTTP Basic Auth credentials
  # username = "username"
  # password = "pa$$word"

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "splunkmetric"
  ## Provides time, index, source overrides for the HEC
  splunkmetric_hec_routing = true

  ## Additional HTTP headers
  [outputs.http.headers]
    # Should be set manually to "application/json" for json data_format
    Content-Type = "application/json"
    Authorization = "Splunk f8xxxxd3-4xx1-4xx2-aeda-86xxxxxb36c"
    X-Splunk-Request-Channel = "f8xxxxx3-4xx1-4xx2-aeda-8xxxxxxx6c"
Then customize any per-input settings that differ from the global Telegraf settings, such as the index you'd like to send a certain metric to:
[[inputs.cpu]]
  percpu = false
  totalcpu = true
  [inputs.cpu.tags]
    index = "cpu_metrics"
This setup will result in metrics that look like:
{ "time": 1529708430, "event": "metric", "host": "patas-mbp", "fields": { "_value": 0.6, "cpu": "cpu0", "dc": "mobile", "metric_name": "cpu.usage_user", "user": "ronnocol" } }
In this example, cpu, dc, and user are dimensions of the metric.
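As a small illustration of how to read such an event, the Python sketch below separates the metric name and value from the dimensions. The helper `split_metric` and the `RESERVED` set are hypothetical names for this example; the only rule it assumes is the one visible in the sample event: inside `fields`, everything other than `metric_name` and `_value` is a dimension.

```python
# Keys inside "fields" that describe the measurement itself (assumed from the sample)
RESERVED = {"metric_name", "_value"}


def split_metric(event):
    """Return (metric_name, value, dimensions) for one HEC metric event."""
    fields = event["fields"]
    dimensions = {k: v for k, v in fields.items() if k not in RESERVED}
    return fields["metric_name"], fields["_value"], dimensions


# The sample event from the post
event = {
    "time": 1529708430,
    "event": "metric",
    "host": "patas-mbp",
    "fields": {
        "_value": 0.6,
        "cpu": "cpu0",
        "dc": "mobile",
        "metric_name": "cpu.usage_user",
        "user": "ronnocol",
    },
}

name, value, dims = split_metric(event)
print(name, value, sorted(dims))
```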
An alternative to using HEC is to output Telegraf metrics to file, using an output configuration such as:
# Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["/tmp/metrics.out"]

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "splunkmetric"
  splunkmetric_hec_routing = false
Then, using the Splunk Universal Forwarder, you'll be able to tail this file and send the metrics directly to an indexer. A sample event using this configuration is as follows:
{ "_value": 0.6, "cpu": "cpu0", "dc": "mobile", "metric_name": "cpu.usage_user", "user": "ronnocol", "time": 1529708430 }
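Comparing the two samples, the file format is the flat contents of `fields` plus `time`, without the HEC envelope. The Python sketch below shows that relationship by wrapping a flat line back into the envelope shape. It is only an illustration: `to_hec_event` is a hypothetical helper, and the host value is passed in by hand because the flat sample does not carry one.

```python
import json


def to_hec_event(line, host):
    """Wrap one flat splunkmetric line in a HEC-style envelope."""
    record = json.loads(line)
    timestamp = record.pop("time")  # time moves out of the field set
    return {"time": timestamp, "event": "metric", "host": host, "fields": record}


# The flat sample event from the file output
line = ('{"_value": 0.6, "cpu": "cpu0", "dc": "mobile", '
        '"metric_name": "cpu.usage_user", "user": "ronnocol", "time": 1529708430}')

hec_event = to_hec_event(line, host="patas-mbp")
print(json.dumps(hec_event, indent=2))
```

When the Universal Forwarder tails the file, nothing like this conversion is needed; the props.conf settings tell Splunk how to read the flat JSON directly.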
Use this example props.conf to format the metrics correctly:
[telegraf]
category = Metrics
description = Telegraf Metrics
pulldown_type = 1
DATETIME_CONFIG =
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = true
disabled = false
INDEXED_EXTRACTIONS = json
KV_MODE = none
TIMESTAMP_FIELDS = time
TIME_FORMAT = %s.%3N
If you want to use Telegraf with the Splunk App for Infrastructure, the following updates to your telegraf.conf file will make system metrics compatible with the App.
[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false

[[outputs.file]]
  files = ["stdout", "/tmp/metrics.out"]
  data_format = "splunkmetric"

[[processors.rename]]
  [[processors.rename.replace]]
    field = "usage_idle"
    dest = "idle"
  [[processors.rename.replace]]
    field = "usage_interrupt"
    dest = "interrupt"
  [[processors.rename.replace]]
    field = "usage_nice"
    dest = "nice"
  [[processors.rename.replace]]
    field = "usage_softirq"
    dest = "softirq"
  [[processors.rename.replace]]
    field = "usage_steal"
    dest = "steal"
  [[processors.rename.replace]]
    field = "usage_system"
    dest = "system"
  [[processors.rename.replace]]
    field = "usage_user"
    dest = "user"
  [[processors.rename.replace]]
    field = "usage_wait"
    dest = "wait"
  [[processors.rename.replace]]
    field = "usage_guest"
    dest = "guest"
  [[processors.rename.replace]]
    field = "usage_guest_nice"
    dest = "guest_nice"
  [[processors.rename.replace]]
    field = "usage_iowait"
    dest = "wait"
  [[processors.rename.replace]]
    field = "usage_irq"
    dest = "interrupt"
  [[processors.rename.replace]]
    field = "io_time"
    dest = "io_time.io_time"
  [[processors.rename.replace]]
    field = "weighted_io_time"
    dest = "io_time.weighted_io_time"
  [[processors.rename.replace]]
    field = "read_time"
    dest = "time.read"
  [[processors.rename.replace]]
    field = "write_time"
    dest = "time.write"
  [[processors.rename.replace]]
    field = "reads"
    dest = "ops.read"
  [[processors.rename.replace]]
    field = "writes"
    dest = "ops.write"
  [[processors.rename.replace]]
    field = "iops_in_progress"
    dest = "pending_operations"
  [[processors.rename.replace]]
    field = "read_bytes"
    dest = "octets.read"
  [[processors.rename.replace]]
    field = "write_bytes"
    dest = "octets.write"
  [[processors.rename.replace]]
    field = "bytes_recv"
    dest = "octets.rx"
  [[processors.rename.replace]]
    field = "bytes_sent"
    dest = "octets.tx"
  [[processors.rename.replace]]
    field = "drop_in"
    dest = "dropped.rx"
  [[processors.rename.replace]]
    field = "drop_out"
    dest = "dropped.tx"
  [[processors.rename.replace]]
    field = "err_in"
    dest = "errors.rx"
  [[processors.rename.replace]]
    field = "err_out"
    dest = "errors.tx"
  [[processors.rename.replace]]
    field = "packets_recv"
    dest = "packets.rx"
  [[processors.rename.replace]]
    field = "packets_sent"
    dest = "packets.tx"
  [[processors.rename.replace]]
    field = "load1"
    dest = "shortterm"
  [[processors.rename.replace]]
    field = "load5"
    dest = "midterm"
  [[processors.rename.replace]]
    field = "load15"
    dest = "longterm"

[[inputs.cpu]]
  percpu = true

[[inputs.disk]]
  name_override = "df"

[[inputs.diskio]]
  name_override = "disk"

[[inputs.mem]]
  name_override = "memory"

[[inputs.system]]
  name_override = "load"
Check out the Splunk App for Infrastructure, and shout out to Splunker Nick Tankersley for providing the renames.
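To see what the renames and name overrides do to the metric names that reach Splunk, here is a rough Python sketch. It assumes, based on the earlier sample where the cpu input's usage_user field appears as cpu.usage_user, that the metric name is the measurement name joined to the field name with a dot. The function `splunk_metric_name` and the two lookup tables are hypothetical and cover only a handful of the entries from the config.

```python
# A few of the field renames from the processors.rename section
FIELD_RENAMES = {
    "usage_user": "user",
    "usage_idle": "idle",
    "reads": "ops.read",
    "bytes_recv": "octets.rx",
    "load1": "shortterm",
}

# The name_override settings, keyed by input plugin name
NAME_OVERRIDES = {
    "disk": "df",
    "diskio": "disk",
    "mem": "memory",
    "system": "load",
}


def splunk_metric_name(measurement, field):
    """Sketch of the resulting metric_name: overridden measurement + renamed field."""
    measurement = NAME_OVERRIDES.get(measurement, measurement)
    field = FIELD_RENAMES.get(field, field)
    return f"{measurement}.{field}"


for measurement, field in [("cpu", "usage_user"), ("system", "load1"),
                           ("diskio", "reads"), ("mem", "used")]:
    print(f"{measurement}.{field} -> {splunk_metric_name(measurement, field)}")
```

Fields without a rename entry, such as used in the last example, keep their original name and only pick up the overridden measurement prefix.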
Of course, you should also check out the new logs to metrics interface in Splunk Enterprise 7.2, as well as some of the other new capabilities to search metrics via the Metrics Workbench!