If you have spent any time searching in Splunk, you have likely done at least one search using the stats command. I won’t belabor the point: stats is a crucial capability in the context of threat hunting — it would be a crime to not talk about it in this series.
When focusing on data sets of interest, it's very easy to use the stats command to perform calculations on any of the returned field values to derive additional information. When I say stats, I am referring to three commands: stats, eventstats and streamstats.
Like many Splunk commands, all three operate on a result set, performing statistical functions on the data within it.
Let’s dive into stats.
(Part of our Threat Hunting with Splunk series, this article was originally written by John Stoner. We’ve updated it recently to maximize your value.)
The stats command is a fundamental Splunk command. It will perform any number of statistical functions on a field, which could be as simple as a count or average, or something more advanced like a percentile or standard deviation.
Using the keyword by within the stats command can group the statistical calculation based on the field or fields listed.
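As a quick sketch (reusing the fgt_traffic sourcetype from the examples that follow), a single stats invocation can run several functions at once, grouped by a field:

sourcetype=fgt_traffic | stats count avg(bytes_out) as avg_bytes_out max(bytes_out) as max_bytes_out by src

Each row of the output is one src value with its event count and its average and maximum bytes_out.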
Here is a good basic example of how to apply the stats command during hunting. I might hypothesize that the source-destination pairs with the largest number of connections originating in a specific netblock are worth digging deeper into.
sourcetype=fgt_traffic src=192.168.225.* NOT (dest=192.168.* OR dest=10.* OR dest=8.8.4.4 OR dest=8.8.8.8 OR dest=224.*) | stats count by src dest | where count > 1 | sort - count
The search is looking at the firewall data originating from the 192.168.225.0/24 netblock and going to destinations that are not internal or DNS. The stats command is generating a count, grouped by source and destination address. Once the count is generated, that output can be manipulated to get rid of single events and then sorted from largest to smallest.
Another use for stats is to sum values together. A hypothesis might be to look at firewall traffic to understand who my top talkers to external hosts are, not from a connection perspective, but from a byte perspective. Using the stats command, multiple fields can be calculated, renamed and grouped.
sourcetype=fgt_traffic src=192.168.225.* NOT (dest=192.168.* OR dest=10.* OR dest=8.8.4.4 OR dest=8.8.8.8 OR dest=224.*) | stats sum(bytes_in) as total_bytes_in sum(bytes_out) as total_bytes_out by src dest | table src dest total_bytes_in total_bytes_out | sort - total_bytes_out
In this example, the same data sets are used, but this time the stats command is used to sum the bytes_in and bytes_out fields. By changing the sort field, I can easily pivot between views such as the top talkers by bytes out and the top receivers by bytes in.
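As a sketch of that pivot, the same search can simply end with a different sort field to surface the top receivers instead of the top talkers:

sourcetype=fgt_traffic src=192.168.225.* NOT (dest=192.168.* OR dest=10.* OR dest=8.8.4.4 OR dest=8.8.8.8 OR dest=224.*) | stats sum(bytes_in) as total_bytes_in sum(bytes_out) as total_bytes_out by src dest | table src dest total_bytes_in total_bytes_out | sort - total_bytes_in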
As a side note, if I saw the result set above I might ask why I am seeing many hosts from the same subnet all communicating to the same destination IP, with identical byte counts, both in and out. The point is that there are numerous ways to leverage stats.
With these fundamentals in place, let’s apply these concepts to eventstats. I like to think of eventstats as a method to calculate “grand totals” within a result set that can then be used to manipulate these totals to introspect the data set further.
Another hypothesis I might want to pursue is identifying and investigating the systems with the largest byte counts leaving the network. To effectively hunt, I want to know two key things: the total volume of bytes leaving each system, and the individual connections that contribute to that total.
Using the same basic search criteria as the earlier search, we augment it slightly to exclude events where bytes_out is zero, keeping the result set cleaner. Eventstats calculates the sum of bytes_out, renames it total_bytes_out and groups it by source IP address. That result becomes a field on each event that can be displayed with additional Splunk commands.
sourcetype=fgt_traffic src=192.168.225.* NOT (dest=192.168.* OR dest=10.* OR dest=8.8.4.4 OR dest=8.8.8.8 OR dest=224.*) bytes_out>0 | eventstats sum(bytes_out) AS total_bytes_out by src | table src dest bytes_out total_bytes_out | sort src - bytes_out
In the output, each source IP address's individual bytes_out values sum to its total_bytes_out.
Another hypothesis that I could pursue using eventstats would be to look for systems that have more than 60% of their traffic going to a single destination. If a system is talking nearly exclusively to a single external host, that might be cause for concern or at least an opportunity to investigate further.
Going back to the earlier example that looked for large volumes of bytes_out by source and destination IP addresses, we could evolve this and use eventstats to look at the bytes_out by source as a percentage of the total byte volume going to a specific destination.
sourcetype=fgt_traffic src=192.168.225.* NOT (dest=192.168.* OR dest=10.* OR dest=8.8.4.4 OR dest=8.8.8.8 OR dest=224.*) | eventstats sum(bytes_out) AS total_bytes_out by src | stats sum(bytes_in) AS bytes_in sum(bytes_out) AS bytes_out by src dest total_bytes_out | eval percent_bytes_out = bytes_out/total_bytes_out * 100 | table src dest bytes_in bytes_out total_bytes_out percent_bytes_out | where percent_bytes_out > 60 | sort - percent_bytes_out dest
Building on the previous search criteria, I calculate the eventstats by summing the bytes_out grouped by source IP address to get that “grand total.”
Now I can start transforming that data using stats like I did earlier and grouping by source and destination IP. If I stopped there, I would have the sum of the bytes_in, bytes_out, the total_bytes_out and the source and destination IP.
That’s great — but I need to filter down on the outliers that I'm hypothesizing about.
Using the eval command, the bytes_out and total_bytes_out can be used to calculate a percentage of the overall traffic. At that point, I'm formatting the data using the table command and then filtering down on the percentages that are greater than 60 and sorting the output.
I now have a set of source IP addresses that I can continue to interrogate with the knowledge that a high percentage of their data is going to a single destination. In fact, when I look at my output, I find an interesting outcome: my top 14 source addresses are all communicating with the same external IP address.
That alone might be worth digging into further, or it might be a destination that should be whitelisted using a lookup. Either way, this approach allows me to refine my search and reinforce or disprove my hypothesis.
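As a hedged sketch of that whitelisting idea (the known_good_dests.csv lookup file and its dest field are hypothetical names), a subsearch against a lookup can remove approved destinations before the statistics run:

sourcetype=fgt_traffic src=192.168.225.* NOT [| inputlookup known_good_dests.csv | fields dest] | stats count by src dest | sort - count

The subsearch expands into a list of dest values, and the NOT in front of it filters those destinations out of the results.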
On to streamstats. Streamstats builds upon the basics of the stats command, but it generates statistics as each event is seen. This can be very useful for things like running totals or running averages as data streams into the result set.
If I were to take the results from our earlier hunt, I could further hypothesize that communications outbound from my host occur in bursts. I could then use streamstats to visualize and confirm that hypothesis.
sourcetype=fgt_traffic src=192.168.225.80 NOT (dest=192.168.* OR dest=10.* OR dest=8.8.4.4 OR dest=8.8.8.8 OR dest=224.*) bytes_out>0 | sort date | streamstats sum(bytes_out) as total_bytes_out by src | table date bytes_out total_bytes_out
Building off the previous example, the source IP address 192.168.225.80 generated 77% of its traffic to a specific destination. We could investigate further and look at the data volume over time originating from that address.
The search I start with is the same basic search as the other examples with one exception — the source is no longer a range but a specific address.
Because I would like the information to aggregate on a daily basis, I'm sorting by date. Streamstats is then used to get the sum of the bytes_out, renamed as total_bytes_out and grouped by source IP address. Finally, we table the output, specifically date, bytes_out and the total_bytes_out.
The output can be viewed in a tabular format or visualized, preferably as a line chart or area chart. As you can see from the output, the daily bytes_out added to the previous day’s total_bytes_out will equal today’s total_bytes_out.
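A variation on the daily aggregation (assuming the events carry the standard _time field) bins events by day before computing the running total:

sourcetype=fgt_traffic src=192.168.225.80 bytes_out>0 | bin _time span=1d | stats sum(bytes_out) as daily_bytes_out by _time | streamstats sum(daily_bytes_out) as total_bytes_out

Here bin buckets the events into one-day spans, stats collapses each day into a single daily sum, and streamstats keeps the cumulative total across days.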
Stats, eventstats and streamstats are all very powerful tools for refining a result set and identifying outliers within the environment. While this blog focused on network traffic and used sums and counts, there is no reason not to apply these commands to host-based analysis, or to leverage statistics like standard deviation, median and percentiles.
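As one sketch of those more advanced statistics (field names reused from the firewall examples above), eventstats can compute a mean and standard deviation across the result set, and eval can then flag events sitting more than three standard deviations above the mean:

sourcetype=fgt_traffic bytes_out>0 | eventstats avg(bytes_out) as avg_out stdev(bytes_out) as stdev_out | eval z_score = (bytes_out - avg_out) / stdev_out | where z_score > 3 | table src dest bytes_out z_score

The same pattern works for any numeric field, which makes it a handy template for outlier hunts beyond byte counts.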
Happy hunting!