This is a pretty common question in Splunkland. Maybe you’re an admin wondering how much license you’ll need to handle this new data source you have in mind for a great new use case. Or you’re a Splunker trying to answer this question for a customer. Or a partner doing the same. Given how often this comes up, I thought I’d put together an overview of all the ways you can approximate how big a license you need, based on a set of data sources. This post brings together the accumulated wisdom of many of my fellow sales engineers, so before I begin, I’d like to thank them all for the insights and code they shared so willingly. Thank you for your help, everyone!
All right. So, broadly, you have these options:
- Ask someone how much data there is
- Measure the data
  - At the original source
  - Outside the original data source
- Estimate the size of the data
  - Bottom-up, based on “samples” of data from various sources
  - Top-down, based on total data volumes of similar organizations
Let’s take a look at each of these in turn.
Ask someone how much data there is
Doesn’t hurt, right? Sometimes admins will actually have some of this information. Many times they will not. If they do, one approach is to take these rough estimates, add a buffer, and use that for your license size. The buffer helps account for inaccurate measurements (people make mistakes), out-of-date measurements, incomplete measurements and changing environments. This is actually a viable option if your environment is relatively static, homogeneous across sites, and can’t be measured for some reason – technical or logistical. The buffer also gives you some room to grow as you start to get comfortable with analyzing machine data. Typically, when people see what they can start to do with Splunk, they like to do more!
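Purely as an illustration of the buffer math (every number here is hypothetical), you can sanity-check it with a quick search on any Splunk instance:
| makeresults
| eval reported_gb_per_day=30, buffer_pct=40
| eval license_gb_per_day=ceiling(reported_gb_per_day * (1 + buffer_pct/100))
Thirty reported gigabytes a day with a 40% buffer lands you at 42 GB/day, which you would then round to whatever license size makes sense.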
Where we have seen this start to break down is in some very large organizations with data all over the world. If you are looking at massive amounts of data – terabytes unevenly distributed across different sites – rough numbers from one site will not cut it; they may not be representative of the others. Add to that the data that hasn’t been factored into the old numbers because only file sizes were measured (they were the easiest to get), while more data is collected via APIs (i.e. it doesn’t exist until you do the pull), and more still never lands in a log file at all (performance data constantly streaming in, for example). Rough numbers don’t quite work in this situation.
If you do need to go beyond the rough numbers or have none, you get into the options below – measuring the data properly, or estimating it.
Measure the data
If you can do it economically and well, measuring is the best option you have. It gives you the best chance of coming up with a license that is neither oversized nor undersized. Some of the variables in the Estimation section below should be kept in mind, however, as you want to be sure you’re getting a measurement that’s truly representative.
So how do you actually do this measurement?
Measuring at the source
If you’re measuring at the original source, the exact way you’ll do this depends on the source itself. Considering the vast universe of data sources that Splunk can take in, there really isn’t a quick trick that works for everything, unfortunately. That isn’t to say it can’t be done at the source – far from it – there are often easy-to-access ways to find this information. Native tools work just fine for files on disk. Just make sure you’re measuring uncompressed data. If you happen to have a collection of HOWTOs for different platforms, I’d love to see them.
Measuring at the source can have several advantages. It can be very accurate (read the estimation section below to see why I say “can”). You don’t have to set up a way to pull or push the data to something else. You don’t have to think about discarding the data after it’s been measured. You do have to measure it individually everywhere, however, which can get out of hand if you have a lot of sources of similar types, say desktops. In those cases, you can do some representative measurements, then extrapolate to the number of systems you have. This extrapolation can be less accurate than a true measurement obtained by collecting it all, but gets you closer.
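As a rough sketch of that extrapolation (the numbers are made up), measure a representative handful of systems and scale up:
| makeresults
| eval sample_hosts=25, sample_total_mb_per_day=500, total_hosts=5000
| eval est_gb_per_day=round((sample_total_mb_per_day/sample_hosts)*total_hosts/1024,1)
Here, 25 sampled desktops producing 500 MB/day between them extrapolate to roughly 98 GB/day across 5,000 desktops.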
Measuring outside the source
Since you’re usually trying to measure more than one data source, you can also consider sending them all to a central point where they’ll be collected, and you’ll get an idea of how different sources compare to each other. (You might find some surprises here). As you’d expect, Splunk can do these calculations easily. I’ll go into how shortly. It’s also something a syslog aggregator could do if your data is all coming in over syslog, for example.
If you decided to use a temporary Splunk install for this, there are a few things to keep in mind first.
Get a trial license
Splunk sales can generate a trial license for the duration of your test that can handle a large volume thrown at it. If you don’t have that option, a free Splunk license will also work, but you have to jump through a few hoops there that probably aren’t worth the effort. Since the trial option is so much easier, that’s by far the recommended approach.
Don’t touch it on the way in
Since we’re simply measuring the size of the data, there’s no reason to transform or parse it in any way before bringing it in. No props and transforms magic. At the very least, keep it to a bare minimum and just pull it in. If you’re doing this for a customer (you’re a Splunker or partner), it’s important to explain the process so everyone understands that this isn’t the start of a production install. When you get to production, you might want to do any number of things to your data on the way in – filter out some, mask portions of it, route it to certain indices, extract fields ahead of time. None of this is necessary when you’re simply counting bytes.
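On the test instance, the inputs can stay as bare-bones as possible. Here’s a minimal sketch (the monitor path is just a placeholder) that simply pulls files in and points them at the scratch index defined in the indexes.conf example below:
# inputs.conf - no props, no transforms, just collect and count
[monitor:///var/log/sample_source]
index = devnull
disabled = false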
Make sure that retention on the “_internal” indexes is long enough for the estimation period.
The _internal indexes contain the information we’ll be looking at, so they have to stay around.
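If you need to stretch that retention, a sketch of the override in indexes.conf (7776000 seconds is roughly 90 days; size it to cover your whole estimation window):
# Keep _internal data around long enough to cover the estimation period
[_internal]
frozenTimePeriodInSecs = 7776000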
Throw it away afterwards
There’s no reason to keep the data around any longer than necessary. This isn’t production, and throwing it away also helps you get away with a lower class of hardware for this exercise. It won’t need to support consistently writing large volumes of data to disk and searching by multiple users, after all. To discard the data, route it to an index that doesn’t keep data around very long, i.e. one that rolls it to frozen quickly. Say you want to keep the data around no longer than a day (86400 seconds). Set up a dedicated scratch index like this in your indexes.conf.
# Use a dummy index so you don't hose your main index in case you're doing this on an existing Splunk installation.
[devnull]
# Location of Hot/Warm Buckets
homePath = $SPLUNK_DB/devnull/db
# Location of Cold Buckets
coldPath = $SPLUNK_DB/devnull/colddb
# Location of Thawed Buckets
thawedPath = $SPLUNK_DB/devnull/thaweddb
# Maximum hot buckets that can exist per index.
maxHotBuckets = 1
# The maximum size in MB for a hot DB to reach before a roll to warm is triggered.
maxDataSize = 3
# The maximum number of warm buckets.
maxWarmDBCount = 1
# Roll the indexed data to frozen after 1 day (86400 = 60 secs * 60 mins * 24 hours).
frozenTimePeriodInSecs = 86400
This will roll the data to frozen after 86400 seconds. If you haven’t specified a path for the frozen data, it will be deleted.
The maxTotalDataSizeMB parameter also controls when data is rolled to frozen, and does it based on the size of the data. Set it to a very large value to control the roll strictly by time. Learn about this behavior under setting a retirement and archiving policy.
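For example, adding something like this to the [devnull] stanza above keeps the size-based roll out of the picture (the value is arbitrary, just comfortably larger than anything you expect to ingest during the test):
# Keep the size cap high so frozenTimePeriodInSecs, not size, triggers the roll to frozen
maxTotalDataSizeMB = 500000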
So, assuming you’ve done all this, you’re ready to start measuring. Use the searches at the bottom of this page to do that. Lots of goodies there…modify as you need to.
Estimate the size of the data
As a last resort, you can fall back to estimating the data.
Warning – this method can be inaccurate and make people unhappy if the resulting license sizes turn out to be too small or too big. When relaying the information, please communicate that though you’re doing your *very best* to estimate, the practice comes with caveats they should keep in mind. Seeing hard numbers can lead people to think there was more science behind them than there was. If you don’t believe that (but you probably do), go read How to Lie with Statistics by Darrell Huff. It’ll be a bit of an eye-opener.
There are many good reasons to fall back on estimates. Maybe you’re in a time crunch and don’t have the time to measure things. Or there are no resources to do so. It could also be a question of budgets – this is how much I can afford at this time. And sometimes, the data is sensitive (or you aren’t sure) and there’s no will to temporarily pull it out to a centralized collector like Splunk to measure – even if you’re doing it on-premise. In these situations, balance it out against the cost of estimating wrongly, and if you still want to go ahead, you have a couple options to estimate.
Estimating based on data samples
These samples can come from samples Splunk has collected previously (which we have), ones the actual environment has collected (but not measured yet), from mobile apps like LogCaliper, or from the vendors of the data sources themselves. To use the Splunk-collected samples, Splunkers will already be aware of the options here. There are various spreadsheets and an app based upon the best of these. Partners – reach out. We also highly encourage submitting anonymized data samples to improve the value of these samples, and we make it easy!
So why are samples something to be careful of? Simply put, they might not reflect your environment since they’re based on another one. Log volumes, even with the same product family, can vary widely depending on a number of factors:
- Logging severity levels. Say you’re talking about Cisco devices. There’s a big difference in the amount of information between the lowest (0 – emergency) and highest (7 – debugging, or more likely 6 – informational) levels. Are you logging simply which flows are allowed/denied, or going up the chain to user activity?
- What the device is doing. Logs are a reflection of work performed. If this is a firewall, how much traffic is it seeing? The more it does, the more it will log, even if the severity level stays the same. Simply knowing the number of firewalls you have isn’t enough, though that’s often the first piece of information that comes back. How many flows are they seeing per second? Couple this with something like a netflow calculator if you’re looking at netflow, and you start to get somewhere (there’s a small worked calculation just after this list).
- Services enabled.
- Types of events that are being logged. (Side note: Are you logging everything you should to help you in case of a security breach? Check out some excellent guidance from the NSA, and Malware Archaeology’s great talk at the 2015 Splunk .conf user conference.)
- How many things get logged. Say it’s Tripwire data. Change notifications depend on how many Tripwire agents there are in the environment and what rules are set to fire against them.
- What’s IN a thing that gets logged. Say it’s Tripwire again. If you’re doing a lot with Windows GPO changes or AD changes, your change notifications get bigger.
- Custom logs. You can add your own pieces to existing logs sometimes, such as with a BlueCoat proxy.
- How long you’re measuring for. Volumes can ebb and flow over the course of a week.
- Where the data is coming from. Some data sources are just more talkative than others. Carbon Black, for example.
…And so on. You get the picture. There are more examples on this Splunk Answer. (For completeness, I linked back to this post from there, so if you see a circular reference, you’re not imagining things). The point is – any sample you’re basing your estimate on is just that – a sample. You can attempt to average them over time, which helps. Just remember that your own environment can still be different. If this is the best you can do, run with it. Decisions are made with less than perfect information every day.
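To make the flows-per-second point above concrete, here’s a back-of-the-envelope calculation you can run anywhere (both numbers are hypothetical; plug in your own flow rate and average event size):
| makeresults
| eval flows_per_sec=1000, bytes_per_event=250
| eval est_gb_per_day=round(flows_per_sec * bytes_per_event * 86400 / 1024 / 1024 / 1024, 1)
A thousand flows per second at roughly 250 bytes per event works out to about 20 GB/day before you’ve looked at anything else.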
Estimating based on the usage of similar-sized organizations
In contrast to the above, this takes a top-down, wholesale look at your data needs as an organization. The hypothesis is that you have a similar setup to others your size, and similar data sizes. This is a faster approach than any of the above, since it abstracts away all detail in favor of the assumption that you can’t be *that* different from the others. The challenge, of course, is in defining who those others are. This is something you will probably need consultative help from Splunk on. A certain level of sanity checking would be needed here to see if you really can use similar assumptions, even if you happen to be outwardly similar. You might operate quite differently.
Appendix: Searches for measuring data ingested
****************************************************************************
Using Raw Data Sizing and Custom Search Base
These searches use the len() eval function to get the size of the raw
event, using a custom base search for a specific type of data.
****************************************************************************
NOTE: Just replace "EventCode" and "sourcetype" with values corresponding to the type of data that you are looking to measure.
=====================================================
Simple Searches:
=====================================================
Indexed Raw Data Size by host By Day:
-------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, host
| eval evt_bytes = len(_raw)
| timechart span=1d sum(eval(evt_bytes/1024/1024)) AS TotalMB by host
Indexed Raw Data Size by sourcetype By Day:
-------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, sourcetype
| eval evt_bytes = len(_raw)
| timechart span=1d sum(eval(evt_bytes/1024/1024)) AS TotalMB by sourcetype
Indexed Raw Data Size by Windows EventCode By Day:
--------------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, EventCode
| eval evt_bytes = len(_raw)
| timechart span=1d limit=10 sum(eval(evt_bytes/1024/1024)) AS TotalMB by EventCode useother=false
Avg Event count/day, Avg bytes/day and Avg event size by sourcetype:
--------------------------------------------------------------------
index=_internal kb group="per_sourcetype_thruput"
| eval B = round((kb*1024),2)
| stats sum(ev) as count, sum(B) as B by series, date_mday
| eval aes = (B/count)
| stats avg(count) as AC, avg(B) as AB, avg(aes) as AES by series
| eval AB = round(AB,0)
| eval AC = round(AC,0)
| eval AES = round(AES,2)
| rename AB as "Avg bytes/day", AC as "Avg events/day", AES as "Avg event size"
Avg Event count/day, Avg bytes/day and Avg event size by source:
----------------------------------------------------------------
index=_internal kb group="per_source_thruput"
| eval B = round((kb*1024),2)
| stats sum(ev) as count, sum(B) as B by series, date_mday
| eval aes = (B/count)
| stats avg(count) as AC, avg(B) as AB, avg(aes) as AES by series
| eval AB = round(AB,0)
| eval AC = round(AC,0)
| eval AES = round(AES,2)
| rename AB as "Avg bytes/day", AC as "Avg events/day", AES as "Avg event size"
=====================================================
Combined Hosts and Sourcetypes:
=====================================================
Top 10 hosts and Top 5 sourcetypes for each host by Day:
--------------------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, host, sourcetype
| eval evt_bytes = len(_raw)
| eval day_period=strftime(_time, "%m/%d/%Y")
| stats sum(evt_bytes) AS TotalMB, count AS Total_Events by day_period,host,sourcetype
| sort day_period
| eval TotalMB=round(TotalMB/1024/1024,4)
| eval Total_Events_st=tostring(Total_Events,"commas")
| eval comb="| - (".round(TotalMB,2)." MB) for ".sourcetype." data"
| sort -TotalMB
| stats list(comb) AS subcomb, sum(TotalMB) AS TotalMB by host, day_period
| eval subcomb=mvindex(subcomb,0,4)
| mvcombine subcomb
| sort -TotalMB
| eval endcomb="|".host." (Total - ".round(TotalMB,2)."MB):".subcomb
| stats sum(TotalMB) AS Daily_Size_Total, list(endcomb) AS Details by day_period
| eval Daily_Size_Total=round(Daily_Size_Total,2)
| eval Details=mvindex(Details,0,9)
| makemv delim="|" Details
| sort -day_period
Top 10 Hosts and Top 5 Windows Event IDs by Day:
--------------------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, host, EventCode
| eval evt_bytes = len(_raw)
| eval day_period=strftime(_time, "%m/%d/%Y")
| stats sum(evt_bytes) AS TotalMB, count AS Total_Events by day_period,host,EventCode
| sort day_period
| eval TotalMB=round(TotalMB/1024/1024,4)
| eval Total_Events_st=tostring(Total_Events,"commas")
| eval comb="| - (".round(TotalMB,2)." MB) for EventID- ".EventCode." data"
| sort -TotalMB
| stats list(comb) AS subcomb, sum(TotalMB) AS TotalMB by host, day_period
| eval subcomb=mvindex(subcomb,0,4)
| mvcombine subcomb
| sort -TotalMB
| eval endcomb="|".host." (Total - ".round(TotalMB,2)."MB):".subcomb
| stats sum(TotalMB) AS Daily_Size_Total, list(endcomb) AS Details by day_period
| eval Daily_Size_Total=round(Daily_Size_Total,2)
| eval Details=mvindex(Details,0,9)
| makemv delim="|" Details
| sort -day_period
**************************************************************************
Licensing/Storage Metrics Source
The below searches look against the internally collected licensing metrics
and the introspection data. License usage comes from license_usage.log,
which is indexed into the _internal index; storage sizes come from the
_introspection index.
**************************************************************************
============================================
Splunk Index License Size Analysis
============================================
Percent used by each index:
---------------------------
index=_internal source=*license_usage.log type=Usage
| fields idx, b
| rename idx AS index_name
| stats sum(eval(b/1024/1024)) as Total_MB by index_name
| eventstats sum(Total_MB) as Overall_Total_MB
| sort -Total_MB
| eval Percent_Of_Total=round(Total_MB/Overall_Total_MB*100,2)."%"
| eval Total_MB = tostring(round(Total_MB,2),"commas")
| eval Overall_Total_MB = tostring(round(Overall_Total_MB,2),"commas")
| table index_name, Percent_Of_Total, Total_MB, Overall_Total_MB
Total MB by index, Day – Timechart:
-----------------------------------
index=_internal source=*license_usage.log type=Usage
| fields idx, b
| rename idx as index_name
| timechart span=1d limit=20 sum(eval(round(b/1024/1024,4))) AS Total_MB by index_name
============================================
Splunk Sourcetype License Size Analysis
============================================
Percent used by each sourcetype:
-------------------------------------------
index=_internal source=*license_usage.log type=Usage
| fields st, b
| rename st AS sourcetype_name
| stats sum(eval(b/1024/1024)) as Total_MB by sourcetype_name
| eventstats sum(Total_MB) as Overall_Total_MB
| sort -Total_MB
| eval Percent_Of_Total=round(Total_MB/Overall_Total_MB*100,2)."%"
| eval Total_MB = tostring(round(Total_MB,2),"commas")
| eval Overall_Total_MB = tostring(round(Overall_Total_MB,2),"commas")
| table sourcetype_name, Percent_Of_Total, Total_MB, Overall_Total_MB
Total MB by sourcetype, Day – Timechart:
-------------------------------------------
index=_internal source=*license_usage.log type=Usage
| fields st, b
| rename st as sourcetype_name
| timechart span=1d limit=20 sum(eval(round(b/1024/1024,4))) AS Total_MB by sourcetype_name
============================================
Splunk host License Size Analysis
============================================
Percent used by each host:
-------------------------------------------
index=_internal source=*license_usage.log type=Usage
| fields h, b
| rename h AS host_name
| stats sum(eval(b/1024/1024)) as Total_MB by host_name
| eventstats sum(Total_MB) as Overall_Total_MB
| sort -Total_MB
| eval Percent_Of_Total=round(Total_MB/Overall_Total_MB*100,2)."%"
| eval Total_MB = tostring(round(Total_MB,2),"commas")
| eval Overall_Total_MB = tostring(round(Overall_Total_MB,2),"commas")
| table host_name, Percent_Of_Total, Total_MB, Overall_Total_MB
Total MB by host, Day – Timechart:
-------------------------------------------
index=_internal source=*license_usage.log type=Usage
| fields h, b
| rename h as host_name
| timechart span=1d limit=20 sum(eval(round(b/1024/1024,4))) AS Total_MB by host_name
============================================
Splunk Index Storage Size Analysis
============================================
Storage Size used by each non-internal index:
-------------------------------------------
index=_introspection component=Indexes NOT(data.name="Value*" OR data.name="summary" OR data.name="_*")
| eval "data.total_size" = 'data.total_size' / 1024
| timechart span=1d limit=10 max("data.total_size") by data.name
Storage Size used by each internal index:
-------------------------------------------
index=_introspection component=Indexes (data.name="Value*" OR data.name="summary" OR data.name="_*")
| eval "data.total_size" = 'data.total_size' / 1024
| timechart span=1d limit=10 max("data.total_size") by data.name
----------------------------------------------------
Thanks!
Vishal Nakra