Very non-scientific research recently revealed that discussing the nuances of the Splunk Common Information Model (CIM) is more effective than Ambien at curing Splunkers' insomnia. In fact, we once caused marked drowsiness across the entire state of Rhode Island when we accidentally broadcast an internal podcast on the subject over the local NPR affiliate.
That said, with the recent release of CIM 4.12, we packed in a number of exciting changes. My favorite is definitely the addition of an “Endpoint” data model. So, be you a threat hunter, an Endpoint Detection and Response (EDR) fan, or just curious about why we made any changes at all, this is the blog post for you. Make yourself a strong espresso, and onward, dear reader!
Before getting into the technical details, I want to take a moment to discuss why we did it and why it matters. The main driver was, of course, our customers. By providing a data model that works for modern endpoint solutions, we can provide more leverage for better detection and to enable hunt capabilities in customer environments. The second driver is our unquenchable thirst for innovation. As we see technologies become more widely adopted by security teams big and small, we provide tools to augment those technologies. Data models are one of many arrows in the Splunk quiver, and in this particular case it just made sense to do as we’ve done with other technology domains.
As far as what we actually did around endpoint data, the idea was to take two data models that have historically been used for “endpoint-like” data (the Application State and Change Analysis models) and combine them, then add some of the extra goodness that EDR solutions provide: parent/child process relationships, process hashes, integrity levels, etc. By combining these datasets together, we were able make the process of mapping raw events from EDR solutions to a Splunk data model more straightforward, and ultimately provide a way to abstract away the specific technology from the data and use-cases.
For those of you who have read the excellent documentation (thanks, Loria!), you'll also notice there are actually five different datasets in this model, all powered by separate base searches. Long story short, we discovered in our testing that accelerating five separate base searches is more performant than accelerating just one massive model. The practical implications are that you will want to get familiar with tstats append=t' (requisite David Veuve reference: "How to Scale: From _raw to tstats [and beyond!])
To hit home the potential impact of these changes, let’s borrow a page from James Brodsky’s .conf17 session, "Splunking the Endpoint Part III: Hands-On with BOTS Data!".
One of the searches in the detailed guide (“APT STEP 8 – Unusually long command line executions with custom data model!”), leverages a modified “Application State” data model:
| tstats values(all_application_state.process) as command
FROM datamodel="Application_State" where (host=venus OR
host=wrk-*) AND (all_application_state.commandline != *chrome*
AND all_application_state.commandline != *firefox*) groupby host
| mvexpand command
| eval cmdlen=len(command)
| eventstats stdev(cmdlen) as stdev,avg(cmdlen) as avg by host
| stats max(cmdlen) as maxlen, values(stdev) as stdevperhost,
values(avg) as avgperhost by host,command
| where maxlen>4*(stdevperhost)+avgperhost
With the new Endpoint model, it will look something like the search below. Note that we’re populating the “process” field with the entire command line. For all you Splunk admins, this is a props.conf change you’ll want to make with your sourcetypes, given it’s now a calculated field in the data model!).
| tstats values(Processes.process) as command
FROM datamodel="Endpoint"."Processes" WHERE (Endpoint.Proceses.process != *chrome*
AND Endpoint.Processes.process != *firefox*) groupby host
| mvexpand command
| eval cmdlen=len(command)
| eventstats stdev(cmdlen) as stdev,avg(cmdlen) as avg by host
| stats max(cmdlen) as maxlen, values(stdev) as stdevperhost,
values(avg) as avgperhost by host,command
| where maxlen>4*(stdevperhost)+avgperhost
While this is a fairly trivial example, the point is that endpoint telemetry is now a first-class dataset in Splunk. For those of you who use the wonderful and highly-useful TA-Sysmon, the illustrious Dave Herrald and Bhavin Patel have added support for the new Endpoint model in v8.1.0.
Now that we have this great new Endpoint model, what’s next? We have some updates planned in the future to help solve some use cases that have already popped up since the release. However, while we did add some backwards compatibility with the tags and event types for Application_State and Change_Analysis*, you should update those props and transforms to ensure you’re taking full advantage of all the new fields. Speaking of which, I would highly suggest a readup of how the Splunk Security Research Team is leveraging this new capability in their blog post, "Get More Flexibility and Accelerated Searches with the New Endpoint Data Model."
*Quick note on the topic of backwards compatibility, you will notice in the documentation that Change Analysis and Application State are marked as "deprecated." CIM 4.12 absolutely still ships with these, but ES has acceleration turned off by default, in lieu of accelerating the new Endpoint and Change models.
Hopefully you’ll find the new Endpoint data model valuable and useful in service to all your endpoint detection and threat-hunting needs. For the list of other changes in CIM 4.12, you can find them in the release notes.
Lastly, some thank yous to the larger Splunker community for all their help during this "process":
----------------------------------------------------
Thanks!
Kyle Champlin
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company — with over 7,500 employees, Splunkers have received over 1,020 patents to date and availability in 21 regions around the world — and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.