Hypothesis-driven hunting is probably the most well-known type of threat hunting, and it’s one of the three types defined in the PEAK threat hunting framework. In this article, we’ll walk through a sample hypothesis-driven hunt, step-by-step. For our data, we’ll be using the Boss of the SOC Version 3 (BOTSv3) dataset, which you can use to recreate the hunt and work through it on your own.
Below is a diagram of the hypothesis-driven hunting process. It illustrates the structured steps involved in each phase of a hypothesis-driven hunt within the PEAK framework.
Figure 1: The PEAK Hypothesis-Based Hunting Process
The first step is to select a topic that you are keen to hunt. Topics could come from nearly anywhere: organizational security priorities, suggestions from your CTI team, and identified detection gaps are all great sources. For our example, our topic will be cryptocurrency mining.
Once you have identified a topic of interest, the next step is to research it thoroughly. This research is the foundation for refining your hypothesis and crafting a more informed approach to the hunt. For our example, our research included the following resources:
Building on the research conducted, the next step is to generate a hypothesis that will serve as the basis for your hunt. Make sure your hypothesis is testable so that you can either confirm or refute it while hunting. For this example, our hypothesis is “There might be unauthorized cryptocurrency mining happening on the network.”
With your hypothesis set, it is time to define the scope of your hunt. Scoping includes setting a maximum duration for the hunt and using the Actor, Behavior, Location, Evidence (ABLE) framework to capture the essential elements of your hunting hypothesis. For our example, the scope of our hunt is as follows:
Timeframe: We would normally conduct a hunt across at least a 30-day period. However, there’s only 1 day available in the BOTSv3 data, which necessarily limits our scope.
Maximum Hunt duration: This is the amount of time we expect to spend executing our hunt. Based on our research, we set aside 3 days, but it is not unusual for hunts to last much longer.
ABLE:
Now that your research is complete and your ABLE data is organized, the next step is to plan the approaches you will use to validate your hypothesis. Good planning is essential to a smooth hunt. For our example, here are the approaches we are going to incorporate into our plan, both based on MITRE ATT&CK detections for Resource Hijacking:
In our sample hunt, we are using the Splunk BOTSv3 dataset, where the necessary logs are already ingested into our Splunk instance. Note that in a real hunt, you may also need to decide on how you are going to gather the various data sources mentioned in your scoped hunt.
The first step is to gather the data needed for your hunt. The approach to data collection may vary, especially if your organization already has a SIEM collecting the various data sources into a central location for analysis. Where a SIEM is not in place or does not cover all the required data sources, you might have to identify the specific server(s) and locations on disk from which to collect the data, then manually transfer it to the analysis system. In our example, the data we need has already been loaded into Splunk.
The perfmon data (sourcetype=PerfmonMk:Process) has information about processes running on the system, captured at roughly 10-second intervals by Microsoft’s Performance Monitor. Each event is a periodic snapshot of process data with point-in-time CPU utilization, including fields such as process_name, process_cpu_used_percent, and process_mem_used. This is useful for Approach 1, since we can observe how processes use CPU over time and drill down into any with notably high utilization. For illustration purposes, here are several entries for the same process, a particular instance of MsMpEng.exe, during its lifetime.
Figure 2: Example of a single process utilizing CPU over time
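To get a feel for this data yourself, a quick exploratory search along the following lines can chart a process’s CPU utilization over time. This is a sketch, not part of the hunt proper; the process name filter is illustrative and just matches the process shown in Figure 2.
``` Exploratory sketch: chart average CPU utilization over time for instances of a single process; the process name filter is illustrative ```
index=botsv3 sourcetype="perfmonmk:process" process_name="MsMpEng*"
| timechart span=1m avg(process_cpu_used_percent) by host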
The DNS query data (sourcetype=stream:dns) is more straightforward: it contains records of DNS requests observed on the network, with fields such as dest_ip, src_ip, dest_port, src_port, bytes, and query. We will use it in Approach 2 to find out whether any hosts made connections to known cryptomining domains.
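Before matching against indicators, a simple exploratory search (a sketch using the same index and sourcetype) shows which domains are queried most often, and by which hosts:
``` Exploratory sketch: count DNS queries by requesting host and queried domain ```
index=botsv3 sourcetype="stream:dns"
| stats count by src_ip, query
| sort - count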
There are times when the collected data is not in the optimal state for analysis. This can be because the data has missing values, malformed or corrupted entries, or simply because its format is not compatible with your analysis system (e.g., you need CSV, but it’s in JSON). If the data is already in Splunk, there’s a good chance it has already been cleaned and normalized, though this isn’t guaranteed. This step may require some data cleaning and normalization before you can begin your analysis.
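As a purely hypothetical illustration, if the CPU percentage arrived under a different field name in some events, or some events lacked a process name, a light cleanup pass might look like this (cpu_percent is an invented field name used only for the example):
``` Hypothetical cleanup sketch: normalize an invented alternate CPU field name and drop malformed events ```
index=botsv3 sourcetype="perfmonmk:process"
| eval process_cpu_used_percent=coalesce(process_cpu_used_percent, cpu_percent)
| where isnotnull(process_name) AND isnum(process_cpu_used_percent)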
In our example, the BOTSv3 data is ready for analysis, hence sparing us the need for extensive cleaning and normalization.
Now that you have gathered and cleaned all your essential data, it is time to execute your plan and analyze the data, looking for evidence that supports or refutes your hypothesis.
Approach 1: Sensor Health (DS0013)
According to MITRE ATT&CK, we should consider monitoring process resource usage to determine anomalous activity associated with malicious hijacking of computer resources such as CPU or GPU resources. None of our systems have much GPU power to speak of, so we’ll concentrate on long-running processes with significant, sustained CPU usage.
Our SPL searches for any process that ran for at least five minutes and reached a CPU utilization of 90% or more at some point during its lifetime. The 90% cutoff is arbitrary and should be set according to the threat hunter’s risk appetite, their knowledge of their network environment, and any available threat intelligence. The search also produces summary statistics, such as minimum, maximum, and average CPU utilization, as well as a count of CPU utilization spikes. We calculate a simple risk score based on the number of events showing high CPU usage over the process’s lifetime; processes with higher risk scores are more likely to be related to cryptomining activity.
``` Find all records for any process that 1) exhibited ANY high CPU usage, and 2) ran for at least five minutes. The subsearch finds the list of processes that meet these requirements, then the main search retrieves ALL the records for each of these processes across their entire run times (even if they didn’t all show high CPU).```
index=botsv3 sourcetype="perfmonmk:process"
[search index=botsv3 sourcetype="perfmonmk:process" process_cpu_used_percent>=90 Elapsed_Time>=300
| table host, process_name, process_id
| dedup host, process_name, process_id]
``` Calculate some summary stats, including the number of times each process exhibited high CPU over its entire run ```
| eval high_cpu=if(process_cpu_used_percent>=90, 1, 0)
| stats count, earliest(_time) as et, latest(_time) as lt, max(Elapsed_Time) as elapsed, min(process_cpu_used_percent), max(process_cpu_used_percent), avg(process_cpu_used_percent) as avg_cpu, sum(high_cpu) as high_cpu by host, process_name, process_id
``` Convert timestamps to human-readable format ```
| convert ctime(et), ctime(lt)
``` Score each process by the ratio of high CPU instances to total run time ```
| eval risk_score=(high_cpu/elapsed)*100
``` Order by the new score ```
| sort - risk_score
Note that the five-minute threshold is an artifact of our simulated dataset. In a production computing environment, a cryptominer would probably run for far longer. If you reproduce this hunt with your own data, you’ll almost certainly want to extend this time; 30 minutes, or even longer, would probably be more useful.
Figure 3: Processes with high CPU Utilization
The search returns three processes with sustained CPU utilization across most of their runtime, all on the host BSTOLL-L. The Chrome process (chrome#4) immediately stands out: it has the highest risk score and the highest number of high-CPU events over the second-longest elapsed time. The other two results are legitimate Windows processes that are known to use large amounts of CPU from time to time.
While high utilization from a Chrome process does not confirm the existence of cryptomining, it does suggest that if this were a cryptominer, it would most likely be browser-based. From our research, we know that families such as CoinHive, Crypto-Loot, and JSEcoin are miners that run inside browser tabs, so this Chrome process is a plausible candidate for a cryptominer. However, we can’t jump to conclusions; more investigation is required to verify that this is actually a cryptominer.
Approach 2: Network Traffic (DS0029)
According to MITRE ATT&CK, we could monitor for newly constructed network connections that are sent or received by untrusted hosts, look for connections to/from strange ports, check the reputation of IPs and URLs, and monitor network data for uncommon data flows. There are many ways we might identify cryptominers, but simply looking for anomalous network connections is time-consuming and not focused specifically on our topic. Instead, we want to find connections to known blacklisted cryptomining domains, which we’ll identify using DNS query logs.
In our research, we found some lists of domains used by CoinHive and similar JavaScript cryptocurrency miners. In total, we found about 4.6k domains, which we uploaded to our Splunk search head as a CSV file.
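Before running the main search, it can be worth confirming that the lookup file loaded as expected. A quick sanity check (a sketch, assuming the same file name) might look like this:
``` Sanity check: confirm the uploaded lookup file is readable and count its rows ```
| inputlookup cryptocurrency_mining_list_large.csv
| stats count
With the lookup in place, the following query will identify DNS queries for any of the uploaded domains: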
index=botsv3 sourcetype="stream:dns"
| lookup cryptocurrency_mining_list_large.csv domain AS query OUTPUTNEW domain AS domain_matched
| stats min(_time) as first_seen, max(_time) as last_seen, count by host, domain_matched
| table domain_matched, host, first_seen, last_seen, count
| convert ctime(first_seen), ctime(last_seen)
| sort +first_seen
Figure 4: Results of Cryptomining Domains Hits based on IoC
These results show that there were DNS lookups for coinhive[.]com and five of its subdomains, all from the same computer (BSTOLL-L). This is the same system that hosted the suspicious Chrome process. The two findings support each other, and we can be reasonably sure that a cryptominer was running on BSTOLL-L.
This is the stage where you actually begin to take some action on any malicious activity you found, or maybe you refer it to the IR team, depending on how your organization handles these things.
In this case, we have one critical finding: BSTOLL-L was running a cryptominer, as evidenced by the DNS queries associated with Coinhive and by the fact that we (probably) identified the exact process doing the mining. We would have escalated this based solely on the DNS queries, but corroboration from both approaches gives us valuable context to begin the investigation.
Your executive summary might look like this:
Between 13:38:19 and 13:39:30 on Aug 20, 2018, host BSTOLL-L was observed querying 6 unique Coinhive cryptocurrency mining domains/subdomains. Coinhive is a cryptocurrency mining service designed to be installed on websites; it hijacks the computing power of any browser that visits the site to mine cryptocurrency. Immediately after, between 13:38:30 and 14:04:11 (~26 minutes), the CPU utilization of the ‘chrome#4’ browser process (PID 3400) surged to near 100%. The close timing of these events suggests that unauthorized browser-based cryptomining may be occurring on our network.
This step is optional, but it may be necessary when you are unable to confirm or refute your initial hypothesis. In our sample hunt, both approaches turned up traces of cryptomining activity, which confirmed our initial hypothesis, so there was no need to refine it.
The “Act” phase is all about making sure the knowledge gained from your hunt is captured and acted on. It’s what allows hunting to drive security improvement in your organization.
The techniques used in your hunt can be archived in the form of detection rules for future hunts on similar topics. For example, we could reuse our approaches if we need to hunt for other cryptomining-related activity. Many hunt teams maintain a central internal repository of past hunt reports. Hunters frequently refer back to previous hunts, so the investment in preserving those records pays off over time.
When you preserve a hunt, make sure to include:
Your documented findings capture the significance and impact of your entire hunt. These findings, along with the subsequent actions taken to address them, are key drivers of continuous improvement in your organization’s security posture.
The documented findings should include:
Whatever your organization’s change process looks like, your findings should be converted into production detection rules or signatures to catch similar threats in the future. Using your hunts to improve automated detection is the other key driver of continuous improvement in your organization’s security posture.
Keep in mind that, according to PEAK’s Hierarchy of Detection Outputs, you have multiple options for the detections you create. For example, while the DNS analysis might make a great choice for an automated alert, the CPU usage analysis is only suggestive of cryptomining, and not suitable for automated alerting. In this case, it might be better to create a dashboard using those results and have an analyst review it on a regular basis.
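For instance, an alert-ready variant of the Approach 2 search could be scheduled to run periodically, along these lines (a sketch reusing the same lookup file; the one-hour window is an assumption and should match your scheduling interval):
``` Sketch of a scheduled detection: alert when any host queries a known cryptomining domain in the last hour ```
index=botsv3 sourcetype="stream:dns" earliest=-1h
| lookup cryptocurrency_mining_list_large.csv domain AS query OUTPUTNEW domain AS domain_matched
| where isnotnull(domain_matched)
| stats count, values(domain_matched) as domains_matched by host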
As you hunt, you might find new ways to look for the hypothesized activity, or activity that’s closely related. You don’t necessarily want to go down a rabbit hole chasing something that’s not directly in support of your hypothesis, so add these new ideas to your hunt topic backlog. In this case, we didn’t have anything to add, though.
Share your hunting discoveries with the relevant stakeholders, such as the SOC, system owners, or other security teams. You might meet with them to brief them on your recent hunt, send an email summarizing your approach and findings, or use any other method that’s convenient for them to consume and understand.
We’ve demonstrated how you can use the PEAK framework to conduct a hypothesis-based hunt, in this case to detect unauthorized cryptomining. By applying the PEAK methodology, organizations can fortify their defenses, proactively identify and neutralize potential threats to improve their overall cybersecurity posture, and protect their digital assets more effectively.
As always, Happy Hunting!
This article was produced in collaboration with Jefnilham Jamaludin and Wei Liang Tan of the Cyber Security Agency of Singapore.