Since the Domain Name System (DNS) protocol is foundational to internet functionality, DNS traffic is allowed through firewalls with far less scrutiny than protocols such as HTTPS, FTP, and SMTP. Malicious actors have exploited this leniency to transfer data between networks, a use well beyond the original intent of the DNS protocol.
Because DNS typically runs over the User Datagram Protocol (UDP), adversaries use it in one of two ways: high throughput tunneling, which creates a Command and Control (C2) channel through which data moves reliably and bidirectionally between a malware-infected client and the C2 server, or low throughput data exfiltration, which sends independent DNS queries each carrying a small amount of data. Data exfiltration can be carried out by outside attackers, who plant malware through sophisticated techniques such as phishing; the malware then orchestrates exfiltration periodically. It can also be an insider threat, where company employees move sensitive data outside the company's secure network.
In low throughput data exfiltration, sensitive information (Personally Identifiable Information (PII), user credentials, or confidential data) is encoded and embedded as subdomain text in DNS requests to the attacker's domain. If the sensitive data is too large to fit into a single DNS query, the malware breaks it into query-sized chunks, each of which is encoded and disguised as a DNS request. Since the attacker controls the malware, they can easily decode a single query, or reassemble multiple queries from the same host, to reconstruct the data.
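As a concrete (and deliberately simplified) illustration of this chunking scheme, the sketch below splits a payload into chunks, base32-encodes each one as a subdomain label, and shows how the attacker's decoder would reassemble the data. The domain name and chunk size here are made up for the example.

```python
import base64

def exfil_queries(data: bytes, attacker_domain: str, chunk_size: int = 30) -> list:
    """Illustration only: split data into chunks and encode each chunk
    as a subdomain label of a DNS query to the attacker's domain."""
    queries = []
    for seq, i in enumerate(range(0, len(data), chunk_size)):
        chunk = data[i:i + chunk_size]
        # Base32 keeps the label within DNS's allowed character set.
        label = base64.b32encode(chunk).decode().rstrip("=").lower()
        queries.append(f"{seq}.{label}.{attacker_domain}")
    return queries

def reconstruct(queries: list) -> bytes:
    """Attacker-side decoder: order by sequence number, re-pad, decode."""
    chunks = []
    for q in sorted(queries, key=lambda q: int(q.split(".")[0])):
        label = q.split(".")[1].upper()
        label += "=" * (-len(label) % 8)  # restore base32 padding
        chunks.append(base64.b32decode(label))
    return b"".join(chunks)
```

Each resulting query looks like an ordinary lookup of a long subdomain, which is exactly why lexical features such as length and entropy matter for detection.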
Below is a typical scenario of a DNS data exfiltration attack:
To evade detection of low throughput data exfiltration, the attacker can rotate encoding schemes and employ Domain Generation Algorithms (DGAs) to dynamically generate and register domain names that are resistant to disruption by static analysis.
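A DGA can be sketched as a deterministic function of a shared seed and the current date, so that the malware and the attacker independently compute the same candidate domains each day. The function below is a hypothetical illustration, not any real malware family's algorithm.

```python
import hashlib
from datetime import date

def generate_domains(seed: str, day: date, count: int = 5) -> list:
    """Hypothetical DGA sketch: derive pseudo-random domains from a
    shared seed and the date, so both sides agree without communicating."""
    domains = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}:{day.isoformat()}:{i}".encode()).hexdigest()
        # Map hex digits to letters to build a plausible-looking label.
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + ".com")
    return domains
```

Because the domains change daily and are never seen before registration, static blocklists lag behind; this is one reason behavioral and lexical detection is needed.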
With insider threats having almost doubled since last year, traditional defenses such as blocking DNS traffic outright or deploying Data Loss Prevention (DLP) tools have proven inefficient. Detecting DNS exfiltration requests requires complex lexical analysis, which is impractical to do manually given the high volume and frequency of DNS queries. This is why machine learning can be useful for detecting data exfiltration queries in near real time.
Splunk has developed a deep learning based detection in the Enterprise Security Content Update (ESCU) app that monitors your DNS traffic for signs of low throughput DNS exfiltration. The detection has an accuracy of 99.97%, ensuring almost all suspicious DNS exfiltration requests are detected. The model is deployed using the Splunk App for Data Science and Deep Learning (DSDL), and further details can be found here.
We collected a large volume of DNS data over a long period of time and injected exfiltration requests generated with data exfiltration tools. We framed the detection of DNS data exfiltration as a binary classification problem, labeling the 'is_exfiltration' field as 1 for exfiltration requests and 0 for non-exfiltration requests.
Recent research has primarily focused on detecting high throughput DNS tunneling by monitoring traffic over windows of up to an hour and investigating attributes such as query length, response codes, and the encoded nature of the data. However, these mechanisms are insufficient for low throughput data exfiltration, where the DNS traffic is deliberately slowed to avoid detection. To overcome this, we analyzed recent DNS activity and looked for patterns of exfiltration between a host and a top-level domain, in contrast to the short time windows used to detect high throughput tunneling. By considering the context and history of communication between a host and a domain, and creating features that capture recent activity, we were able to identify DNS data exfiltration requests that are, by nature, slow and camouflaged among benign DNS requests. Below is a list of single-request features created for an individual DNS request, along with aggregated features computed over a sliding window of 'x' events between the same source and domain.
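A sliding-window feature extractor of this kind might look like the following sketch. The feature names and window size here are illustrative assumptions, not the exact set used by the detection.

```python
from collections import defaultdict, deque

class WindowFeatures:
    """Track the last `x` queries per (src, domain) pair and emit features
    for the current request plus aggregates over the recent window."""
    def __init__(self, x: int = 10):
        self.windows = defaultdict(lambda: deque(maxlen=x))

    def update(self, src: str, domain: str, query: str) -> dict:
        window = self.windows[(src, domain)]
        window.append(query)
        lengths = [len(q) for q in window]
        return {
            "query_len": len(query),                        # single-request feature
            "window_count": len(window),                    # recent events seen
            "window_avg_len": sum(lengths) / len(lengths),  # aggregated features
            "window_max_len": max(lengths),
            "window_unique_queries": len(set(window)),
        }
```

Keying the window on the (source, domain) pair is what gives the model the communication history that a single-request view lacks.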
Through data exploration, it became evident that both the entropy and the length of requests are markedly higher in exfiltration cases than in non-exfiltration cases.
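Shannon entropy over the query's characters is one way to quantify this: encoded exfiltration payloads use the character set far more uniformly than natural domain names. A quick sketch (the domain names below are made up for illustration):

```python
import math

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

# A natural-looking name versus a base32-style encoded label (hypothetical).
benign = "mail.google.com"
encoded = "nfxgg5dvom3tklrq.badguy.example"
```

Natural names repeat a small set of characters, while encoded labels spread probability mass across many characters, pushing the entropy up.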
We started the modeling process by building a simple baseline using a Random Forest classifier. The dataset was split 90% for training and 10% for testing.
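Under the assumption of a scikit-learn workflow, with stand-in synthetic features in place of the real DNS feature vectors, the baseline might look like this sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: rows are feature vectors, 1 = is_exfiltration.
rng = np.random.default_rng(0)
X_benign = rng.normal(0.0, 1.0, size=(900, 4))  # e.g. low length/entropy
X_exfil = rng.normal(5.0, 1.0, size=(100, 4))   # e.g. high length/entropy
X = np.vstack([X_benign, X_exfil])
y = np.array([0] * 900 + [1] * 100)

# 90/10 train/test split, as in the baseline described above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The synthetic classes here are deliberately well separated; the real baseline trains on the character-count and engineered features described below.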
We tokenized the DNS request text into a numerical representation by creating a vector in which each element corresponds to the index of one of the 94 printable characters:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
The general assumption is that a query is limited to around 255 characters, including the dots, but the request length can vary considerably depending on whether the DNS request travels over UDP or TCP, and on whether it uses ASCII or Unicode. Instead of trimming or padding the text, we use a fixed-size vector of length 94 (one slot per printable character), where each element contains the count of the indexed character.
To this vector we concatenate a vector of the pre-computed features, resulting in a vector of size 98 that is passed as input to the Random Forest model.
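Conveniently, Python's `string` module yields exactly the 94-character ordering listed above (digits, then letters, then punctuation), so the tokenization plus feature concatenation can be sketched as follows. The four extra features passed in are assumed placeholders for the engineered features described earlier.

```python
import string

# The 94 printable characters listed above: digits, letters, punctuation.
PRINTABLE = string.digits + string.ascii_letters + string.punctuation
INDEX = {c: i for i, c in enumerate(PRINTABLE)}

def vectorize(query: str, extra_features: list) -> list:
    """Count each indexed character (a length-94 vector), then append the
    pre-computed features; with 4 such features the input size is 98."""
    counts = [0] * len(PRINTABLE)
    for c in query:
        if c in INDEX:
            counts[INDEX[c]] += 1
    return counts + list(extra_features)
```

Using counts rather than positional encoding sidesteps variable query length without trimming or padding.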
We achieved a 100% True Negative Rate (TNR) and a False Negative Rate (FNR) of about 0.35%. To further reduce the FNR, we experimented with deep learning models, which can learn better feature representations from domain text and classify exfiltration cases more accurately. The sections below walk through the model architecture, training, and results.
A Multilayer Perceptron (MLP) is a feed-forward neural network with an input layer, an output layer, and one or more hidden layers. The inputs are combined with weights as a linear combination and passed through a non-linear activation function. Activations are propagated from one layer to the next, with each layer's units learning internal representations of the data. To minimize the cost of wrong predictions, the model performs backpropagation, iteratively adjusting the weights of each layer.
Because deep learning models train on datasets too large to fit into memory, training proceeds in batches, one batch per iteration; the model is evaluated and its weights adjusted over many such iterations. When a highly imbalanced dataset is divided into small batches, each batch can under-represent the minority class: most batches contain few or no exfiltration cases, so minimizing the batch error favors the majority class. Class imbalance can be detrimental to a classifier's performance, so to overcome it we oversampled the minority class until the dataset was no longer imbalanced.
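A minimal oversampling sketch, duplicating minority-class rows with replacement until the classes balance (real pipelines might instead use schemes such as SMOTE from imbalanced-learn):

```python
import random

def oversample(rows: list, labels: list, minority: int = 1, seed: int = 0) -> tuple:
    """Duplicate minority-class rows (sampled with replacement) until the
    class counts are equal, then shuffle so batches stay mixed."""
    rng = random.Random(seed)
    major = [(r, l) for r, l in zip(rows, labels) if l != minority]
    minor = [(r, l) for r, l in zip(rows, labels) if l == minority]
    extra = [rng.choice(minor) for _ in range(len(major) - len(minor))]
    balanced = major + minor + extra
    rng.shuffle(balanced)
    new_rows, new_labels = zip(*balanced)
    return list(new_rows), list(new_labels)
```

The shuffle at the end matters: it spreads the duplicated minority rows across batches so each batch sees both classes.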
Since we tokenize the text as described above, we do not need to create vocabularies or embedding layers. The model architecture contains two dense layers of 256 units each, with a ReLU activation function and a dropout of 0.5 to avoid overfitting. The final layer is a single dense unit with a sigmoid activation function that generates a probability score indicating how likely the input is to be exfiltration.
Layer 0 - the input layer of 98 features, as described above.
Layer 1 - the first dense (fully connected) layer, with 256 units, a dropout of 0.5 to avoid overfitting, and a rectified linear unit (ReLU) activation function.
Layer 2 - the second dense (fully connected) layer, also with 256 units, a dropout of 0.5, and ReLU activation.
Layer 3 - the final output layer, with a sigmoid activation function that converts the output into a probability score indicating how likely the instance is to be exfiltration.
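The forward pass of this architecture can be sketched in NumPy. The weights below are random placeholders, not the trained weights that ship with the model, and dropout is omitted because it is active only during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder weights; the deployed model uses trained weights.
W1, b1 = rng.normal(size=(98, 256)) * 0.1, np.zeros(256)   # layer 1
W2, b2 = rng.normal(size=(256, 256)) * 0.1, np.zeros(256)  # layer 2
W3, b3 = rng.normal(size=(256, 1)) * 0.1, np.zeros(1)      # output layer

def forward(x: np.ndarray) -> np.ndarray:
    """Forward pass: two ReLU layers, then a sigmoid probability score."""
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)
```

Each row of the input is one 98-dimensional feature vector; each row of the output is the corresponding exfiltration probability.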
For training and testing, we divided the dataset into train, validation, and test sets of 80%, 10%, and 10% respectively. Since training deep learning models on large datasets can take hours, we leveraged GPUs, which are specialized for the mathematical transformations our model computes.
A confusion matrix describes classifier performance by comparing actual and predicted values. Compared with the baseline, the deep learning model reduced the False Negative Rate (FNR) from 0.35% to 0.03%, meaning we identify almost all exfiltration cases with a very low misclassification error. A True Negative Rate (TNR) of nearly 100% (a 0.01% misclassification error) indicates that the model has learned normal DNS behavior very well.
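The rates quoted above derive directly from the confusion matrix; this small helper shows the arithmetic (labels: 1 = exfiltration, 0 = benign):

```python
def rates(y_true, y_pred):
    """Compute TNR and FNR from actual vs predicted labels."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tnr = tn / (tn + fp)  # fraction of benign requests kept quiet
    fnr = fn / (fn + tp)  # fraction of exfiltration requests missed
    return tnr, fnr
```

A low FNR matters most here, since a missed exfiltration request means data leaving the network undetected.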
The pretrained model is available here and can easily be deployed using the Splunk App for Data Science and Deep Learning (DSDL); check out the instructions to deploy the model using DSDL here. Once deployed, the pretrained model can be used within your SPL search by appending '| apply detect_dns_data_exfiltration_using_pretrained_model_in_dsdl'.
| tstats `security_content_summariesonly` count from datamodel=Network_Resolution by DNS.src _time DNS.query
| `drop_dm_object_name("DNS")`
| sort - _time, src, query
| streamstats count as rank by src query
| where rank < 10
| table src, query, rank, _time
| apply detect_dns_data_exfiltration_using_pretrained_model_in_dsdl
| table src, _time, query, rank, pred_is_dns_data_exfiltration_proba, pred_is_dns_data_exfiltration
| where rank == 1
| rename pred_is_dns_data_exfiltration_proba as is_exfiltration_score
| rename pred_is_dns_data_exfiltration as is_exfiltration
| where is_exfiltration_score > 0.5
| `security_content_ctime(_time)`
| table src, _time, query, is_exfiltration_score, is_exfiltration
| `detect_dns_data_exfiltration_using_pretrained_model_in_dsdl_filter`
The pretrained model detect_dns_data_exfiltration_using_pretrained_model_in_dsdl takes src, _time, query and rank as input and outputs a probability score, is_exfiltration_score, indicating how likely the DNS request is to be an exfiltration request. The detection works on events in the Network_Resolution data model, which are ranked by _time for each src and query. The search keeps the 10 most recent events as the recent history of interactions between the same src and domain. The model then creates features for the latest DNS request, along with aggregated features over the past events, and predicts whether the latest DNS request is a DNS exfiltration case. The detection surfaces the most recent DNS requests that are most likely exfiltration requests; the threshold is set at 0.5 and is tunable by customers. For Splunk Enterprise Security customers, the ESCU detection for DNS data exfiltration is readily available in ESCU v4.5.0. The detection generates risk events for every possible DNS exfiltration case detected; these risk events are then processed by the Risk-Based Alerting (RBA) framework of Enterprise Security to generate notables. Read through our recent blog post for more details.
Recent reports suggest that data theft remains a major concern, as traditional systems and techniques fail to detect the early signs of data exfiltration. Previous work has mainly focused on DNS tunneling, yet data exfiltration remains critical: it is the most common technique used in ransomware campaigns, and the longer it stays undetected, the more data can be exfiltrated.
Most machine learning models examine only the latest DNS request, without the valuable context of communication history between the host and the domain. Instead of a short time window, which may be insufficient for low throughput DNS exfiltration, we consider a recent history of the past 'x' events. The deep learning model creates features not only for the current DNS request but also aggregated features over that recent history. Our deep learning model performs very well, with a very low misfire rate that ensures almost all benign DNS requests are classified correctly. Since fewer than 0.1% of DNS requests involve exfiltration, the model's False Positive Rate of 0.01% means it very rarely raises false alarms.
Any feedback or requests? Feel free to put in an issue on GitHub and we’ll follow up. Alternatively, join us on the Slack channel #security-research. Follow these instructions if you need an invitation to our Splunk user groups on Slack.
Special thanks to Splunk Threat Research and the Splunk Product Marketing Team.
This blog was co-authored by Abhinav Mishra (Principal Applied Scientist), Kumar Sharad (Senior Threat Researcher) and Namratha Sreekanta (Senior Software Engineer)
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company — with over 7,500 employees, Splunkers have received over 1,020 patents to date and availability in 21 regions around the world — and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.