Malicious software such as ransomware often uses tactics, techniques, and procedures such as copying malicious files to the local machine to propagate itself across the network. A few years ago, the Cybersecurity and Infrastructure Security Agency, the Federal Bureau of Investigation, and the Department of Health and Human Services issued a joint cybersecurity advisory to warn at-risk entities of potential harm from threat actors. The advisory focused on attacks by malicious cyber actors using TrickBot and BazarLoader malware, which often lead to ransomware attacks, data theft, and the disruption of healthcare services. Despite the warning, these threats continue to haunt our networks with increasing sophistication.
Cybercriminals disseminate malicious software via phishing campaigns that contain either links to malicious websites hosting the malware or attachments carrying it. These malicious files are usually given legitimate-sounding program or document names to lure victims into opening them. After successful installation, the malware performs activities such as credential harvesting, mail exfiltration, cryptomining, point-of-sale data exfiltration, and the deployment of ransomware.
One key indicator of compromise by certain malware families is that, after successful execution, the malware copies itself as an executable file with a randomly generated filename, such as hwbpoidtowerp.exe, and places this file in one of several system directories. This technique is categorized as Masquerading in the MITRE ATT&CK Framework and is seen in several malware campaigns, such as Worm:W32/Downadup.AL. Hence, it becomes important to distinguish process names that have been organically created by a user from those generated randomly by malware.
Malware that propagates across a network may use randomly generated file names instead of masquerading as legitimate native binaries. One way to detect its presence is to separate randomly named processes from those we expect to see commonly. To achieve this, we set out to build a classifier that can distinguish randomly generated process names from those created by a user.
We developed a machine learning based analytic that uses a character-level Recurrent Neural Network (RNN) to distinguish between malicious and benign process names. An RNN is a class of neural networks particularly well suited to predicting sequences. Regular feed-forward and convolutional neural networks are rigid by comparison: they allow only a fixed-size vector as input and output, and the number of layers in the network is fixed before training. RNNs are a lot more flexible in this regard and allow us to operate over sequences of vectors.
We train a model to predict the class of a given process name based on the characters seen. The core of the idea is that the character distribution of random sequences is very different from that of sequences which follow certain rules such as English language words, where all characters are not equally likely. Additionally, given a sequence of characters, only certain characters are likely to follow them. Hence, a model can be trained to distinguish between the two. We describe our approach in further detail in sections below.
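To make this intuition concrete, here is a toy Python sketch (not the production model) that scores a string by the fraction of its character bigrams that appear in a small hand-picked set of common English bigrams. English-like names reuse these bigrams far more often than random ones do.

```python
# Toy illustration: English-like strings reuse a small set of common
# character bigrams far more often than random strings do.
COMMON_BIGRAMS = {
    "th", "he", "in", "er", "an", "re", "on", "at", "en", "nd",
    "ti", "es", "or", "te", "ed", "is", "it", "al", "ar", "st",
    "to", "nt", "ng", "se", "ha", "as", "ou", "io", "le", "ve",
    "co", "me", "de", "hi", "ri", "ro", "ic", "ne", "ea", "ch",
    "om", "ll", "ma", "ur",
}

def english_likeness(name: str) -> float:
    """Fraction of the name's bigrams that are common English bigrams."""
    bigrams = [name[i:i + 2] for i in range(len(name) - 1)]
    if not bigrams:
        return 0.0
    return sum(b in COMMON_BIGRAMS for b in bigrams) / len(bigrams)

print(english_likeness("chrome"))          # English-like name scores high
print(english_likeness("hwbpoidtowerp"))   # random-looking name scores low
```

This heuristic captures only first-order structure; the RNN learns far richer sequence patterns, but the underlying signal is the same.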
We use internal datasets for our modeling exercise. We will now describe the series of steps that go into the model design and architecture starting with preprocessing the data.
The model expects a process name composed of English-language characters; we focus on lowercase letters only, as experiments showed that case does not make a significant difference. We consider only the English alphabet at this time, as it is the more challenging problem and covers the most common scenarios; the model design allows it to be extended to alphanumeric characters in the future. A character sequence representing a process name is first converted from Unicode to ASCII, removing character accents and ensuring we only see characters from a-z. The process name is then converted into a tensor by representing each character as a one-hot vector. In brief, a one-hot feature vector contains a 1 only where the feature is present and 0s everywhere else; each character of the process name is encoded to indicate its position in the alphabet feature vector. The figure below presents an example of a one-hot feature vector. We finally end up with a tensor of size processname_length x batch_size x num_letters, where the batch size is used to partition data during training.
Splunk, One-hot Feature Vector, 2023
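The preprocessing described above can be sketched as follows, assuming PyTorch and a lowercase a-z alphabet (function and variable names are ours for illustration, not the shipped code):

```python
import string
import unicodedata

import torch

ALL_LETTERS = string.ascii_lowercase  # the 26-letter alphabet we consider
N_LETTERS = len(ALL_LETTERS)

def unicode_to_ascii(name: str) -> str:
    """Lowercase, strip accents, and keep only characters a-z."""
    return "".join(
        c for c in unicodedata.normalize("NFD", name.lower())
        if unicodedata.category(c) != "Mn" and c in ALL_LETTERS
    )

def name_to_tensor(name: str) -> torch.Tensor:
    """One-hot encode a name into a tensor of shape (len, 1, N_LETTERS)."""
    tensor = torch.zeros(len(name), 1, N_LETTERS)  # batch size of 1
    for i, ch in enumerate(name):
        tensor[i][0][ALL_LETTERS.index(ch)] = 1.0
    return tensor

tensor = name_to_tensor(unicode_to_ascii("Résumé"))  # accents stripped to "resume"
```

Each of the six rows of the resulting tensor contains exactly one 1, marking that character's position in the alphabet.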
The RNN we use for this problem consists of two linear layers, which take the current character and a hidden layer as input. The linear layers learn a transformation of the input and hidden layer combination to make an accurate prediction. The output obtained from the input-to-output linear layer is passed through a log softmax layer to output a prediction for a given letter of the process name. The input-to-hidden linear layer takes the same input and hidden layer combination to learn the transformation to the hidden layer, which encodes sequence information of all the characters seen so far. The parameters of the linear layers are tuned via backpropagation as the RNN observes more samples; as a result, a trained RNN learns parameters that minimize the prediction error. State information encoded by the hidden layer is critical to a correct final prediction, as a character alone doesn’t provide sufficient information. Hence, learning the correct hidden layer parameters has a significant effect on the final result. The architecture is shown in the figure below.
Splunk, Character Level RNN Architecture, 2023
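The architecture can be sketched in PyTorch roughly as follows (class and variable names are ours, not necessarily those of the shipped model):

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Two linear layers over the concatenated (input, hidden) vector."""

    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)  # input-to-hidden
        self.i2o = nn.Linear(input_size + hidden_size, output_size)  # input-to-output
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, letter: torch.Tensor, hidden: torch.Tensor):
        combined = torch.cat((letter, hidden), dim=1)
        new_hidden = self.i2h(combined)                 # updated sequence state
        output = self.log_softmax(self.i2o(combined))   # per-class log-probability
        return output, new_hidden

    def init_hidden(self) -> torch.Tensor:
        return torch.zeros(1, self.hidden_size)  # empty initial state
```

At every step the output depends on both the current character and the hidden state, which is how sequence context influences the prediction.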
The RNN is trained to predict the class of a word having seen its sequence of characters. It is initialized with an empty hidden state. To put it more concretely, let’s take the word pizza: when the RNN sees the character p, it tries to predict the output class; having then seen pi, it repeats the process to predict the output class for pi. This goes on until all characters of the word are exhausted, and the process concludes with the final predicted class of the word (i.e., the complete sequence of characters), which in this case would be benign. At every step, the only inputs to an RNN layer are the current character and a hidden layer that encodes the sequence information encountered so far. In this contrived example we see that the prediction becomes more confident towards the correct class as the trained RNN processes more and more characters.
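The character-by-character mechanics can be sketched with PyTorch's built-in RNNCell standing in for the model (untrained, so the probabilities below are meaningless; only the loop structure is illustrated):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

cell = nn.RNNCell(input_size=26, hidden_size=16)              # stand-in recurrent cell
head = nn.Sequential(nn.Linear(16, 2), nn.LogSoftmax(dim=1))  # class log-probabilities

hidden = torch.zeros(1, 16)  # empty initial hidden state
for ch in "pizza":
    x = torch.zeros(1, 26)
    x[0, ord(ch) - ord("a")] = 1.0  # one-hot encode the current character
    hidden = cell(x, hidden)        # hidden state now encodes the prefix seen so far
    prediction = head(hidden)       # class prediction after each character

# After the loop, `prediction` holds the final class prediction for the whole word.
```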
We use the negative log likelihood loss function and logsoftmax as the last layer; this choice provides stable learning. The final output is converted to a prediction score (between 0 and 1) for each class by taking the exponent.
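A minimal sketch of the loss and score computation (the logits here are illustrative; the production model's values will differ):

```python
import torch
import torch.nn as nn

# Illustrative final-layer activations for one name, two classes (benign, malicious)
logits = torch.tensor([[2.0, -1.0]])
log_probs = nn.LogSoftmax(dim=1)(logits)

# Negative log likelihood loss against the true class (0 = benign here)
loss = nn.NLLLoss()(log_probs, torch.tensor([0]))

# Exponentiating the log-probabilities yields scores between 0 and 1
scores = log_probs.exp()
```

Because log softmax and the exponent are inverses up to normalization, the scores for the two classes always sum to 1.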
The RNN performs very well in classifying benign vs. malicious process names. As seen in the confusion matrix below, we get a 99.36% True Positive Rate for a 0.62% False Positive Rate (here malicious is regarded as the positive class), thus ensuring high confidence in the system. The True Negative Rate is 99.38% and the False Negative Rate is 0.64%, which indicates we miss very few cases.
Even the prediction errors made by the RNN model are quite informative. Many times they are English-language words that look suspicious (e.g., cholecystojejunostomy, bumbailiffship, hemidemisemiquaver) or randomly generated words that could plausibly be dictionary words (e.g., eavsreppoasrugs, laoniumnoabohae, herssiaileic, daohepgauaeli). So, investigating these cases is not entirely futile from the perspective of a security analyst.
Splunk, Confusion Matrix, 2023
Once the model is ready, we can deploy it as a pre-trained model using the Splunk App for Data Science and Deep Learning (DSDL). A previous blog post walks through the deployment process and the mechanisms behind DSDL in detail; you can also check out the guide to get started. After deployment, the model is ready to be used in an SPL query, shown below.
| tstats `security_content_summariesonly` count min(_time) as firstTime max(_time) as lastTime from datamodel=Endpoint.Processes by Processes.process_name Processes.parent_process_name Processes.process Processes.user Processes.dest
| `drop_dm_object_name(Processes)`
| rename process_name as text
| fields text, parent_process_name, process, user, dest
| apply detect_suspicious_processnames_using_pretrained_model_in_dsdl
| rename predicted_label as is_suspicious_score
| rename text as process_name
| where is_suspicious_score > 0.5
| `detect_suspicious_processnames_using_pretrained_model_in_dsdl_filter`
Splunk, SPL Detection, 2023
The SPL query offloads most of the heavy lifting to the RNN model. After collecting the process names, the apply command invokes the prediction pipeline. As shown in the detection system design below, the prediction pipeline pre-processes the text, converts it into a tensor, and then feeds it to the RNN to make a prediction, which is finally converted into a score. As a final step, these results can be filtered based on a threshold T, which can be tuned according to the security analyst's risk tolerance. The text pre-processing is an important step of the pipeline: Processes.process_name returns a string which may be the full path of the executable file, e.g. C:/PROGRA~1/ECOSTR~1/1180~1.1/pgsql/bin/postgres.exe. The preprocessing step separates out the postgres part for an accurate prediction by the RNN. You can find more details of the detection, including the deployment instructions, in the following documentation.
Splunk, Detection System Design, 2023
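The path-stripping step can be sketched as follows (a simplified stand-in for the pipeline's actual pre-processing, handling both Windows and POSIX path separators):

```python
def extract_process_name(raw: str) -> str:
    """Strip the directory path and extension, keeping only letters a-z."""
    base = raw.replace("\\", "/").rsplit("/", 1)[-1]  # file name component
    stem = base.rsplit(".", 1)[0]                     # drop the extension
    return "".join(c for c in stem.lower() if "a" <= c <= "z")

print(extract_process_name("C:/PROGRA~1/ECOSTR~1/1180~1.1/pgsql/bin/postgres.exe"))
# → postgres
```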
This post discussed a compromise technique popular among threat actors for propagating malware: creating randomly named executables. We saw that a character-level RNN language model can serve as a powerful tool to detect suspicious processes, and how finding such processes can surface a compromise via malware. The model is robust, as indicated by a True Positive Rate of 99.36% at a low False Positive Rate of 0.62%. This minimizes false alarms and boosts confidence in the system. The model misses very few malicious cases (False Negative Rate of 0.64%) while correctly passing benign ones (True Negative Rate of 99.38%), so it can be deployed in scenarios where the margin for error is low. RNNs are very versatile; the techniques discussed here are generic and can be used to detect any malicious character sequence, such as URLs, domain names, and filenames. The power of language models can be fully harnessed through the Splunk ecosystem. By deploying detections using DSDL, we offer pre-trained deep learning models conveniently accessible via SPL. These models produce predictions that are robust and can be tuned to suit the security posture of the environment where they are deployed.
Learn even more by watching the ML in Security: Elevate Your DGA Detection Game tech talk.
Any feedback or requests? Feel free to put in an issue on GitHub and we’ll follow up. Alternatively, join us on the Slack channel #security-research. Follow these instructions if you need an invitation to our Splunk user groups on Slack.
We thank the Splunk Threat Research Team for their comments on this blog.
This blog was co-authored by Abhinav Mishra (Principal Applied Scientist), Kumar Sharad (Senior Threat Researcher) and Namratha Sreekanta (Senior Software Engineer).