Welcome to the final installment in our “Add to Chrome?” research! In this post, we'll experiment with a method for finding masquerading or otherwise suspicious clusters of Chrome extensions using Model-Assisted Threat Hunting (M-ATH) with Splunk and the Data Science & Deep Learning (DSDL) App. M-ATH is a SURGe-developed method from the PEAK framework that uses models or algorithms to help find threat-hunting leads, or to make complex problems more approachable.
M-ATH is well-suited for identifying suspicious Chrome extensions because of the complexity of extensions and the scale of the Web Store. Chrome extensions are not singular components, but compilations of multiple files and file types, written in different programming languages and designed for diverse functions, e.g., language translation, ad-blocking, or password management. The Chrome Web Store hosts more than 140,000 extensions for the Chrome browser. The extension contents are packaged in compressed “.crx” files, which bundle the files, scripts, and metadata required to make the extension work.
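If you want to peek inside one of these packages yourself, a CRX is essentially a ZIP archive with a short binary header prepended. Here's a minimal Python sketch, assuming a hypothetical local file named sample_extension.crx (Python’s zipfile module can usually open the package directly because it locates the archive from the end of the file):

```python
import zipfile

# A .crx package is a ZIP archive with a short CRX header prepended.
# zipfile locates the archive from the end of the file, so it can
# usually list the contents without stripping the header first.
crx_path = "sample_extension.crx"  # hypothetical local copy of an extension

with zipfile.ZipFile(crx_path) as crx:
    for info in crx.infolist():
        print(f"{info.filename:<40} {info.file_size:>8} bytes")
```

A typical listing includes manifest.json alongside JavaScript, HTML, and image assets.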
Confirming that an extension is malicious typically requires reverse-engineering effort, and often depends on the context in which the extension purportedly operates. We can narrow down this haystack using M-ATH to identify Chrome extensions worth further analysis. To see everything in action, or to follow along, we’ve made the full data corpus and notebook available on our GitHub and Google Colab!
Masquerading is a critical challenge for any curated app store, open-source marketplace, or package repository. In a masquerading attack, adversaries subvert users’ trust by publishing stand-in apps that look or sound like a trusted, well-understood service, often backdooring the files to steal personal or private information or to implant malware on the user’s machine.
Our M-ATH approach operates under the hypothesis that if we start with a popular ‘target extension,’ we can use similarity measures to find suspicious extensions that imitate, or masquerade as, the target. As an example, we’ll start with the popular ‘Google Translate’ extension, which adds a right-click option to quickly translate text in your browser between languages.
To start, we need good data that represents the contents and attributes of the extensions hosted on the Web Store. The SURGe Team (with the approval of the Google security team) scraped the contents of the Chrome Web Store to construct a baseline dataset of the following:
| Field | Description |
|---|---|
| name | Name of the Chrome extension. |
| description | Description of the functionality of the browser extension. |
| crx | The ‘.crx’ file is a packaged Chrome browser extension. Each CRX is assigned a unique hash value by Google for tracking. |
| Extension File Contents | The CRX package contains multiple files that support the functionality of the extension, e.g., JavaScript files, HTML files, images, the manifest file (manifest.json), and more. |
Next, we enriched this data with additional metrics, such as the hashes and vector encodings described below.
Our hunting approach relies on aspects we’d expect to see from a masquerading extension: it would attempt to look like, sound like, and/or be described like our target extension. How do we turn these ideas into measurements? Using our baseline data and the hashes we’ve added, we have several similarity measures at our disposal.
With these ideas in hand and a target browser extension (Google Translate) to start our hunt, we can quickly identify the most similar extensions in appearance, name, and description from the sea of more than 140,000. First, let’s break these measures down with some examples.
Levenshtein similarity is based on a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. In this case, we invert the metric (converting distance to similarity) to match the orientation of our other similarity measures. If we sort by our Levenshtein metric, we can quickly find the most similarly named extensions, and their unique CRX identifiers for pivoting to the full file contents.
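As a rough illustration (not the exact implementation from our notebook), here’s a pure-Python sketch that computes the classic edit distance and normalizes it into a 0–1 similarity by dividing by the length of the longer string:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Invert distance into a 0-1 similarity, normalized by the longer string."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))


print(levenshtein_similarity("Google Translate", "Google Translator"))  # ~0.88
```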
Icon similarity is assessed using the Hamming similarity between the Color Moment Hashes of two extensions’ icons. Color Moment Hash is a compact representation (a hash) of an image, based on the statistical moments of its color components. This hash is valuable for image comparison because it encapsulates significant information about the image's color distribution while being relatively insensitive to small changes or distortions in the image. Measuring the similarity between Color Moment Hashes quickly locates exact matches to our target icon, as well as similar but slightly modified versions.
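As a simplified sketch, OpenCV’s img_hash module (from opencv-contrib-python) can compute a Color Moment Hash for each icon. Note that this example converts the module’s own hash distance into a similarity score rather than reproducing the Hamming comparison described above, and the icon file names are hypothetical:

```python
import cv2  # requires opencv-contrib-python for the img_hash module

hasher = cv2.img_hash.ColorMomentHash_create()

# Hypothetical icon files pulled from two extension packages.
icon_target = cv2.imread("google_translate_icon.png")
icon_candidate = cv2.imread("candidate_extension_icon.png")

hash_target = hasher.compute(icon_target)        # vector of color moments
hash_candidate = hasher.compute(icon_candidate)

# Convert the module's hash distance into a similarity score
# (identical icons score 1.0; the score shrinks as they diverge).
distance = hasher.compare(hash_target, hash_candidate)
similarity = 1.0 / (1.0 + distance)
print(f"icon similarity: {similarity:.3f}")
```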
Cosine similarity is the metric we use to compare extension description fields. Each description is represented as a vector (a numerical encoding), where each dimension corresponds to a word from the combined set of words in both texts, and the value in each dimension corresponds to the weight of that word in the text. Cosine similarity is then the cosine of the angle between these two vectors, which reflects how similar the text fields are.
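Here’s a small sketch of that idea using TF-IDF weighting and scikit-learn; the description strings are made up for illustration, and the exact vectorization in our notebook may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up description strings; the real corpus lives in the released dataset.
descriptions = [
    "Translate text on any web page with a right-click.",          # target
    "Right-click to translate selected text into your language.",  # look-alike
    "Block ads and trackers on every site you visit.",             # unrelated
]

# Encode each description as a weighted word vector, then compare the
# target (row 0) against every other description.
vectors = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
print(cosine_similarity(vectors[0], vectors[1:]))  # higher score = more similar
```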
We used the previous three measures to hunt for similarity between names, descriptions, and icon appearance individually, but a true masquerading extension could combine more than one of these attributes. To better assess our similar extensions as a whole, we combine the top results from each category, cluster them with K-means, and then measure the Euclidean distance from each point to our target vector (Google Translate).
K-means is an unsupervised learning algorithm that partitions a dataset into k distinct, non-overlapping clusters based on the attributes of the data points – in this case, the Levenshtein, Cosine and Color Moment Hamming similarity measures. The algorithm aims to minimize the within-cluster variances and maximize the between-cluster variances, meaning that it seeks to create clusters where members of the same cluster are as similar as possible while also being as different as possible from members of other clusters. By assigning colors to different cluster values, we can visualize our data in a three-dimensional scatter plot, across each of their similarity measures.
3-D Scatter Plot of Levenshtein, Color Moment, & Cosine Similarity Measures
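As a simplified sketch of the clustering and ranking step described above (using synthetic similarity scores and an assumed k of 5, not the values from our notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# Synthetic feature matrix: one row per candidate extension, with columns for
# the Levenshtein, cosine, and Color Moment (Hamming) similarity scores.
rng = np.random.default_rng(42)
features = rng.random((500, 3))
target = np.array([[1.0, 1.0, 1.0]])  # the target extension is maximally similar to itself

# Partition the candidates into k clusters across the three similarity measures.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(features)

# Rank candidates by Euclidean distance to the target vector; the closest
# points (and the clusters they fall in) are the leads worth a closer look.
distances = cdist(features, target).ravel()
closest = np.argsort(distances)[:10]
print("clusters of the 10 closest candidates:", kmeans.labels_[closest])
```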
In summary, we demonstrated how we can apply hashing and vectorization to encode and enrich our dataset, use algorithms to measure the similarity between vectors and use modeling to combine and cluster multiple elements of similarity. This approach allows us to start with any single extension, and quickly reduce our dataset of 140,000 extensions down to a handful of candidates for deeper reverse engineering analysis.
The notebook and data we’ve released create an independent way to explore and analyze Chrome Web Store extension data yourself. In addition, the process and metrics can be transferred to a variety of different Model-Assisted Threat Hunting or data exploration challenges using Splunk AI and DSDL. Can this approach be applied to find new threats in your environment?
Stay tuned for more updates from the SURGe team, and Happy Hunting!