Welcome to the final installment in our “Add to Chrome?” research! In this post, we'll experiment with a method for finding masquerading or otherwise suspicious clusters of Chrome extensions using Model-Assisted Threat Hunting (M-ATH) with Splunk and the Data Science & Deep Learning (DSDL) App. M-ATH is a SURGe-developed method from the PEAK framework that uses models or algorithms to help find threat-hunting leads, or to make complex problems more approachable.
M-ATH is well-suited for identifying suspicious Chrome extensions because of the complexity of extensions and the scale of the Web Store. Chrome extensions are not singular components, but compilations of multiple files and file types, written in different programming languages and designed for diverse functions, e.g., language translation, ad-blocking, or password management. The Chrome Web Store hosts more than 140,000 extensions for the Chrome browser. The extension contents are packaged in compressed “.crx” files, which bundle the files, scripts, and metadata required to make the extension work.
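If you want to peek inside one of these packages yourself, a CRX is essentially a ZIP archive with a short binary header prepended. Here's a minimal Python sketch, assuming a hypothetical local file named sample_extension.crx (Python’s zipfile module can usually open the package directly because it locates the archive from the end of the file):

```python
import zipfile

# A .crx package is a ZIP archive with a short CRX header prepended.
# zipfile locates the archive from the end of the file, so it can
# usually list the contents without stripping the header first.
crx_path = "sample_extension.crx"  # hypothetical local copy of an extension

with zipfile.ZipFile(crx_path) as crx:
    for info in crx.infolist():
        print(f"{info.filename:<40} {info.file_size:>8} bytes")
```

A typical listing includes manifest.json alongside JavaScript, HTML, and image assets.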
Confirming that an extension is malicious typically requires reverse-engineering effort, and often depends on the context in which the extension purportedly operates. We can narrow down this haystack using M-ATH to identify Chrome extensions worth further analysis. To see everything in action, or to follow along, we’ve made the full data corpus and notebook available on our GitHub and Google Colab!
Masquerading is a critical challenge for any curated app store, open-source marketplace, or package repository. In a masquerading attack, adversaries subvert users’ trust by publishing stand-in apps that look or sound like a trusted, well-understood service, often backdooring the files to steal personal or private information or to implant malware on the user’s machine.
Our M-ATH approach operates under the hypothesis that if we start with a popular ‘target extension,’ we can use similarity measures to find suspicious extensions that imitate, or masquerade as, the target. As an example, we’ll start with the popular ‘Google Translate’ extension, which adds a right-click option to quickly translate text in your browser between languages.
To start, we need good data that represents the contents and attributes of the extensions hosted on the Web Store. The SURGe Team (with the approval of the Google security team) scraped the contents of the Chrome Web Store to construct a baseline dataset of the following:
| Field | Description |
|---|---|
| name | Name of the Chrome extension. |
| description | Description of the functionality of the browser extension. |
| crx | The ‘.crx’ file is a packaged Chrome browser extension. Each CRX is assigned a unique hash value by Google for tracking. |
| Extension File Contents | The CRX package contains multiple files that support the functionality of the extension, e.g., JavaScript files, HTML files, images, the manifest file (manifest.json), and more. |
Next, we enriched this data with additional metrics, such as the hashes and vector encodings described below.
Our hunting approach relies on aspects we’d expect to see from a masquerading extension: it would attempt to look like, sound like, and/or be described like our target extension. How do we turn these ideas into measurements? Using our baseline data and the hashes we’ve added, we have several similarity measures at our disposal.
With these ideas in hand and a target browser extension (Google Translate) to start our hunt, we can quickly identify the most similar extensions in appearance, name, and description from the sea of more than 140,000. First, let’s break these measures down with some examples.
Levenshtein similarity is based on a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. In this case, we invert the metric (converting distance to similarity) to match the orientation of our other similarity measures. If we sort by our Levenshtein metric, we can quickly find the most similarly named extensions, and their unique CRX identifiers for pivoting to the full file contents.
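As a rough illustration (not the exact implementation from our notebook), here’s a pure-Python sketch that computes the classic edit distance and normalizes it into a 0–1 similarity by dividing by the length of the longer string:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Invert distance into a 0-1 similarity, normalized by the longer string."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))


print(levenshtein_similarity("Google Translate", "Google Translator"))  # ~0.88
```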
Icon similarity is assessed using the Hamming similarity between the Color Moment Hashes of two extensions’ icons. Color Moment Hash is a compact representation (a hash) of an image, based on the statistical moments of its color components. This hash is valuable for image comparison because it encapsulates significant information about the image's color distribution while being relatively insensitive to small changes or distortions in the image. Measuring the similarity between Color Moment Hashes quickly locates exact matches to our target icon, as well as similar but slightly modified versions.
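As a simplified sketch, OpenCV’s img_hash module (from opencv-contrib-python) can compute a Color Moment Hash for each icon. Note that this example converts the module’s own hash distance into a similarity score rather than reproducing the Hamming comparison described above, and the icon file names are hypothetical:

```python
import cv2  # requires opencv-contrib-python for the img_hash module

hasher = cv2.img_hash.ColorMomentHash_create()

# Hypothetical icon files pulled from two extension packages.
icon_target = cv2.imread("google_translate_icon.png")
icon_candidate = cv2.imread("candidate_extension_icon.png")

hash_target = hasher.compute(icon_target)        # vector of color moments
hash_candidate = hasher.compute(icon_candidate)

# Convert the module's hash distance into a similarity score
# (identical icons score 1.0; the score shrinks as they diverge).
distance = hasher.compare(hash_target, hash_candidate)
similarity = 1.0 / (1.0 + distance)
print(f"icon similarity: {similarity:.3f}")
```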
Cosine similarity is the metric we use to compare extension description fields. Each description is represented as a vector (a numerical encoding), where each dimension corresponds to a word from the combined set of words in both texts, and the value in each dimension corresponds to the weight of that word in the text. Cosine similarity is then the cosine of the angle between these two vectors, which reflects how similar the text fields are.
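Here’s a small sketch of that idea using TF-IDF weighting and scikit-learn; the description strings are made up for illustration, and the exact vectorization in our notebook may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up description strings; the real corpus lives in the released dataset.
descriptions = [
    "Translate text on any web page with a right-click.",          # target
    "Right-click to translate selected text into your language.",  # look-alike
    "Block ads and trackers on every site you visit.",             # unrelated
]

# Encode each description as a weighted word vector, then compare the
# target (row 0) against every other description.
vectors = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
print(cosine_similarity(vectors[0], vectors[1:]))  # higher score = more similar
```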
We used the previous three measures to hunt for similarity between names, descriptions, and icon appearance individually, but a true masquerading extension could combine more than one of these attributes. To better assess our similar extensions as a whole, we combine the top results from each category, cluster them with K-means, and then measure the Euclidean distance from each point to our target vector (Google Translate).
K-means is an unsupervised learning algorithm that partitions a dataset into k distinct, non-overlapping clusters based on the attributes of the data points – in this case, the Levenshtein, Cosine and Color Moment Hamming similarity measures. The algorithm aims to minimize the within-cluster variances and maximize the between-cluster variances, meaning that it seeks to create clusters where members of the same cluster are as similar as possible while also being as different as possible from members of other clusters. By assigning colors to different cluster values, we can visualize our data in a three-dimensional scatter plot, across each of their similarity measures.
3-D Scatter Plot of Levenshtein, Color Moment, & Cosine Similarity Measures
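As a simplified sketch of the clustering and ranking step described above (using synthetic similarity scores and an assumed k of 5, not the values from our notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# Synthetic feature matrix: one row per candidate extension, with columns for
# the Levenshtein, cosine, and Color Moment (Hamming) similarity scores.
rng = np.random.default_rng(42)
features = rng.random((500, 3))
target = np.array([[1.0, 1.0, 1.0]])  # the target extension is maximally similar to itself

# Partition the candidates into k clusters across the three similarity measures.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(features)

# Rank candidates by Euclidean distance to the target vector; the closest
# points (and the clusters they fall in) are the leads worth a closer look.
distances = cdist(features, target).ravel()
closest = np.argsort(distances)[:10]
print("clusters of the 10 closest candidates:", kmeans.labels_[closest])
```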
In summary, we demonstrated how we can apply hashing and vectorization to encode and enrich our dataset, use algorithms to measure the similarity between vectors and use modeling to combine and cluster multiple elements of similarity. This approach allows us to start with any single extension, and quickly reduce our dataset of 140,000 extensions down to a handful of candidates for deeper reverse engineering analysis.
The notebook and data we’ve released create an independent way to explore and analyze Chrome Web Store extension data yourself. In addition, the process and metrics can be transferred to a variety of different Model-Assisted Threat Hunting or data exploration challenges using Splunk AI and DSDL. Can this approach be applied to find new threats in your environment?
Stay tuned for more updates from the SURGe team, and Happy Hunting!