The recent surge in popularity of generative artificial intelligence tools has raised many questions about potential use cases in cybersecurity, from both offensive and defensive points of view. Will chat-based AI assistants provide more utility to attackers, or to defenders?
Security researchers have theorized that AI assistants can improve the efficacy and scale of spear phishing. Could this new technology enable attackers to craft more sophisticated messages, or to cross the language barrier more effectively and target organizations outside their native language? SURGe, Splunk’s strategic security research team, decided to look more closely at generative AI’s ability to translate email prompts from traditional Chinese, Korean, Farsi, and Russian into English. This blog outlines our methodology for this research as well as our preliminary analysis and results.
This research examines generative AI’s ability to create or translate spear phishing emails into English. We are testing the claim that this technology will empower threat actors to improve the scale and/or efficiency of their spear phishing campaigns. Our first question: do generative AI services (large language models, or LLMs) provide better natural language translations than legacy online translation services? Are there red flags that would tip off an analyst to an online translation, or some AI-generated irregularity? Our hypothesis is that a surveyed population will not be able to distinguish between AI-translated and online-translated spear phishing content (i.e., a classification accuracy of approximately 50%).
Second, we will explore whether there are stylistic differences between LLMs and legacy online translation services when translating emails from various languages into English. Do the mechanics of LLMs better prevent poor translations, grammatical errors, or fragmented and unnatural language? To test this claim, we employ a variety of natural language measures to assess the style of a text in its original, human-authored state and in its translated state, from various source languages to English, using legacy and generative AI translation services. Our second hypothesis is that we will observe a significant difference (more than 20%) in selected readability metrics between generative AI and legacy translation services.
To start, we drafted three English-language spear phishing emails. Four security professionals, each fluent in one of Chinese, Korean, Farsi, or Russian, translated the email prompts from English into their native languages. We asked these participants to translate the emails in a conversational tone, as if they were writing the email themselves. We then ran the output of those four translations through three generative AI tools, which we will refer to in this blog as Gen-AI 1, Gen-AI 2, and Gen-AI 3, and two legacy online translation tools, which we will refer to as Legacy Translator 1 and Legacy Translator 2. We instructed these tools to translate the email prompts from each language back into American English.
Prompt requesting that Gen-AI 1 translate an email from Russian to English.
Translation response provided by Gen-AI 1.
To evaluate our second hypothesis, about the natural language readability of an email, we wrote a scoring script that compares the legacy and AI-generated translations across 21 different metrics encompassing grammar, readability, and urgency (scroll down for the full list of 21 metrics). The scoring used a variety of tools, including the Python Natural Language Toolkit (NLTK), TextBlob, and custom code, to identify features like sentence structure, punctuation, and spelling and grammar errors. Many features, such as the number of question marks or the inclusion of certain terms, could be calculated quickly with simple Python.
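For illustration, here is a minimal sketch of the kind of per-email scoring logic involved. It assumes the nltk and textblob packages (plus NLTK’s punkt tokenizer data), and the metric names and urgency term list are placeholders rather than our exact 21 metrics:

```python
# A rough sketch of per-email scoring (illustrative; not our exact script or metric list).
# Assumes the nltk and textblob packages, plus NLTK's "punkt" tokenizer data.
from nltk.tokenize import sent_tokenize, word_tokenize
from textblob import TextBlob

URGENCY_TERMS = {"urgent", "immediately", "asap", "deadline", "now"}  # hypothetical term list

def score_email(body: str) -> dict:
    sentences = sent_tokenize(body)
    words = [w for w in word_tokenize(body) if w.isalpha()]
    # TextBlob's spelling correction; differing tokens give a rough misspelling count
    corrected = str(TextBlob(body).correct())
    misspellings = sum(1 for a, b in zip(body.split(), corrected.split()) if a != b)
    return {
        "sentence_count": len(sentences),
        "mean_words_per_sentence": len(words) / max(len(sentences), 1),
        "question_marks": body.count("?"),
        "exclamation_marks": body.count("!"),
        "urgency_terms": sum(1 for w in words if w.lower() in URGENCY_TERMS),
        "suspected_misspellings": misspellings,
    }

print(score_email("Please act now! Your account expires immediately. Reply ASAP?"))
```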
NLP Techniques to Measure Writing Quality:
Our team ingested the multilingual phishing corpus into Splunk, using the Data Science Deep Learning app to run our script and enrich the data with more than 40 natural language metrics. To normalize the email bodies, we first stripped HTML and extraneous metadata in a Jupyter notebook. We then loaded the data into Splunk with annotations for sourcing and, most importantly, a unique correlation hash to sync each email with its score.
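As a rough illustration of that normalization step (assuming the beautifulsoup4 package; the field names here are hypothetical, not our exact schema):

```python
# A minimal sketch of email normalization before loading into Splunk
# (assumes beautifulsoup4; field names are hypothetical).
import hashlib
from bs4 import BeautifulSoup

def normalize_email(raw_html: str, translator: str, source_language: str) -> dict:
    # strip HTML and collapse whitespace to normalize the email body
    body = " ".join(BeautifulSoup(raw_html, "html.parser").get_text(separator=" ").split())
    return {
        "body": body,
        "translator": translator,            # e.g. "Gen-AI 1" or "Legacy Translator 2"
        "source_language": source_language,  # e.g. "Russian"
        # unique correlation hash to sync each email with its score in Splunk
        "correlation_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```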
Lastly, we created a survey that included ten of our translated emails and sent it to 100 people to answer the question: “Can the general population discern between AI-translated and legacy-translated spear phishing emails?” Participants were asked to classify whether or not each of the ten emails was translated by AI.
To evaluate our two hypotheses, the SURGe team analyzed the survey results and a few of the key language metrics. First, we needed to determine whether the surveyed population could distinguish between AI-translated and legacy, online-translated spear phishing content. Across the 100 surveys, 48% of responses were accurate. A one-sample t-test determined that our population mean did not deviate significantly from the expected value (50% accuracy). The test failed to reject our null hypothesis: there is no significant difference from chance in respondents’ ability to distinguish between the two types of emails.
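For readers who want to reproduce this kind of test, here is a minimal sketch using SciPy; the per-respondent accuracy values below are simulated placeholders, not our survey data:

```python
# One-sample t-test against chance-level accuracy (assumes scipy and numpy;
# the data below are simulated placeholders, not our survey results).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
accuracies = rng.binomial(10, 0.48, size=100) / 10  # each respondent classified 10 emails

t_stat, p_value = stats.ttest_1samp(accuracies, popmean=0.5)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p > 0.05 -> fail to reject the null hypothesis
```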
Next, we looked for significant variance in each of our ~40 language metrics across the translation tools, source languages, and destination/translated language.
How we looked for significant variance in the data: we flagged metrics that appeared to be outliers and investigated each one further.
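A minimal sketch of that outlier search, assuming the enriched scores are exported to a pandas DataFrame (the file and column names are hypothetical):

```python
# Flag metrics that deviate more than 20% from the human-written English baseline
# (assumes pandas; "email_scores.csv" and its columns are hypothetical).
import pandas as pd

df = pd.read_csv("email_scores.csv")
baseline = df[df["translator"] == "original"].mean(numeric_only=True)
grouped = df.groupby(["translator", "source_language"]).mean(numeric_only=True)

pct_variance = 100 * (grouped - baseline) / baseline   # percent change vs. baseline
outliers = pct_variance[pct_variance.abs() > 20].dropna(how="all")
print(outliers)
```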
We were careful to apply measures like Dale-Chall, Flesch-Kincaid, and others only to English text, as these measures were originally developed for English and may not transfer to other languages. For the analysis, the original English emails (those that were translated into the four languages) provide a baseline for these metrics. Evaluating the same measures after the emails are translated back by the generative AI models enables an equal English-to-English comparison.
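As an illustration of that English-to-English comparison, here is a minimal sketch using the textstat package (the sample sentences are placeholders, not emails from our corpus):

```python
# Compare readability of the English baseline against a round-trip translation
# (assumes the textstat package; sample texts are placeholders).
import textstat

def readability(text: str) -> dict:
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "dale_chall": textstat.dale_chall_readability_score(text),
    }

baseline = readability("Please review the attached invoice and confirm receipt by Friday.")
roundtrip = readability("Kindly review attached invoice and do confirmation until Friday.")
# percentage variance against the baseline, compared to our 20% threshold
pct_change = {k: 100 * (roundtrip[k] - baseline[k]) / baseline[k] for k in baseline}
print(pct_change)
```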
Radar plots per language, by generator, which allow the different languages to be overlaid for easier comparison.
Our baseline plot contains language measurements for the original, human-written English spear phishing emails. Against that baseline, we can see slight deviations depending on the translation platform and the source language of the translated text. When measured as a percentage variance, several of these transformations exceed our hypothesized threshold of a 20% difference.
Highlighting +/-20% differences in Natural Language Metrics Through Translation (blue=positive, green=negative)
Further analysis is needed to determine whether these significant differences consistently represent the effect of translating via these sources.
While translating the emails through the various services, we made some fascinating findings that we want to share. During the translation process, we noticed that some of the LLMs were better at understanding the nuances of each language. Below are some examples:
Our jobs, and our adversaries’ jobs, are safe from AI for now. This research demonstrates that people have close to a 50% accuracy rate when determining whether an email was translated by traditional methods or by generative AI tools. Our metrics also did not show a significant difference in translation quality between legacy translation tools and Gen-AI tools. What we did notice is that, across all the metrics, Gen-AI 2 was an outlier, with more mean words per paragraph, higher syllable counts, and shorter sentence lengths. This is something we want to dive into further.
So what’s next? Going through the data, we have found additional questions that are worth exploring. Ultimately, we hope to better understand the impact that AI may have on an adversary’s ability to craft convincing spear phishing emails in order to empower network defenders. Ideally, we hope to explore traditional and newly-evolving machine learning methodologies for grouping and identifying messages to improve detection capabilities.
Stay tuned for more content in the coming months.
As always, security at Splunk is a family business. Credit to authors and collaborators: Tamara Chacon, Ryan Kovar, James Hodgkinson, Ryan Fetterman, Audra Streetman.