The recent surge in popularity of generative artificial intelligence tools has raised many questions about potential use cases in cybersecurity, from both offensive and defensive points of view. Will chat-based AI assistants provide more utility to attackers, or to defenders?
Security researchers have theorized that AI assistants can improve the efficacy and scale of spear phishing. Could this new technology enable attackers to craft more sophisticated messages, or to cross the language barrier more effectively and target organizations outside their native language? SURGe, Splunk’s strategic security research team, decided to look more closely at generative AI’s ability to translate email prompts from traditional Chinese, Korean, Farsi, and Russian into English. This blog outlines our methodology for this research as well as our preliminary analysis and results.
This research examines generative AI’s ability to create or translate spear phishing emails into English. We are testing the claim that this technology will empower threat actors to improve the scale and/or efficiency of their spear phishing campaigns. Our first question: do generative AI services (large language models, or LLMs) provide better natural language translations than legacy online translation services? Are there red flags that would tip off an analyst to an online translation, or some AI-generated irregularity? Our hypothesis is that a surveyed population will not be able to distinguish between AI-translated and online-translated spear phishing content (i.e., a classification accuracy of approximately 50%).
Second, we will explore whether there are stylistic differences between LLMs and legacy online translation services when translating emails from various languages into English. Do the mechanics of LLMs better prevent poor translations, grammatical errors, or fragmented and unnatural language? To test this claim, we employ a variety of natural language measures to assess the style of a text in its original, human-authored state and in its translated state, from various source languages to English, using legacy and generative AI translation services. Our second hypothesis is that we will observe a significant difference (more than 20%) in selected readability metrics between generative AI and legacy translation services.
To start, we drafted three English-language spear phishing emails. Four security professionals, each fluent in one of Chinese, Korean, Farsi, or Russian, translated the email prompts from English into their native languages. We asked these participants to translate the emails in a conversational tone, as if they were writing the email themselves. We then ran the output of those four translations through three generative AI tools, which we will refer to in this blog as Gen-AI 1, Gen-AI 2, and Gen-AI 3, and two legacy online translation tools, which we will refer to as Legacy Translator 1 and Legacy Translator 2. We instructed these tools to translate the email prompts from each language back into American English.
Prompt requesting that Gen-AI 1 translate an email from Russian to English.
Translation response provided by Gen-AI 1.
To evaluate our second hypothesis, about the natural language readability of an email, we wrote a scoring script that compares the legacy and AI-generated translations across 21 different metrics encompassing grammar, readability, and urgency (scroll down for the full list of 21 metrics). The scoring used a variety of tools, including the Python Natural Language Toolkit (NLTK), TextBlob, and custom code, to identify features like sentence structure, punctuation, and spelling and grammar errors. Many features, such as the number of question marks or the inclusion of certain terms, could be calculated quickly with simple Python.
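For illustration, here is a minimal sketch of the kind of per-email scoring logic involved. It assumes the nltk and textblob packages (plus NLTK’s punkt tokenizer data), and the metric names and urgency term list are placeholders rather than our exact 21 metrics:

```python
# A rough sketch of per-email scoring (illustrative; not our exact script or metric list).
# Assumes the nltk and textblob packages, plus NLTK's "punkt" tokenizer data.
from nltk.tokenize import sent_tokenize, word_tokenize
from textblob import TextBlob

URGENCY_TERMS = {"urgent", "immediately", "asap", "deadline", "now"}  # hypothetical term list

def score_email(body: str) -> dict:
    sentences = sent_tokenize(body)
    words = [w for w in word_tokenize(body) if w.isalpha()]
    # TextBlob's spelling correction; differing tokens give a rough misspelling count
    corrected = str(TextBlob(body).correct())
    misspellings = sum(1 for a, b in zip(body.split(), corrected.split()) if a != b)
    return {
        "sentence_count": len(sentences),
        "mean_words_per_sentence": len(words) / max(len(sentences), 1),
        "question_marks": body.count("?"),
        "exclamation_marks": body.count("!"),
        "urgency_terms": sum(1 for w in words if w.lower() in URGENCY_TERMS),
        "suspected_misspellings": misspellings,
    }

print(score_email("Please act now! Your account expires immediately. Reply ASAP?"))
```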
NLP Techniques to Measure Writing Quality:
Our team ingested the multilingual phishing corpus into Splunk, using the Data Science Deep Learning app to run our script and enrich the data with more than 40 natural language metrics. To normalize the email bodies, we first stripped HTML and extraneous metadata in a Jupyter notebook. We then loaded the data into Splunk with annotations for sourcing and, most importantly, a unique correlation hash to sync each email with its score.
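As a rough illustration of that normalization step (assuming the beautifulsoup4 package; the field names here are hypothetical, not our exact schema):

```python
# A minimal sketch of email normalization before loading into Splunk
# (assumes beautifulsoup4; field names are hypothetical).
import hashlib
from bs4 import BeautifulSoup

def normalize_email(raw_html: str, translator: str, source_language: str) -> dict:
    # strip HTML and collapse whitespace to normalize the email body
    body = " ".join(BeautifulSoup(raw_html, "html.parser").get_text(separator=" ").split())
    return {
        "body": body,
        "translator": translator,            # e.g. "Gen-AI 1" or "Legacy Translator 2"
        "source_language": source_language,  # e.g. "Russian"
        # unique correlation hash to sync each email with its score in Splunk
        "correlation_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```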
Lastly, we created a survey that included ten of our translated emails and sent it to 100 people to answer the question: “Can the general population discern between AI-translated and legacy-translated spear phishing emails?” Participants were asked to classify whether or not each of the ten emails was translated by AI.
To evaluate our two hypotheses, the SURGe team analyzed the survey results and a few of the key language metrics. First, we needed to determine whether the surveyed population could distinguish between AI-translated and legacy, online-translated spear phishing content. Across the 100 surveys, 48% of responses were accurate. A one-sample t-test determined that our population mean did not deviate significantly from the expected value (50% accuracy). The test failed to reject our null hypothesis: there is no significant difference from chance in respondents’ ability to distinguish between the two types of emails.
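For readers who want to reproduce this kind of test, here is a minimal sketch using SciPy; the per-respondent accuracy values below are simulated placeholders, not our survey data:

```python
# One-sample t-test against chance-level accuracy (assumes scipy and numpy;
# the data below are simulated placeholders, not our survey results).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
accuracies = rng.binomial(10, 0.48, size=100) / 10  # each respondent classified 10 emails

t_stat, p_value = stats.ttest_1samp(accuracies, popmean=0.5)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p > 0.05 -> fail to reject the null hypothesis
```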
Next, we looked for significant variance in each of our ~40 language metrics across the translation tools, source languages, and destination/translated language.
How we looked for significant variance in the data: we flagged metrics that appeared to be outliers and investigated each one further.
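A minimal sketch of that outlier search, assuming the enriched scores are exported to a pandas DataFrame (the file and column names are hypothetical):

```python
# Flag metrics that deviate more than 20% from the human-written English baseline
# (assumes pandas; "email_scores.csv" and its columns are hypothetical).
import pandas as pd

df = pd.read_csv("email_scores.csv")
baseline = df[df["translator"] == "original"].mean(numeric_only=True)
grouped = df.groupby(["translator", "source_language"]).mean(numeric_only=True)

pct_variance = 100 * (grouped - baseline) / baseline   # percent change vs. baseline
outliers = pct_variance[pct_variance.abs() > 20].dropna(how="all")
print(outliers)
```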
We were careful to apply measures like Dale-Chall, Flesch-Kincaid, and others only to English text, as these measures were originally developed for English and may not transfer to other languages. For the analysis, the original English emails (those that were translated into the four languages) provide a baseline for these metrics. Evaluating the same measures after the emails are translated back by the generative AI models enables an equal English-to-English comparison.
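As an illustration of that English-to-English comparison, here is a minimal sketch using the textstat package (the sample sentences are placeholders, not emails from our corpus):

```python
# Compare readability of the English baseline against a round-trip translation
# (assumes the textstat package; sample texts are placeholders).
import textstat

def readability(text: str) -> dict:
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "dale_chall": textstat.dale_chall_readability_score(text),
    }

baseline = readability("Please review the attached invoice and confirm receipt by Friday.")
roundtrip = readability("Kindly review attached invoice and do confirmation until Friday.")
# percentage variance against the baseline, compared to our 20% threshold
pct_change = {k: 100 * (roundtrip[k] - baseline[k]) / baseline[k] for k in baseline}
print(pct_change)
```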
Radar plots per language, by generator, which allow the different languages to be overlaid for easier comparison.
Our baseline plot contains language measurements for the original, human-written English spear phishing emails. Against that baseline, we can see slight deviations depending on the translation platform and the source language of the translated text. When measured as a percentage variance, several of these transformations exceed our hypothesized threshold of a 20% difference.
Highlighting +/-20% differences in Natural Language Metrics Through Translation (blue=positive, green=negative)
Further analysis is needed to determine whether these significant differences consistently represent the effect of translating via these sources.
While translating the emails through the various services, we made some fascinating findings that we want to share. During the translation process, we noticed that some of the LLMs were better at understanding the nuances of each language. Below are some examples:
Our jobs, and our adversaries’ jobs, are safe from AI for now. This research demonstrates that people have close to a 50% accuracy rate when determining whether an email was translated by traditional methods or by generative AI tools. Our metrics also did not show a significant difference in translation quality between legacy translation tools and Gen-AI tools. What we did notice is that, across all the metrics, Gen-AI 2 was an outlier, with more mean words per paragraph, higher syllable counts, and shorter sentence lengths. This is something we want to dive into further.
So what’s next? Going through the data, we have found additional questions that are worth exploring. Ultimately, we hope to better understand the impact that AI may have on an adversary’s ability to craft convincing spear phishing emails in order to empower network defenders. Ideally, we hope to explore traditional and newly-evolving machine learning methodologies for grouping and identifying messages to improve detection capabilities.
Stay tuned for more content in the coming months.
As always, security at Splunk is a family business. Credit to authors and collaborators: Tamara Chacon, Ryan Kovar, James Hodgkinson, Ryan Fetterman, Audra Streetman.