A revolution is happening in the world of Natural Language Processing (NLP) and code generation. In May 2021, Microsoft unveiled Power Apps Ideas, its first product feature powered by the language model GPT-3. In June 2021, GitHub announced GitHub Copilot, a code editor extension that auto-completes code based on existing code and comments. And just a month ago (February 2022), DeepMind released AlphaCode, a system for solving programming challenges that achieved an estimated rank within the top 54% of participants in coding competitions.
These developments sparked our curiosity as to whether we could build our own “Copilot” for Splunk’s Search Processing Language (SPL). With a “Copilot” for Splunk SPL, our users could write a description of what they want to achieve in plain English and get suggested queries to execute. For example, for the English description “get number of Windows security events by user”, our “Copilot” might suggest the SPL query “sourcetype=windows_security | stats count by user”. Such a “Copilot” could make SPL more accessible for a wide variety of users and help them get to the results they need faster.
We made considerable progress and shared our learnings in our session at the NVIDIA GTC 2022 conference. In this accompanying blog, we’ll describe our research collaboration with the team at NVIDIA Morpheus, an open application framework for cybersecurity providers. We’ll walk through the data acquisition, model fine-tuning, and optimization. Although we have more research to do, we hope our learnings will help other engineers build their own “Copilot” for their products.
Splunk users interact with our products through a powerful yet complex domain-specific language called Search Processing Language (SPL). This language offers a lot of flexibility, allowing users to search and analyze machine data, security events, and observability logs. However, there is a learning curve for new SPL users, and even for more advanced users it takes time to craft the correct query. We believe our users could benefit from a product feature that translates plain English to the appropriate query.
The scope of SPL includes data searching, filtering, modification, manipulation, insertion, deletion, visualization, and more. There are analogies between SQL and SPL; SQL is used to manage and search relational database tables composed of columns, whereas SPL is designed to search indexed events composed of fields.
Semantic parsing is the task of translating natural language to a logical and formal machine-understandable form. Text-to-SQL is one such example, with practical applications such as building a natural language interface for databases. Semantic parsing involves several challenges, in particular evaluating semantic equivalence (similar descriptions for the same query, similar queries for the same description), ensuring predicted code executability, dealing with a lack of parallel data (natural language / code translation pairs), and dealing with data context and interdependencies. Choosing a good evaluation metric is very important: for example, exact string matching can result in overestimated false negatives, while execution result matching can result in overestimated false positives (as two different queries can by chance return the same result on some data).
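To make this trade-off concrete, here is a minimal sketch of the two metrics in Python, with a hypothetical `run_query` callable standing in for whatever executes a query and returns its results:

```python
def exact_match(predicted: str, gold: str) -> bool:
    # Overestimates false negatives: queries that differ only in formatting
    # or clause order are counted as wrong even when they are equivalent.
    return predicted.strip().lower() == gold.strip().lower()

def execution_match(predicted: str, gold: str, run_query) -> bool:
    # Overestimates false positives: two different queries can by chance
    # return the same result on the particular data they are run against.
    try:
        return run_query(predicted) == run_query(gold)
    except Exception:
        # The predicted query may not be executable at all.
        return False
```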
Under the hood, our “Copilot” uses a model that translates a plain English description to a corresponding SPL query, and it can suggest several queries to execute by sampling from this model at inference time. We’ll now describe the technical details of the translation model we used, and how we fine-tuned it.
The Transformer architecture, introduced in 2017 in the highly-cited paper Attention Is All You Need, is an encoder-decoder deep learning model for NLP that uses only attention mechanisms and has proved very successful for many tasks.
(Figure: the Transformer encoder-decoder architecture; image from the original paper, Attention Is All You Need.)
The Generative Pretrained Transformer 2 (GPT-2) is a publicly available pretrained model introduced by OpenAI in February 2019, with a final release in November 2019. It is based on a decoder-only Transformer architecture and was trained on the non-public WebText dataset, 40GB of text scraped from web pages. GPT-3 is a non-public, larger version of GPT-2 introduced by OpenAI in May 2020. In March 2021, EleutherAI produced and publicly released GPT-Neo, a replica of the GPT-3 architecture, along with its 825GB training dataset, the Pile.
The Text-to-Text Transfer Transformer (T5) is a publicly available pretrained model introduced by Google in February 2020. It is a standard encoder-decoder Transformer trained on the C4 dataset, a 750GB collection of English texts from the public Common Crawl web scrape with extensive deduplication as well as use of heuristics to extract only natural language.
In Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, the team behind the T5 model compared various model architectures and found that the standard encoder-decoder Transformer performed quite well on both language modeling and translation, while decoder-only models such as GPT are better suited to open-ended generation. Additionally, the best-performing solution on the Spider leaderboard, PICARD, first fine-tunes a large version of T5 and then constrains the model predictions during decoding. We decided to use T5 as our English-to-SPL translation model; T5 also has the advantage of being a much smaller model than GPT-3 and Codex while still being trained on a large amount of data.
More specifically, we used the following pretrained versions hosted on the Hugging Face model hub: t5-small, the smallest pretrained model released by Google, and codet5-small, which shares the same model architecture but was pretrained by Salesforce on code data.
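As a rough illustration of how these models can be loaded and sampled for several candidate queries, here is a minimal sketch using the Hugging Face transformers library (Salesforce/codet5-small is the hub ID for codet5-small; before fine-tuning, the generated text will of course not be SPL):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# "Salesforce/codet5-small" is the hub ID for codet5-small; "t5-small" works the same way.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

description = "get number of windows security events by user"
inputs = tokenizer(description.lower(), return_tensors="pt",
                   max_length=128, truncation=True)

# Sample several candidate queries so the user can pick one, as described above.
outputs = model.generate(**inputs, max_length=128, do_sample=True,
                         top_p=0.95, num_return_sequences=10)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)
```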
Without any pre-existing English/SPL translation pair dataset, we had to get creative. Thankfully, there were several data sources for us to get started. Our goal was to get as many English/SPL pairs as possible, ideally with real SPL queries from users.
The first source we found was the Splunk Community Forum, a StackOverflow-like forum where Splunk users post and answer questions, typically about SPL queries. We kept only the questions marked as solved whose accepted solution contained exactly one code block, and used the English question as the candidate translation of the SPL query. We then scraped the Splunk Online Documentation: we parsed the HTML pages looking for SPL queries (searching for <code> tags) and used the text that came right before each query as the candidate English translation (a sketch of this approach is shown below). We also downloaded around 150 apps from Splunkbase written by developers and scraped all the SPL queries from their source code; some queries had a comment describing them, which we used as the candidate English translation. Additionally, we downloaded several SPL manuals like Exploring Splunk, where many of the queries come with a great description, and we copied the English/SPL pairs by hand because the PDFs were hard to parse. Finally, we used the GoSplunk SPL Database, a public database of SPL queries where many posts also come with English descriptions.
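For the documentation scrape, here is a simplified sketch of the approach, assuming requests and BeautifulSoup; the actual pipeline crawled the full documentation site and applied additional filtering:

```python
import requests
from bs4 import BeautifulSoup

def extract_candidate_pairs(page_url):
    """Return (english, spl) candidates from one documentation page."""
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    pairs = []
    for code in soup.find_all("code"):
        query = code.get_text(strip=True)
        # Use the paragraph right before the code block as the candidate description.
        preceding = code.find_previous("p")
        description = preceding.get_text(strip=True) if preceding else ""
        if query and description:
            pairs.append((description, query))
    return pairs
```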
After we gathered thousands of translation pairs, we applied several cleaning routines (e.g. Unicode normalization), filtered out non-SPL code, and deduplicated the translation pairs. After initial model experiments, we learned that data quality is essential. Due to time constraints, we manually reviewed and corrected only a small sample of the pairs, sorting them by length first since we hypothesized that the model would perform better on short sentences.
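Here is a minimal sketch of that kind of cleaning, assuming the pairs are held as (english, spl) tuples; the real routines also filtered out non-SPL code:

```python
import unicodedata

def clean(text: str) -> str:
    # Normalize Unicode (e.g. smart quotes copied from web pages) and trim whitespace.
    return unicodedata.normalize("NFKC", text).strip()

def deduplicate(pairs):
    seen, unique = set(), []
    for english, spl in pairs:
        key = (clean(english).lower(), clean(spl))
        if key not in seen:
            seen.add(key)
            unique.append((clean(english), clean(spl)))
    return unique
```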
Below, we summarize each source with the number of samples:
Dataset | # Scraped Examples | # Manually Reviewed / Labeled Examples
Splunk Community Forum | 82,030 | 494
Splunk Online Documentation | 682 | 439
Splunkbase apps | 1,735 | 432
SPL manuals | 324 | 300
GoSplunk SPL Database | 609 | 42
Total | 85,380 | 1,707
As we can see, we ended up with 1,707 high-quality pairs. Because we didn't have time to manually review more data, we got creative with data augmentation.
We had an insight that if we could convert large benchmark English-to-SQL datasets into “English-to-SPL” datasets, then we could quickly expand our training dataset.
First, we used regular expressions to translate simple SQL queries into SPL queries. We generated 80k high quality English/SPL pairs by translating the SQL queries in WikiSQL into SPL queries. For example:
description | What is the current series where the new series began in June 2011? |
original SQL query | SELECT col4 AS result FROM table_1_1000181_1 WHERE col5 = "new series began in june 2011" |
human readable SQL query | SELECT Current series FROM table WHERE Notes = New series began in June 2011 |
human readable SPL query | sourcetype=table | where Notes = New series began in June 2011 | fields Current_series |
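Here is a simplified sketch of what such a regex-based conversion might look like for this single-table WikiSQL pattern; the actual rules covered more cases (e.g. aggregations):

```python
import re

# Matches the simplest "human readable" WikiSQL pattern:
#   SELECT <columns> FROM table [WHERE <condition>]
PATTERN = re.compile(
    r"SELECT\s+(?P<cols>.+?)\s+FROM\s+table(?:\s+WHERE\s+(?P<cond>.+))?\s*$",
    re.IGNORECASE,
)

def sql_to_spl(sql: str) -> str:
    match = PATTERN.match(sql.strip())
    if match is None:
        raise ValueError(f"unsupported query: {sql}")
    spl = "sourcetype=table"
    if match.group("cond"):
        spl += f" | where {match.group('cond')}"
    # Multi-word column names become underscore-separated field names in SPL.
    spl += " | fields " + match.group("cols").replace(" ", "_")
    return spl

print(sql_to_spl("SELECT Current series FROM table "
                 "WHERE Notes = New series began in June 2011"))
# sourcetype=table | where Notes = New series began in June 2011 | fields Current_series
```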
For more complex queries such as the ones in the Spider dataset, regular expressions were too hard to write. Thankfully, Splunk’s search team has a SQL-to-SPL compiler we could use. This way, we generated 8k more complex and high-quality pairs from Spider.
Here’s an example:
English | List all the businesses with more than 4.5 stars |
SQL | SELECT BUSINESSalias0.NAME FROM BUSINESS AS BUSINESSalias0 WHERE BUSINESSalias0.RATING > 4.5 ; |
SPL (from compiler) | index=BUSINESS | where RATING>4.5 | fields NAME |
Fine-tuning is performed on a single p3.2xlarge GPU instance. Sentences are converted to lowercase before being tokenized and padded, with the maximum length set to 128 tokens. The model is fine-tuned with a batch size of 64, half precision, and the Adam optimizer with a learning rate of 1e-3 and no weight decay, until the validation loss no longer improves (for the English/SPL pairs, improvement stops after about 5 epochs, resulting in approximately 10 total epochs of training).
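For reference, here is a rough sketch of this fine-tuning setup using the Hugging Face Seq2SeqTrainer. The dataset objects and the early-stopping patience of 5 are assumptions on our part, and Trainer uses AdamW rather than plain Adam, but with weight decay set to 0 it is close to the setup described above:

```python
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

def preprocess(example):
    # Lowercase and tokenize; truncate to the 128-token maximum described above.
    model_inputs = tokenizer(example["english"].lower(), max_length=128, truncation=True)
    labels = tokenizer(example["spl"].lower(), max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

args = Seq2SeqTrainingArguments(
    output_dir="en2spl",
    per_device_train_batch_size=64,
    learning_rate=1e-3,
    weight_decay=0.0,              # no weight decay
    fp16=True,                     # half precision
    num_train_epochs=50,           # upper bound; early stopping ends training sooner
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(preprocess),  # train_ds / val_ds: hypothetical Hugging Face
    eval_dataset=val_ds.map(preprocess),     # datasets with "english" and "spl" columns
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```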
To make sure our code and model (t5-small or codet5-small) are working as expected, we first trained and evaluated T5 on the WikiSQL and Spider datasets. To mimic our current setting for English/SPL, the model is only provided with the natural language description of the query. Because in this setting we use neither the database schema nor the database content, it is harder for the model to predict the correct table and column names (and impossible if they are never provided in any description).
Because our model is only provided with the natural language description of the query, we relax the problem to predicting the “human readable” queries instead of the original queries. The “human readable” queries have more semantic meaning but they are not executable because they lack query identifiers. Thus, our model does not have to predict the correct table and column names. In this context, we can’t directly compare our model performance with the WikiSQL leaderboard.
We use the original 70% train, 10% validation and 20% test split. Both t5-small and codet5-small perform as expected and are able to learn the simple syntax of the queries. This performance can be explained by the simple syntax pattern of the queries, the shortness of the sentences, and the problem relaxation to “human readable” queries. Pretraining on code data (codet5-small) hasn’t improved the model performance, probably because of the simplicity of the query syntax.
Model | BLEU Score | Exact Match
t5-small | 76.0 | 38.4% |
codet5-small | 74.8 | 37.2% |
Because there was not a significant difference between t5-small and codet5-small on the WikiSQL dataset, and due to time constraints, we only experimented with t5-small on the Spider dataset. Here, our model is fine-tuned to predict the original query. Because our model is only provided with the natural language description of the query, we reshuffle the data to avoid the zero-shot setting (so some table and column names of the dev set can be seen by our model at train time). For benchmark comparison, we additionally use the test suite available for Spider. The best-performing solutions on the Spider leaderboard achieve 75% execution accuracy, but they notably make use of the database contents (and additional tools such as SQL parsers, along with much larger models). In this preliminary work, we have not invested the time to use all the post-processing techniques that competitors (e.g. PICARD) have, because our primary goal is predicting SPL, not SQL. We plan to use them in future work for further model validation and to compare our predictions with theirs.
Given the added complexity, our model performs well, with almost a quarter of predicted queries executable and correct. But the model still predicts many queries that can't be executed, and the principal cause, as expected, is unknown table and column names.
Model | BLEU Score | Exact Match | Executable and Correct | Not Executable |
t5-small | 43.9 | 12.8% | 24.0% | 66.8% |
Based on the test suite results below, we also confirm that the model is performing better on easier queries (the difficulty label is provided in this dataset).
(Figure: test suite results broken down by query difficulty.)
After these two experiments, we were comfortable moving forward with the T5 architecture.
After filtering out the few long sentences (to set the maximum number of tokens at 128 without truncation), our English/SPL dataset, not augmented with any Text-to-SQL converted datasets, has 1,635 translation pairs (down from 1,707). Given the small size of our dataset, metrics are 5-fold cross-validated.
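A minimal sketch of that cross-validation loop (fine_tune and evaluate_bleu are hypothetical helpers wrapping the training and evaluation steps described above):

```python
import numpy as np
from sklearn.model_selection import KFold

# pairs: the 1,635 (english, spl) tuples described above.
bleu_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(pairs):
    train_fold = [pairs[i] for i in train_idx]
    val_fold = [pairs[i] for i in val_idx]
    model = fine_tune(train_fold)                       # hypothetical training helper
    bleu_scores.append(evaluate_bleu(model, val_fold))  # hypothetical BLEU evaluation helper
print(f"BLEU: {np.mean(bleu_scores):.1f} +/- {np.std(bleu_scores):.1f}")
```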
Based on our initial results below, there is a slight performance increase when using codet5-small (probably due to the code-like nature of SPL query syntax), which wasn't observed on WikiSQL. We decided to use codet5-small for the rest of the experiments.
Model | BLEU Score | Exact Match |
t5-small | 35.0 +/- 1.8 | 12.2% +/- 1.7 |
codet5-small | 39.4 +/- 5.7 | 17.7% +/- 4.8 |
To simplify training, we hypothesized that only the decoder really needed to be fine-tuned. We verified this hypothesis by freezing the embedding layer and all but the last layer in the encoder stack. As shown in the table below, this model achieves almost the same performance as the unfrozen model, but with only around 28M trainable parameters (down from around 60M trainable parameters), it is 30-50% faster to train.
Model | BLEU Score | Exact Match |
codet5-small+freeze | 39.2 +/- 3.4 | 17.7% +/- 2.5 |
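Here is a minimal sketch of this freezing scheme in PyTorch, following the layer names of the Hugging Face T5 implementation (the shared token embedding and all encoder blocks except the last one are frozen; the decoder stays trainable):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

# Freeze the shared token embedding.
for param in model.shared.parameters():
    param.requires_grad = False

# Freeze every encoder block except the last one; the decoder stays trainable.
for block in model.encoder.block[:-1]:
    for param in block.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{trainable / 1e6:.0f}M trainable parameters")
```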
To improve the quality of generated queries, we used the Splunk Python SDK to filter out syntactically invalid queries at inference time: the model is constrained to return its most likely syntactically valid prediction. Similar to AlphaCode, we also measured the performance of a model that returns, and is evaluated on, its 10 most likely predictions.
Model | BLEU Score | Exact Match | Exact Match Within Top 10 |
codet5-small | 42.9 | 20.2% | 28.7% |
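Here is a sketch of this filtering step using the Splunk Python SDK (splunklib); the connection details are hypothetical, and we use the search parser endpoint to check syntax without executing anything:

```python
import splunklib.client as client

# Hypothetical connection details for a reachable Splunk instance.
service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")

def is_valid_spl(query: str) -> bool:
    """Check syntax via the search parser endpoint; nothing is executed."""
    try:
        # Generated queries start with a filter (e.g. sourcetype=...), so we
        # prepend the implicit "search" command before parsing.
        service.parse("search " + query)
        return True
    except Exception:
        return False

def best_valid_prediction(candidates):
    """candidates: model predictions ordered from most to least likely."""
    for candidate in candidates:
        if is_valid_spl(candidate):
            return candidate
    return candidates[0]  # fall back to the raw top prediction
```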
Further analyzing the model's performance, we confirmed our intuition that the model performs better when it needs to predict short queries and when it is given short English descriptions. The small bump at the end of the distribution in the right-hand graph is probably due to high variability in the metric calculation (only approximately 0.9% of the pairs fall in this bucket, while 86.8% of the pairs fall in the first 3 buckets).
Based on our results, we decided to use codet5-small without freezing any layers.
Note: we also augmented our training dataset using the Text-to-SQL converted datasets (randomly sampling from each of them to keep the datasets balanced) and then fine-tuned the codet5-small model, but doing so actually resulted in a slight decrease in model performance. We'll investigate this further, but most likely the converted SQL/SPL queries do not contain information that is very helpful for a model evaluated on the hand-labeled queries, which contain many Splunk-specific functions that don't exist in WikiSQL or Spider. However, we think WikiSQL and Spider contain different ways of expressing logical operators that could still be useful for our model.
Splunk’s Search Tutorial is one of the best resources for learning SPL, and it provides data on which to run the learned queries. We use some English/SPL pairs extracted from this tutorial to showcase a few examples where the model performs well and where it fails.
Some predicted queries are “almost correct”, meaning they may only be missing an operation sign, a search command parameter, or a command. Even though these predictions are not executable directly, we expect that they would still be helpful for a user, for whom figuring out how to fix one of these queries would be easier than coming up with one from scratch. For example:
description | search the sourcetype field for any values that begin with access_. get events with status 200, action "purchase". then compute the most common categoryId values |
target | sourcetype=access_* status=200 action=purchase | top categoryId |
prediction | sourcetype=access_* status=200 actionpurchase | top categoryId |
description | search the sourcetype field for any values that begin with access_. get events with status 200, action "purchase". then compute the one most common clientip |
target | sourcetype=access_* status=200 action=purchase | top limit=1 clientip |
prediction | sourcetype=access_* status=200 actionpurchase | top clientip |
description | search the sourcetype field for any values that begin with access_. then get price as "Price" by productName, then rename productName column as "Product Name" |
target | sourcetype=access_* | stats values(price) AS Price BY productName | rename productName AS "Product Name" |
prediction | sourcetype=access_. | stats values(price) as price by productName | |
A few predictions are bad, or don't even make sense. As we saw previously, that might be because the English description and/or the query is long, or because the model has not learned some mathematical or logical operations. For the former, we could try splitting the long (and probably nested) queries into shorter ones. For the latter, we could try adding more examples. For example:
description | search the sourcetype field for any values that begin with access_, status 200, and action "purchase". use the chart command to count the number of purchases by using action="purchase". The search specifies the purchases made for each product by using categoryId. The difference is that the count of purchases is now an argument of the sparkline() function. |
target | sourcetype=access_* status=200 action=purchase | chart sparkline(count) AS "Purchases Trend" count AS Total BY categoryId | rename categoryId AS Category |
prediction | index=access_* status=200 action=purchase | chart count(price) |
description | search for the terms error, fail, failure, failed, or severe, in the events that also mention buttercupgames |
target | buttercupgames (error OR fail* OR severe) |
prediction | index=_internal fail* not [ search index=_internal error, fail* error |
Note: for almost all the pairs extracted from this tutorial, the correct query is contained within the top 10 predictions, which motivates the idea of giving several suggestions among which the user can choose.
We collaborated on this project with the NVIDIA Morpheus team, who have experience deploying large language models at scale and with the very low latency required for real-time use of our model. We previously collaborated with them last year on machine log parsing, work that produced two talks at the NVIDIA GTC 2021 conference (Parsing Machine Logs Faster And Cheaper With Triton and Parsing Machine Logs with Machine Learning) along with a blog post.
Morpheus is a developer framework for building high-performance cybersecurity workflows extremely quickly. Morpheus handles the orchestration, distribution, and monitoring of the model pipeline. Developers can swap in different data sources and models as their needs change, without interrupting the pipeline: for example, switching from input files to streaming input. They can also take advantage of GPU-accelerated libraries like NVIDIA RAPIDS and TensorRT.
(Figure: the Morpheus pipeline architecture; image from the NVIDIA Morpheus product page.)
The first step in moving our pipeline into Morpheus is to think about it in terms of stages. In our NLP setting, data is read (from file, from Kafka, etc.), pre-processed and tokenized, then fed to the model for inference, and the predictions are post-processed. Morpheus contains reusable parallelized pre-compiled stages for many pipelines, such as tokenizers, and it’s also possible to build custom stages.
Once the stages are ready, the Morpheus pipeline can be built via CLI. Under the hood, Morpheus will efficiently take care of the orchestration, management of all resources (CPU/GPU/Network), communication between distributed pipelines, and monitoring of the throughput and model drift over time. It automatically adjusts buffer sizing between stages to optimize hardware saturation, maximize throughput, and minimize latency. The backend engine is in C++ for optimized performance, but there is an easy-to-use Python interface.
While we are still researching and developing our solution, we can easily update the pipeline, for example switching from file I/O to streaming input via Kafka topics.
- Read and write to a file during prototyping, using the from-file and to-file stages.
- Read and write to a Kafka topic in production, using the from-kafka and to-kafka stages.
TensorRT is an SDK developed by NVIDIA “for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications”. By converting the codet5-small PyTorch model from Hugging Face to a TensorRT engine, we can take advantage of optimizations like minimizing GPU memory footprint by reusing memory, fusing layers and tensors, eliminating transpose operations, and selecting the appropriate data layers based on hardware. We can save the model, in ONNX format, as a stage inside Morpheus, and at deployment time, Morpheus will convert the model to the appropriate TensorRT engine based on the GPU hardware.
Unlike GPT, which is an autoregressive decoder only, T5 has both an encoder and an autoregressive decoder. According to this technical blog from NVIDIA, the currently suggested way to turn T5 into an optimized TensorRT engine for inference is to convert the encoder and decoder separately. They report 21x lower latency for TensorRT inference compared to PyTorch on CPU using the T5-3B model. Read the blog for more details about the conversion and the optimizations made.
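As a starting point, here is a rough sketch of exporting just the encoder to ONNX with a small wrapper; the decoder, with its autoregressive loop, needs separate and more careful treatment, as described in the NVIDIA blog:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

class EncoderWrapper(torch.nn.Module):
    """Wrap the T5 encoder so it returns a plain tensor, as ONNX export expects."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small").eval()

example = tokenizer("get number of windows security events by user",
                    return_tensors="pt")

# Export the encoder; batch and sequence dimensions are kept dynamic.
torch.onnx.export(
    EncoderWrapper(model.encoder),
    (example["input_ids"], example["attention_mask"]),
    "codet5_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["hidden_states"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "attention_mask": {0: "batch", 1: "sequence"},
                  "hidden_states": {0: "batch", 1: "sequence"}},
    opset_version=13,
)
```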
Latency is a critical requirement for real-time applications; if not taken into account, it has a significant negative impact on the user experience. Our goal was a full pipeline latency of 100 milliseconds or less, which feels instantaneous to human perception. We ran experiments using t5-small to compare inference latency on different runtimes, using 1,200 English/SPL translation pairs of varying lengths. We used a single V100 32GB GPU with a dual 20-core Intel Xeon CPU, and ran experiments with a batch size of 1. In our dataset, the average number of input tokens was 32 and the average number of output tokens was 21. For our target-length experiments we compared latency using 32 input tokens, and for our source-length experiments we compared latency using 32 output tokens. The pipeline was on average 5x faster using TensorRT in Morpheus than PyTorch on CPU. With longer English inputs, such as 260 and 505 tokens, we see a 10x speedup compared to PyTorch on CPU.
Hopefully, this blog will be useful for engineers interested in building a "Copilot" for their own products. We still have more research to do to overcome some of the model's limitations: most notably, it only performs well on simple queries with concise and clear English descriptions that include index and field names. To get there, we think we should focus in particular on improving data quality and increasing dataset size, leveraging more data and context along with existing SPL parsers, implementing post-processing techniques to fix "almost correct" predictions, and providing explanations for the suggested queries (e.g., a description of what a query is doing).
If you have any questions, please feel free to reach out to us. Thank you!
This blog was co-authored by Julien Veron Vialard, Abraham Starosta, and Rachel Allen.
Julien Veron Vialard is an Applied Scientist at Splunk, where he has been working on Machine Learning with Graphs and Natural Language Processing problems. He received his M.S. in Computational and Mathematical Engineering from Stanford University in June 2021, where his research focused on biostatistics and convolutional neural networks for medical imaging. Prior to Stanford, Julien interned at quantitative trading firms.
Abraham Starosta is a Senior Applied Scientist at Splunk, where he works on Natural Language Processing and other ML problems. Prior to Splunk, Abraham was an NLP engineer at high-growth technology startups like Primer and Livongo, and he interned at Splunk in 2014. He completed his B.S. and M.S. in Computer Science at Stanford, where his research focused on weak supervision and multitask learning.
Rachel Allen is a senior cybersecurity data scientist on the Morpheus team at NVIDIA. Her focus is the research and application of GPU-accelerated machine learning methods to help solve information security challenges. Prior to NVIDIA, Rachel was a lead data scientist at Booz Allen Hamilton where she designed a variety of capabilities for advanced threat hunting and network defense. She holds a bachelor’s degree in cognitive science and a PhD in neuroscience from the University of Virginia.
Special thanks to Kristal Curtis (Engineering Manager), Joseph Ross (Senior Principal Applied Scientist) and Donald Thompson (Distinguished Engineer) for their encouragement, help, and ideas.