Last year, with the release of GitHub Copilot and other LLMs, we noticed that a quiet revolution was beginning to unfold in the world of Natural Language Processing and Code Generation. We took advantage of these recent developments to build “SPL Copilot for Splunk (beta),” a language model fine-tuned on the task of translating English into Splunk’s Search Processing Language (SPL). SPL allows our users to search and analyze their machine data, security events, and observability logs. But SPL has a learning curve, and the beta version we released at .conf22 helps lower the barrier to entry.
Since then, we have witnessed a surge of innovation from both the research community and private research labs, and the adoption of AI in everyday tasks has spread like wildfire. Thanks to early adoption and continuous product improvement, we were able to evolve the SPL Copilot for Splunk into a much richer, more guided experience with Splunk and SPL, and renamed it the “Splunk AI Assistant.”
In the example above, a security analyst wants to find Windows security logs with failed login events. The model knows which index and sourcetype to use, and knows it has to filter by EventCode=4625. It also provides a step-by-step explanation of the predicted SPL query and suggests a list of related Splunk documentation.
SPL is a powerful but complex domain-specific language. New users face a learning curve, and even advanced users can struggle to unlock the full power of SPL. They may need to dig through documentation or search the internet for hints and examples. To help customers write SPL queries using natural language prompts, we released the “SPL Copilot for Splunk (beta)” last year on Splunkbase. The beta received a warm welcome when we demoed it to customers at .conf22: we hosted a session with about 50 people and demoed it on the show floor. At the time, our simple translation model was impressive. Still, pieces were missing, and ChatGPT has since raised the bar considerably. Customers want to interact with AI assistants conversationally, and they expect the assistant to support many more tasks, at a higher level of accuracy.
Building on our success from last year, the Splunk AI Assistant can do much more:
We also integrated an open-book question answering system within the assistant, so users can directly get insights without having to search our extensive documentation.
Customers could use ChatGPT to write SPL queries from English prompts, so why use Splunk? With ChatGPT, you may be sending potentially sensitive information to an external provider; a built-in, Splunk-owned solution, by contrast, offers the same high level of privacy and security that Splunk customers trust. Isn’t there a good open-source replication? After conducting experiments across proprietary and open-source models, we found that almost none of them produce satisfying SPL queries. ChatGPT/GPT-4 is the only exception, and it still suffers from hallucinations, for example inventing search commands or arguments that don’t exist. These hallucinations most likely happen because there aren’t many public examples of SPL that could have been scraped automatically and used to train these large models, so they confuse SPL with other languages that are better represented in their training data.
In addition, an assistant without specific knowledge of our products isn’t of much use to our customers. While building the Splunk AI Assistant, we taught it Splunk-specific knowledge.
Last year, we used the Text-to-Text Transfer Transformer (T5), a publicly available pretrained model introduced by Google in February 2020. It is a standard encoder-decoder Transformer trained on the C4 dataset, a 750GB collection of English texts from the public Common Crawl web scrape. We fine-tuned codet5-small, a 60M-parameter variant, on about 2k training examples of English-to-SPL translation. Such fine-tuning can be done on a single V100 GPU for a couple of dollars. This year, we refreshed our codet5-small model by mixing different training objectives (e.g., writing an SPL query from an English description, generating a multi-step English description of an SPL query, etc.) and augmenting our training set with synthetically generated data and data contributed by Splunk employees. The result was a training set 300 times larger than last year’s.
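To make the fine-tuning recipe concrete, here is a minimal sketch of a single training step using the Hugging Face transformers library. The task prefix, example query, and field values are illustrative rather than our actual internal training data; in practice this loop runs over the full mixed-objective training set.

```python
# Minimal sketch of English-to-SPL fine-tuning with codet5-small.
# The task prefix and example pair are illustrative only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-small")

english = "count failed Windows logins by user over the last 24 hours"
spl = "index=wineventlog EventCode=4625 earliest=-24h | stats count by user"

# Different training objectives can be expressed as different task prefixes.
inputs = tokenizer("translate English to SPL: " + english,
                   return_tensors="pt", truncation=True)
labels = tokenizer(spl, return_tensors="pt", truncation=True).input_ids

loss = model(**inputs, labels=labels).loss
loss.backward()  # in practice, wrap this in an optimizer loop or a Trainer
```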
While expanding our training set, we also decided to explore training more recent and much larger language models. We began by evaluating several open-source LLMs to determine which performed best at zero-shot SPL generation as a baseline. Among all the open-source models, StarCoder and StarCoder Plus came out on top. These are 15B-parameter language models developed by BigCode and trained specifically for code generation and code completion.
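As a rough illustration, the zero-shot baseline boils down to prompting each model with an English description and inspecting the SPL it produces. The snippet below sketches this with StarCoder Plus via transformers; the prompt wording and generation settings are stand-ins, not our exact evaluation harness.

```python
# Sketch of zero-shot SPL generation used for baselining open-source LLMs.
# Prompt wording and decoding parameters are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bigcode/starcoderplus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a Splunk SPL query that finds failed Windows login events.\nSPL: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))
```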
To train such large models, we initially used more recent A100 GPUs with 40GB of memory. GPU memory during training is mostly used for model weights, optimizer states, gradients, forward activations saved for gradient computation, and temporary buffers. For StarCoder trained in single precision without optimization, this amounts to about 240GB (roughly 60GB of fp32 weights, 60GB of gradients, and 120GB of Adam optimizer states), plus forward activations and temporary buffers. We used PyTorch’s Fully Sharded Data Parallel (FSDP) to shard the model weights, gradients, and optimizer states across 8 A100 GPUs (each with 40GB of memory) in a single-node instance. FSDP gave us fine-grained control over how we could wrap and distribute our model. To avoid running out of memory, we further reduced our memory footprint by using gradient checkpointing. Training in half precision and in mixed precision can also speed up training while reducing memory overhead. Since A100 GPUs support bfloat16, loss scaling isn’t required for mixed precision. The effect of mixed precision on memory actually varies depending on model architecture: there is a trade-off between storing an additional copy of the model weights in fp16 and storing activations in fp16 instead of fp32.
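The sketch below shows how such a setup can be wired together with FSDP, gradient checkpointing, and bf16 mixed precision, assuming a torchrun launch on a single 8-GPU node. The wrapping policy, learning rate, and dtypes are illustrative rather than our exact configuration.

```python
# Sketch: shard StarCoder across 8 GPUs with PyTorch FSDP, gradient
# checkpointing, and bf16 mixed precision (assumes a torchrun launch).
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.gpt_bigcode.modeling_gpt_bigcode import GPTBigCodeBlock

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
model.gradient_checkpointing_enable()  # recompute activations to save memory

model = FSDP(
    model,
    # Shard at the transformer-block level so each GPU only materializes a
    # slice of the weights, gradients, and optimizer states at a time.
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={GPTBigCodeBlock}
    ),
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,   # bf16: no loss scaling needed on A100s
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```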
In theory, mixed precision could reduce memory usage by up to half, enabling a doubled batch size. We also experimented with DeepSpeed ZeRO-3, since Hugging Face's Accelerate implementation is very convenient and supports additional fine-tuning techniques such as LoRA. LoRA is a parameter-efficient fine-tuning technique that freezes the initial weights and introduces a small number of trainable new weights. With LoRA, we effectively trained only ~0.22% of our model's parameters. Later on, we got access to A100 GPUs with 80GB of memory, which, alongside the DeepSpeed optimizations, helped us improve our time to convergence.
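Here is a hedged sketch of the LoRA side of that setup using the peft library. The rank, alpha, dropout, and target modules are illustrative choices rather than our exact hyperparameters, and the DeepSpeed ZeRO-3 configuration itself lives in an Accelerate config file rather than in this code.

```python
# Sketch: parameter-efficient fine-tuning of StarCoder with LoRA via peft.
# Rank, alpha, dropout, and target modules are illustrative values.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers in GPTBigCode
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Only the small LoRA matrices are trainable; the base weights stay frozen.
model.print_trainable_parameters()
```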
To turn the model into a conversational assistant, we further fine-tuned it on dialogue data formatted with special tokens, such as ChatML. Also, to make training more compute-efficient, training examples are concatenated into chunks of a fixed size instead of padding each example individually.
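Concretely, the dialogue formatting and packing could look like the sketch below; the special tokens follow the ChatML convention and the chunk size is an arbitrary example, not necessarily what we ship.

```python
# Sketch: ChatML-style dialogue formatting and fixed-size example packing.
# Token strings and chunk size are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)  # remember to resize the model's token embeddings accordingly

def format_chatml(turns):
    """Render (role, content) turns with ChatML-style delimiters."""
    return "".join(
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in turns
    )

def pack(texts, chunk_size=2048):
    """Concatenate tokenized examples and cut them into fixed-size chunks,
    instead of padding each example to the maximum length."""
    ids = []
    for text in texts:
        ids.extend(tokenizer(text).input_ids + [tokenizer.eos_token_id])
    return [ids[i:i + chunk_size]
            for i in range(0, len(ids) - chunk_size + 1, chunk_size)]

dialogue = format_chatml([
    ("user", "How do I count failed Windows logins?"),
    ("assistant", "index=wineventlog EventCode=4625 | stats count"),
])
chunks = pack([dialogue] * 100)
```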
We additionally built a Retriever-Reader system for Splunk documentation. To do so, we scraped Splunk documentation and fine-tuned pretrained embedding models from Sentence Transformers on the scraped content.
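The retriever half of that system can be sketched as a simple embedding search. The snippet below uses an off-the-shelf Sentence Transformers model and two toy documents as stand-ins for our fine-tuned model and scraped corpus.

```python
# Sketch of the retriever: embed the documentation corpus and run a
# cosine-similarity search. Model name and documents are placeholders.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for our fine-tuned model

docs = [
    "EventCode 4625 indicates a failed Windows logon attempt.",
    "The stats command calculates aggregate statistics over search results.",
]
doc_embeddings = retriever.encode(docs, convert_to_tensor=True,
                                  normalize_embeddings=True)

query = "Which event code means a failed login?"
query_embedding = retriever.encode(query, convert_to_tensor=True,
                                   normalize_embeddings=True)

hits = util.semantic_search(query_embedding, doc_embeddings, top_k=1)
print(docs[hits[0][0]["corpus_id"]])  # most relevant document for the reader
```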
Dogfooding is common practice at Splunk, and Splunk employees are the best positioned to teach our model what they know. Taking inspiration from Databricks Dolly 2.0, we developed an internal web portal and let employees interact with our models and provide feedback. Model predictions are anonymized, and users are able to rank these predictions as well as suggest corrections. Their feedback is stored in Amazon RDS and we integrated these preferences into our training loop.
A standard metric for automatic evaluation of machine translation systems for natural language is BLEU, a corpus-level metric based on a modified n-gram precision measure between the predicted translation and reference translation(s). It is not well suited to code generation, since changing even one character can turn flawless code into code that doesn’t execute. That’s why last year we also evaluated predicted queries using exact string matching. But exact string matching overestimates false negatives, since there are multiple ways to formulate a correct query. We also needed metrics to evaluate the newest capabilities of the assistant.
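For reference, both automatic metrics are cheap to compute. The sketch below shows corpus-level BLEU via sacrebleu alongside exact string matching on a toy pair of predictions; the whitespace normalization here is a simplification.

```python
# Sketch: corpus BLEU (via sacrebleu) and exact string matching on toy data.
import sacrebleu

predictions = [
    "index=wineventlog EventCode=4625 | stats count by user",
    "index=web status=500 | timechart count",
]
references = [
    "index=wineventlog EventCode=4625 | stats count by user",
    "index=web status=500 | stats count by _time",
]

# sacrebleu expects a list of reference streams, one stream per reference set.
bleu = sacrebleu.corpus_bleu(predictions, [references])

exact = sum(p.strip() == r.strip()
            for p, r in zip(predictions, references)) / len(predictions)

print(f"BLEU: {bleu.score:.1f}, exact match: {exact:.0%}")
```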
We attempted human evaluation of each prediction, but found this to be too slow and arduous. Human evaluations are also biased: for example, they vary depending on who is evaluating and their level of expertise. Recent works such as Chatbot Arena have been using Elo scoring, the rating system widely adopted in chess and other competitive games, to benchmark LLMs against each other. LLMs act as players in randomized pairwise comparisons, and relative performance is inferred from wins, losses, and draws against other players.
We initially opted for Elo scoring using an LLM as a judge. Using an LLM as a judge also exhibits biases; however, these biases can be calibrated. An LLM evaluation can be sensitive to the order in which the candidate answers to be ranked are shown to the LLM. For pairwise comparisons, this position bias can be calibrated by also evaluating the reverse ordering of the two candidate answers and averaging the scores from the two orderings. The evaluation is also more consistent when the LLM is required to provide multiple pieces of evidence (i.e., correct syntax, appropriate search commands, concise responses, no hallucination, etc.) to support its verdict.
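In code, the position-bias calibration amounts to judging every pair in both orders and averaging, as in the sketch below. Here judge() is a hypothetical wrapper around the judging LLM and its prompt, not an actual API.

```python
# Sketch of pairwise LLM judging with position-bias calibration.
# `judge` is a hypothetical wrapper around the judging LLM, not a real API.
def judge(question, answer_a, answer_b):
    """Return 1.0 if the judge prefers answer A, 0.0 if it prefers B,
    and 0.5 for a tie. The underlying prompt asks for evidence (syntax,
    commands used, hallucinations, ...) before the verdict."""
    raise NotImplementedError

def calibrated_comparison(question, answer_a, answer_b):
    # Judge the same pair in both presentation orders and average the
    # results so that neither position is systematically favored.
    forward = judge(question, answer_a, answer_b)
    backward = 1.0 - judge(question, answer_b, answer_a)
    return (forward + backward) / 2
```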
Elo scoring is in practice very sensitive to the order in which comparisons are made. Additionally, since it is based on pairwise comparisons, comparing N models against each other requires on the order of N² comparisons per evaluation input (two orderings of every pair), which is prohibitively expensive. We decided instead to evaluate each model against the expected output: we ask our LLM judge directly for a score from 0 to 10 and report each model's performance relative to the expected output.
| Model | Relative Performance Against Expected (10 trials) |
| --- | --- |
| StarCoder (2023) | 95.5% (1.2) |
| CodeT5-small (2023) | 77.7% (1.2) |
| CodeT5-small (2023, 8-bit quantized) | 70.06% (1.3) |
| CodeT5-small (2022) | 42.1% (1.5) |
To evaluate the question answering system, we randomly sampled ~2,500 documents from our corpus and generated one question for each document. The retriever had a precision of 94%. Looking at the top-1 retrieved document: in 70% of cases it is the document the question was generated from, and in 88% of cases it contains the answer to the question. This evaluation approach is somewhat biased, since the questions are generated from the corpus documents themselves; we’ll account for this bias in future work.
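As a simple illustration, the top-1 check reduces to comparing the best retrieved document against the document each synthetic question was generated from. In the sketch below, retrieve() is a hypothetical wrapper around the fine-tuned retriever index.

```python
# Sketch of the top-1 retrieval evaluation over synthetic (question, source
# document) pairs. `retrieve` is a hypothetical wrapper around our index.
def top1_source_accuracy(eval_pairs, retrieve):
    hits = 0
    for question, source_doc_id in eval_pairs:
        top_doc_id = retrieve(question, top_k=1)[0]
        hits += int(top_doc_id == source_doc_id)
    return hits / len(eval_pairs)  # ~0.70 in our evaluation
```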
We hope you enjoyed hearing about our process of building an AI assistant. We still have more research to do: for example, incorporating customer environment data and supporting other programming languages we use at Splunk, such as SignalFlow. Due to current limitations, only the codet5-small version of the Splunk AI Assistant can be shipped on Splunkbase; we are working on delivering the full experience to our customers.
If you have any questions, please feel free to reach out to us. Thank you!
This blog was co-authored by Julien Veron Vialard, Robert Riachi, Abe Starosta, and Om Rajyaguru.
Julien Veron Vialard is a Senior Applied Scientist at Splunk, where he has been working on training and deploying language models for code generation, question answering, and named entity recognition. He has experience in conducting research and collaborating with product teams. He received his M.S. in Computational and Mathematical Engineering from Stanford University in June 2021, where his research focused on convolutional neural nets for medical imaging. Prior to Stanford, Julien interned at quantitative trading firms.
Robert Riachi is an Applied Scientist at Splunk working on training and deploying LLMs for conversational agents and code generation. Prior to his time at Splunk, he built generative ML models at Bloomberg, and he completed his Bachelor of Mathematics at the University of Waterloo, double majoring in Computer Science and Statistics.
Abe Starosta is a Senior Applied Scientist at Splunk, where he works on natural language processing and anomaly detection. Prior to Splunk, Abraham was an NLP engineer at high-growth technology startups like Primer and Livongo, and he interned at Splunk in 2014. He completed his B.S. and M.S. in Computer Science at Stanford, where his research focused on weak supervision and multitask learning.
Om Rajyaguru is an Applied Scientist at Splunk working primarily on time series clustering problems, along with methods to fine-tune and evaluate large language models for code generation tasks. He received his B.S. in Applied Mathematics and Statistics in June 2022; his undergraduate research focused on multimodal learning and low-rank approximation methods for deep neural networks.
Special thanks to Vedant Dharnidharka (Director of Engineering - Machine Learning) and the entire ML team for their encouragement, help, and ideas.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. It offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.