Splunk Enterprise and Splunk Cloud Platform, along with the premium products built upon them, are open platforms that allow third-party products to query data within Splunk for further use case development. In this blog, we will cover using Amazon SageMaker as the ISV product that consumes data from Splunk to extend a fraud detection use case and predict future risk scores.
A little while back, I wrote about how Splunk can be used to detect fraud efficiently by aggregating the risk scores of violated rules into a risk index. In short, Splunk can detect different types of nefarious activity, assign risk scores to that activity, and then compare the summarized risk scores for an entity (customer, user, account ID, etc.) against a threshold. If the sum of the risk scores for an entity exceeds a user-defined threshold, we can reasonably assume that the entity is committing fraud, or financial crime in the broader sense. The Splunk report below illustrates this.
Splunk Report Showing Accumulated Risk Scores per Entity
Notice the user-defined threshold line. Any summarized risk score for this time period that is over the threshold can be viewed as fraud, because we have accumulated risk scores from different activities for an entity that together exceed our fraud threshold. This avoids the false positives that come with implementations acting on a single rule violation and provides a high-fidelity fraud judgment. It is the modern way we detect fraud at Splunk, whether our customers are using Splunk Enterprise Security with risk-based alerting (RBA and the Splunk App for Fraud Analytics) or building their own risk framework within Splunk Enterprise or Splunk Cloud Platform. A minimal sketch of this accumulation appears below.
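To make the idea concrete, here is a minimal Python sketch of that accumulation, assuming hypothetical field names (entity, rule, risk_score) and an illustrative threshold; in practice this aggregation is performed by Splunk searches over the risk index rather than by application code.

```python
from collections import defaultdict

# Hypothetical risk events as they might be read from a Splunk risk index:
# each event carries the entity and the risk score of the rule it violated.
risk_events = [
    {"entity": "acct-1001", "rule": "excessive_login_failures", "risk_score": 30},
    {"entity": "acct-1001", "rule": "new_device_high_value_transfer", "risk_score": 35},
    {"entity": "acct-2002", "rule": "excessive_login_failures", "risk_score": 20},
]

THRESHOLD = 58  # user-defined fraud threshold (illustrative value)

# Accumulate risk scores per entity over the reporting period.
totals = defaultdict(int)
for event in risk_events:
    totals[event["entity"]] += event["risk_score"]

# Entities whose accumulated score exceeds the threshold are flagged for review.
flagged = {entity: score for entity, score in totals.items() if score > THRESHOLD}
print(flagged)  # {'acct-1001': 65}
```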
There is a second small arrow highlighted in the report. What is it for? Consider the entities whose accumulated risk scores are close to the threshold but do not exceed it. Are they on the verge of committing fraud? They may be, but in this approach we will not account for them because they are below the threshold. One way to solve this is to lower the threshold. That would catch the boundary outliers, but every time the threshold is lowered, the chance of false positives increases. Too many false positives put an unnecessary burden on a company's customers and its fraud department.
Another way to solve this is to use machine learning or AI to predict that a boundary risk score may cross the threshold in the near future, and to act on the suspected fraudster now rather than wait for more violations. As a simple example, suppose the threshold is 58 and an entity has accumulated risk scores of 54, 55, and 56 over a three-day period. A human can spot the upward trend and react to it, but with hundreds of thousands of entities this becomes a much harder problem to solve by simply looking at a series of reports. The premise of this article is to use a technique to predict that an entity will become a fraudster based on past risk scores that have not yet crossed the fraud threshold.
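As a rough illustration of that idea, the sketch below fits a simple linear trend to the 54, 55, 56 series from the example and flags the first forecasted day that crosses the threshold of 58. This is only a hand-rolled extrapolation to show the reasoning; the actual demo used Amazon SageMaker, as described later.

```python
import numpy as np

# Accumulated daily risk scores for one entity (values from the example above).
days = np.array([0, 1, 2])
scores = np.array([54, 55, 56])
THRESHOLD = 58

# Fit a simple linear trend (one member of the linear regression family).
slope, intercept = np.polyfit(days, scores, deg=1)

# Forecast the next few days and flag the first day the trend crosses the threshold.
for day in range(3, 8):
    forecast = slope * day + intercept
    print(f"day {day}: forecast {forecast:.1f}")
    if forecast > THRESHOLD:
        print(f"Entity is predicted to exceed the threshold on day {day}")
        break
```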
Before discussing how to predict risk scores in the near future, let's recall how this data is created and collected in Splunk.
Flow of Risk Scores
Notice that scheduled searches run the fraud detection rules against transaction data to calculate risk scores, and that the risk scores, along with metadata, are stored in a summary index. At that point, a native Splunk app or a third-party application can read the data from the risk index for further analysis.
In this setting, we are going to send data from Splunk to Amazon SageMaker as the third-party product and use its advanced techniques and machine learning to predict risk scores. You may point out that the free Splunk Machine Learning Toolkit (MLTK) can do similar things, and you would be correct. However, one reason to involve Amazon SageMaker is that the user may already be an Amazon Web Services customer who is familiar with how it works, as it is built for the citizen data scientist. Using Splunk as a data aggregator and analytics engine, and then allowing other tools to do more with that data, makes for a compelling open system and gives our joint customers a choice in how they solve a machine learning problem.
For this exercise, Splunker Brett Roberts got up to speed quickly on the basic techniques of Amazon SageMaker to show a demo at this year's AWS re:Invent; in fact, Splunk showcased this use case at two theater sessions at the conference. The basic idea was to use a scheduled AWS Lambda function to query aggregated risk scores (and metadata such as timestamps, rule names, and entity names) from Splunk via the REST API and place the results into Amazon S3 buckets. Amazon SageMaker would then read the S3 buckets to build a model that predicts future risk scores per entity. The diagram below shows this approach, and a sketch of the Lambda function follows it.
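A hedged sketch of such a Lambda function is shown below. The Splunk endpoint, authentication token, bucket name, object key, and SPL search are all hypothetical placeholders, and the requests library is assumed to be packaged with the deployment; only the general pattern, querying the Splunk search REST API and writing the results to S3, reflects the approach described above.

```python
import json
import os

import boto3
import requests  # assumed to be packaged with the Lambda deployment

SPLUNK_URL = "https://example.splunkcloud.com:8089"  # hypothetical Splunk management endpoint
SPLUNK_TOKEN = os.environ["SPLUNK_TOKEN"]            # Splunk authentication token
BUCKET = "fraud-risk-scores"                         # hypothetical S3 bucket

# Hypothetical SPL that aggregates risk scores per entity per day from a risk summary index.
SEARCH = (
    "search index=risk earliest=-30d@d "
    "| bin _time span=1d "
    "| stats sum(risk_score) as total_risk by entity, _time"
)

s3 = boto3.client("s3")

def handler(event, context):
    # Run a blocking export search against the Splunk search REST API.
    response = requests.post(
        f"{SPLUNK_URL}/services/search/jobs/export",
        headers={"Authorization": f"Bearer {SPLUNK_TOKEN}"},
        data={"search": SEARCH, "output_mode": "json"},
        timeout=300,
    )
    response.raise_for_status()

    # The export endpoint streams one JSON object per line; keep only the result rows.
    rows = []
    for line in response.text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        if "result" in obj:
            rows.append(obj["result"])

    # Write the aggregated risk scores to S3 for Amazon SageMaker to consume.
    s3.put_object(
        Bucket=BUCKET,
        Key="risk-scores/latest.json",
        Body=json.dumps(rows).encode("utf-8"),
    )
    return {"rows_written": len(rows)}
```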
Flow of Data From Splunk Cloud to AWS SageMaker
The box on the right is AWS, which may be the same cloud region as the Splunk Cloud Platform instance, in which case there would be no egress cost to send data from Splunk Cloud Platform to AWS. Once the data is collected, Amazon SageMaker is used to create models and do numerical forecasting with the time series data. The diagram below shows the high-level steps for using Amazon SageMaker to predict future risk scores in our use case.
The most interesting steps above are steps 4 and 5: after creating the model, we can predict risk scores. Let's dive deeper into step 4, as it exemplifies the use case.
The line on the graph represents predicted values. Notice that some of the actual values (the dots) are just below the prediction. These are the entities that may turn out to be fraudsters if their predicted risk score is higher than our threshold value. That is the point of this blog.
What algorithm was used to predict summarized fraud scores per entity? Several algorithms in the linear regression family were used to predict the fraud score from the provided attributes; neural networks and deep learning could also have been used here with a larger data set. A sketch of training such a regressor in Amazon SageMaker appears below.
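For illustration, the sketch below shows how a training job for SageMaker's built-in Linear Learner algorithm might be set up against the risk scores exported to S3. The bucket paths, IAM role ARN, instance types, feature count, and CSV layout (label in the first column followed by feature columns) are assumptions for the sketch, not the exact configuration used in the demo.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical execution role

# SageMaker's built-in Linear Learner algorithm, used here as a regressor
# over per-entity risk score features exported from Splunk.
image = image_uris.retrieve("linear-learner", region)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://fraud-risk-scores/model-output/",  # hypothetical bucket
    sagemaker_session=session,
)

# Training CSVs are assumed to have the next-period risk score as the first
# column (the label), followed by four feature columns such as recent daily scores.
estimator.set_hyperparameters(predictor_type="regressor", feature_dim=4, mini_batch_size=100)

estimator.fit({"train": TrainingInput("s3://fraud-risk-scores/train/", content_type="text/csv")})

# Deploy an endpoint and request a prediction for one entity's recent scores.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=CSVSerializer(),
)
print(predictor.predict("54,55,56,57"))
```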
One interesting takeaway from this exercise was learning which fraud detection rule contributed most to the accumulated risk scores per entity. In our case it was “Excessive login failures followed by a successful login.” Knowing this, when calculating the total risk score per entity, we may want to multiply that particular rule's risk score by a weight such as 1.5 to give it 50% more importance, since this detection rule has been shown to contribute to fraud more than others; a small sketch of such weighting follows.
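As an illustration, the weighting could look like the following Python sketch, where the rule names, scores, and weight table are hypothetical.

```python
# Hypothetical per-rule weights: the rule that historically contributes most to
# confirmed fraud gets 50% more importance when the total is computed.
RULE_WEIGHTS = {
    "excessive_login_failures_then_success": 1.5,
}

def weighted_total(events):
    """Sum risk scores for one entity, applying per-rule weights (default 1.0)."""
    return sum(e["risk_score"] * RULE_WEIGHTS.get(e["rule"], 1.0) for e in events)

entity_events = [
    {"rule": "excessive_login_failures_then_success", "risk_score": 20},
    {"rule": "new_payee_large_transfer", "risk_score": 30},
]
print(weighted_total(entity_events))  # 20 * 1.5 + 30 = 60.0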
One last thing to note is that Amazon SageMaker works with the Open Neural Network Exchange (ONNX), a common format for machine learning models. If the model used in Amazon SageMaker is exported to ONNX format, Splunk's Machine Learning Toolkit (MLTK) can import it and run inference on it. This degree of interoperability lets users familiar with Splunk's MLTK continue working with models developed in other frameworks. A final point is that the free Splunk App for Data Science and Deep Learning (DSDL) can also be used by data scientists to build even more comprehensive use cases with deep learning models, giving the Splunk user a variety of choices for their solutions.
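As an example of that interoperability, the sketch below exports a scikit-learn linear regressor to ONNX using the skl2onnx package. The toy data is made up, and the exact export path from a SageMaker-trained model will depend on the algorithm and framework used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Toy training data: three recent daily risk scores as features, next-day score as label.
X = np.array([[54, 55, 56], [40, 41, 43], [30, 30, 31]], dtype=np.float32)
y = np.array([57, 45, 31], dtype=np.float32)

model = LinearRegression().fit(X, y)

# Convert to ONNX; initial_types declares a float input with three features.
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 3]))])
with open("risk_score_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```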
We have shown a way to bring fraud detection data into a third-party product and let that software add value to the initial use case. This is not because Splunk lacks the capabilities to handle these advanced use cases, but because a customer may want to use the product of their choice to further enhance a use case initiated in Splunk. If you are a user of Amazon SageMaker, please consider using the data that is already in your Splunk Enterprise or Splunk Cloud Platform instance on AWS for your forward-thinking fraud detection use cases. If you have not yet used Splunk for fraud detection, please start here. We realize that Splunk users can also leverage Splunk MLTK for their needs, but the openness of Splunk as a data platform allows our customers to use a variety of means in their software ecosystem to reach their desired outcome. Fraud detection and prevention is not a monopoly held by any one product, and the ability to combine multiple approaches and vendors may lead to better detections and more choices for our customers.
I would like to thank AWS for partnering with Splunk to develop this use case and for letting us present it at AWS re:Invent. In particular, thanks to Scott Mullins, Nick Dimtchev, and Dan Kasun at AWS for supporting this work.