According to the Association of Certified Fraud Examiners (ACFE), businesses lose over $3.5 trillion to fraudsters each year. The ACFE's 2016 Report to the Nations on Occupational Fraud and Abuse states that proactive data monitoring and analysis is among the most effective anti-fraud controls: organizations that proactively analyze their data experience frauds that are up to 54% less costly and 50% shorter than organizations that do not monitor and analyze data for signs of fraud. As fraudsters continue to adapt and adopt new methods, it is important to leverage machine learning and data science algorithms to fight fraud. Detecting anomalies and outliers through machine learning, adaptive thresholds and other advanced techniques is the next wave in fraud detection and prevention. So why not carry out those advanced analytics with the help of Splunk’s Data-to-Everything platform and help clients reduce the impact of fraud? Even if the ACFE's figures are exaggerated, the market opportunity is huge.
Here at Splunk, we have helped many customers across a range of industries in their fight against fraud: whether that is helping to detect financial fraud as described here by Haider, or supporting the fight against opioid abuse by monitoring for fraudulent diversion of controlled substances. More recently we have also shown how you can detect fraudulent credit card transactions using our new Splunk Machine Learning Environment (SMLE). We also offer the free Splunk Security Essentials for Fraud Detection app, which covers many use cases such as healthcare insurance billing and payments.
In recent months, Splunk has been approached by several multinational sports betting companies to help them streamline their fraud prevention and detection processes and provide their Revenue Assurance teams with a 360º view of fraud.
One requirement we came across several times was that clients wanted a sports betting fraud risk scoring model so they could detect fraud quickly. For that purpose, I designed a data pipeline that builds a sports betting fraud risk scoring model based on anomaly detection algorithms built on probability density functions, powered by Splunk’s Machine Learning Toolkit.
This article showcases a solution that can be built with Splunk in very little time using your clients’ existing data.
Credit to Greg Anslie and Raúl Marín for their valuable help in the design of the pipeline and their wise insights while I was creating this content. Thank you guys!
The plan to carry out the solution setup is as follows:
To accelerate time to value in the proposed solution, Splunk indexes data exported from the relational databases that contain data about the different sports events. This data was stored through a traditional batch ETL process that transformed the sources' raw data into traditional SQL-type tables. At the end of this article, a set of next steps is suggested, including accessing the data sources directly without an intermediate SQL database.
A common practice when developing ML models is to divide your data set into two pieces: one for training and another for testing the model. In this case, let’s imagine that we have 12 months’ worth of data and that we will use the first 11 months for model training and the last month for model testing.
The data pipeline that performs data indexing, transformation, ML model training and ML model application, and finally provides dashboarding and investigation capabilities, is as follows:
Data pipeline
Note that once the data enters Splunk, the pipeline depicts transformations of the data itself, not the underlying hardware/software architecture. To perform the various transformations on the data (indexing, enrichment, summarizing, ML), only the indexer and search head components of Splunk’s highly scalable architecture are required, and only the SPL language is needed across the data pipeline. This makes the solution simpler to build and maintain than data pipelines assembled from many different pieces of software.
In the following figure you can see the part of the pipeline to which this section is dedicated:
Data pipeline: data ingestion
The data ingestion will be carried out in Splunk from a set of sports betting database tables. There are many ways to perform this, but the most common one is to use the Splunk DB Connect app, which helps you quickly integrate structured data sources with your Splunk real-time machine data collection. The database import functionality of Splunk DB Connect allows you to import tables, rows, and columns from a database directly into Splunk Enterprise, which indexes the data. You can then analyze and visualize that relational data from within Splunk Enterprise just as you would the rest of your Splunk Enterprise data.
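As a quick illustration, a one-off import can also be run straight from the search bar with DB Connect's dbxquery command; the connection name, table name and index below are hypothetical placeholders:

| dbxquery connection="sports_betting_db" query="SELECT * FROM bets"   ```pull the bets table through DB Connect```
| collect index=sports_betting   ```index the results so they can be searched like any other Splunk data```

For continuous ingestion you would instead configure a scheduled database input (for example with a rising column) from the DB Connect UI.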
In the following figure you can see the part of the pipeline to which this section is dedicated:
Data pipeline: correlation, enrichment and KPIs calculation
Each data source will be an SQL-type table. After analyzing the relationships between the tables, the correlations will be performed. The following figure shows an example of the relationships between fields of different SQL-type tables.
Sports betting data model
These correlations will be made entirely in Splunk through basic SPL commands. As several fields need to be correlated across several tables, the chosen option is to use the eventstats and stats commands, relating fields from one table to another with the eval command. The SPL language is perfectly suited to correlating time series, and far fewer lines of code are needed than if SQL were used to perform these correlations. If you are interested in the code structure for performing the correlation, have a look at this example taken from Splunk Answers, which correlates different fields from 3 tables:
Assumptions:
Sample code:
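Since every client's schema is different, the following is only a minimal sketch of that eventstats/stats/eval pattern. The sourcetypes (bets, sports_events, markets), the key fields (event_id, id) and the other field names are hypothetical:

(index=sports_betting sourcetype=bets) OR (index=sports_betting sourcetype=sports_events) OR (index=sports_betting sourcetype=markets)
| eval event_key=coalesce(event_id, id)   ```align the differently named keys of the three tables```
| eventstats values(league) as league values(event_name) as event_name by event_key   ```spread event attributes onto every record sharing the key```
| stats count(eval(sourcetype="bets")) as bet_count sum(stake) as total_staked dc(market_name) as different_markets values(league) as league values(event_name) as event_name by event_key   ```aggregate per sports event```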
KPIs can be calculated using the eval command; below is an example of how a subset of them might be derived.
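A minimal sketch, assuming the hypothetical per-event fields produced by the correlation step above plus a few more (won_bets, awards_distributed); the exact KPI definitions are illustrative:

| eval winning_bets_pct = round(100 * won_bets / bet_count, 2)   ```% WINNING BETS```
| eval gross_win = total_staked - awards_distributed   ```GROSS WIN (€)```
| eval take_pct = round(100 * gross_win / total_staked, 2)   ```TAKE (%)```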
Data can be enriched with useful information, such as betting room names and the geolocation of the betting rooms provided by the client, using Splunk’s lookup functionality, which allows you to enrich your event data by adding field-value combinations from lookup tables:
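A minimal sketch, assuming a hypothetical lookup definition named betting_rooms keyed on room_id:

| lookup betting_rooms room_id OUTPUT room_name room_city room_lat room_lon   ```add betting room name and geolocation to each event```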
From the correlated data, a set of KPIs like the following can be constructed per sports event (note that this is just an example of interesting KPIs):
# BETS
TAKE (%)
% IDENTIFIED BETS
# WON BETS
% WINNING BETS
# BETS > 100/200€
# LOST BETS
% € WINNING BETS
# DIFFERENT FORECASTS
TOTAL BETTED (€)
% BETS HIGH FEES
# DIFFERENT MARKETS
AWARDS DISTRIBUTED (€)
% HIGH FEES WON BETS
# HIGH POTENTIAL AWARD OBTAINED BETS
GROSS WIN (€)
% LIVE BETS
% BETS MADE WITHOUT SUPERVISION
Sports events KPIs
The correlated, enriched data and its KPIs at the sports event level should be transferred to a new summary index, using the collect command, to accelerate the consumption of analytics by dashboards and machine learning algorithms. The index with the raw data remains intact, since it may still be needed for investigative searches.
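A minimal sketch of such a search, reusing the hypothetical index and field names from the previous steps (sports_betting for the raw data, sports_betting_summary for the enriched results):

index=sports_betting (sourcetype=bets OR sourcetype=sports_events OR sourcetype=markets)
| eval event_key=coalesce(event_id, id)
| stats count(eval(sourcetype="bets")) as bet_count sum(stake) as total_staked values(league) as league by event_key   ```correlation, enrichment and KPI calculation as in the previous steps```
| collect index=sports_betting_summary   ```write the per-event records to the summary index; the raw index stays intact```

Scheduling this search (for example daily) keeps the summary index continuously up to date.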
In the following figure you can see the part of the pipeline to which this section is dedicated:
Data pipeline: fraud scoring model training
Now we will create a fraud risk scoring model based on anomaly detection in the different KPIs calculated in the previous section. To do that we will take 11 months of data and train the anomaly detection model. The ML tool to be used will be Splunk's Machine Learning Toolkit. The anomaly detector will be created for each KPI and each league based on its probability density function.
The probability density function determines the probability of a value falling in a certain range based on past information; essentially, it generates a baseline for your data. This makes it a great tool for finding anomalies, as it allows you to quickly determine whether data sits in an expected range or not. You can find out more about this algorithm in this blog about finding anomalies with Splunk.
Sample code to generate the baseline for each KPI, taking 11 months of data, using the fit command and reading from the summary index created in the previous section, looks something like this (only a small subset of the KPIs is included for simplicity):
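A minimal sketch of that training search, assuming the MLTK DensityFunction algorithm and the hypothetical summary index, KPI fields and model names introduced above:

index=sports_betting_summary earliest=-12mon@mon latest=-1mon@mon   ```the first 11 months; the last month is held out for testing```
| fit DensityFunction bet_count by "league" into df_bet_count   ```one density function per league for # BETS```
| fit DensityFunction total_staked by "league" into df_total_staked   ```one density function per league for TOTAL BETTED (€)```
| fit DensityFunction winning_bets_pct by "league" into df_winning_bets_pct   ```one density function per league for % WINNING BETS```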
Note that each KPI modeled with the fit command will in fact create one submodel per league. That makes sense because each league has different betting patterns, therefore different probability density functions and ultimately different baselines. More split-by parameters like "League" can be added for finer-grained behaviour per KPI, but keep in mind the balance between computing efficiency and granularity.
Once we have a baseline for each KPI and each league, we can detect anomalies. In the next step we will create a score based on the number of anomalies per event, which will represent the fraud risk. The idea behind this is simple: the more anomalies in a sports event's KPIs, the higher the risk of fraud.
In the following figure you can see the part of the pipeline to which this section is dedicated:
Data pipeline: fraud scoring model application
At this point, the anomaly detector will be tested with 1 month of data not used in its training. For each event, a score will be generated that accounts for its fraud risk by adding the anomalies detected in the different KPIs of the event. Let’s see some examples to make it clearer:
To fine-tune the scoring model, each anomalous KPI should have a different weight based on its relative impact on fraud risk. For example, an anomaly in # BETS could have a weight of 1.5 and an anomaly in # BETS > 100/200€ a weight of 2. As a first approximation, all KPIs are given a weight of 1.
Sample code that uses the apply command on the remaining 1 month of data to test the model (again, only a small subset of the KPIs is included for simplicity):
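A sketch of that test search, under the same assumptions as the training sketch above:

index=sports_betting_summary earliest=-1mon@mon latest=now   ```the held-out month that was not used for training```
| apply df_bet_count
| apply df_total_staked
| apply df_winning_bets_pct   ```each apply adds an IsOutlier(<KPI>) field set to 1 when the value is anomalous for that league```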
Sample code that creates the fraud risk score from the outputs of the apply command, using the eval command:
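A minimal sketch, summing the IsOutlier fields produced by the apply commands above with a weight of 1 per KPI as a first approximation (adjust the multipliers to weight KPIs differently):

| eval fraud_risk_score = 1.0*'IsOutlier(bet_count)' + 1.0*'IsOutlier(total_staked)' + 1.0*'IsOutlier(winning_bets_pct)'   ```one point per anomalous KPI```
| sort - fraud_risk_score   ```highest-risk sports events first```
| table event_key event_name league fraud_risk_score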
In the following figure you can see the part of the pipeline to which this section is dedicated:
Data pipeline: control dashboards generation
Through Splunk's dashboarding capabilities, two dashboards have been generated:
Some example snapshots of the dashboards that I generated with simulated data:
Sports Betting Fraud Dashboard:
Dashboard detail of KPIs by sports event:
The benefits proven during the exercise have been the following:
On the other hand, the proposed solution is not intended to be the final production solution, but a first setup to accelerate time to value. As explained above, Splunk indexes data exported from relational databases that contain data about the different sports events, data that was stored through a traditional batch ETL process which transformed the sources' raw data into traditional SQL-type tables. As a consequence, we do not benefit from having the raw source data available in Splunk for building new use cases, nor from Splunk’s real-time indexing.
What I would recommend to continue maturing this initial setup:
Happy Splunking,
Lucas