In my previous Link Analysis blogs, "Visual Link Analysis with Splunk: Part 1 - Data Reduction" and "Visual Link Analysis with Splunk: Part 2 - The Visual Part," I used techniques that work well when we have a controlled data set. However, as we know, real data can be messy.
When analyzing links in fraud data, the data can be very noisy. Let’s say we want to use IP addresses for link analysis in the Splunk platform. It is not unusual for two people to share an IP address:
In this case, I really don’t want to view or review every pair of people who share only an IP address. This is an example of noise within link analysis, and although there may be real fraud here, an analyst wants to focus on bigger fish. I like to call a single link between two users a “binary link.”
Another issue that can cause noise in link analysis is when we’re working with a really big data set. There may be a lot of these binary links that can overwhelm our visualization tool.
Using a data set containing 500,000 events and running our eventstats SPL from part two, I have 33,429 events that are “binary links.” In this subset of events, two usernames are connected by exactly one field, which can be a phone number, password, IP address, account, or other information. Remember we are using eventstats to count duplicates, so if a field value has no duplicates, the eventstats count for that field is 1, and anything with a duplicate (a link) has a count greater than 1. If we are searching 4 fields for links, a username with no links has a total of 4, and a username with a total of 5 has exactly one link: one field value shared with one other user.
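If the counting is easier to follow outside of SPL, here is a minimal Python sketch of the same arithmetic. It is only a model of what `eventstats count ... by <field>` does, not Splunk code, and the usernames and field values are made up for illustration.

```python
from collections import Counter

# The four fields we link on, matching the SPL in this post.
LINK_FIELDS = ["phone", "ip_address", "password", "destination"]

# Hypothetical sample events (not from the real data set).
events = [
    {"username": "alice", "phone": "555-0001", "ip_address": "10.0.0.1",
     "password": "pw1", "destination": "acct1"},
    {"username": "bob", "phone": "555-0002", "ip_address": "10.0.0.1",
     "password": "pw2", "destination": "acct2"},
    {"username": "carol", "phone": "555-0003", "ip_address": "10.0.0.3",
     "password": "pw3", "destination": "acct3"},
]


def add_totals(rows):
    """Count how often each field value occurs, then sum the four counts per event."""
    counts = {f: Counter(r[f] for r in rows) for f in LINK_FIELDS}
    return [
        {**r, "total": sum(counts[f][r[f]] for f in LINK_FIELDS)}
        for r in rows
    ]


totals = {r["username"]: r["total"] for r in add_totals(events)}
print(totals)  # {'alice': 5, 'bob': 5, 'carol': 4}
```

Here alice and bob share only an IP address, so each gets a total of 5 (a binary link), while carol shares nothing and stays at the baseline of 4.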
While 33,000 may be an artificially high number, if we had only 1,000 binary links out of 500,000 or 1 million events, there would still be a lot of “noise” in our table and visualization.
Let’s work with a smaller data set (because I can’t visualize 33,000+ nodes) and see some issues when assuming this is a simple solution. Below we have our search where we have 4 fields to link on, and our comparison total is “greater than 4” (back to eventstats):
index="biglinks4"
| eventstats count as dupphone by phone
| eventstats count as dupip by ip_address
| eventstats count as duppass by password
| eventstats count as dupaccount by destination
| eval total = dupphone+dupip+duppass+dupaccount
| where total > 4
We have 12 binary groups of 2 users connected by only one attribute marked by the red arrows below:
When we zoom in on one section, you can see examples of 2 users who are only connected by an IP address:
Let’s see what happens when we simply change our search to only return those users with values greater than 5:
| eval total = dupphone+dupip+duppass+dupaccount
| where total > 5
While this looks great at first because we have no more binary links (I thought I had the solution), I realized I had actually removed useful data.
To see what went missing, let’s zoom back in on one group from the search where our total was “greater than 4” and we still had all those binary links:
One user, “Eykean,” has only one link to the larger group: the IP address 158.95.244.98. When we change our total from greater than 4 to greater than 5, we lose Eykean, so this user no longer appears in the diagram linked via IP address 158.95.244.98.
To understand why this happened, let’s look at the data. If we use our search from before (total greater than 4) and we also look at only the IP address connecting Eykean to others, we see the following values:
Since Eykean has only one link, his total is 5, so when we changed our search to return only totals greater than 5, we lost Eykean.
What I want to do is keep ALL entities that look like they are part of a binary link but are actually connected to an entity that belongs to a larger group.
To do this, we have to add additional criteria to our search and invoke eventstats again. This time we want to look for the max value of the total field for each linked element and tie that back to each user. For example, if a user like Eykean has a total of 5, but he is connected (via any field) to someone with a total greater than 5, then we keep Eykean. (The line numbers in the explanation below count the lines of this search from the top.)
index="biglinks4"
| eventstats count as dupphone by phone
| eventstats count as dupip by ip_address
| eventstats count as duppass by password
| eventstats count as dupaccount by destination
| eval total = dupphone+dupip+duppass+dupaccount
| where total > 4
| fields total, username, phone, password, ip_address, destination, total, dupphone, dupip, duppass, dupaccount
| eventstats max(total) AS summary_ip_total BY ip_address
| eventstats max(total) AS summary_phone_total BY phone
| eventstats max(total) AS summary_password_total by password
| eventstats max(total) AS summary_dest_total by destination
| where (summary_ip_total >5 AND total=5) or (summary_phone_total >5 AND total=5) or (summary_password_total > 5 AND total =5 ) or (summary_dest_total>5 and total=5) or total >5
I start by looking for ALL usernames with a link to anyone else (lines 2-7). I then invoke eventstats again (lines 9-12) to compute max(total) for each of the 4 fields we are using for link analysis. If we use “Eykean” as our example, we see the values 7, 5, 5, 5 for summary_ip_total, summary_phone_total, summary_password_total, and summary_dest_total respectively. This tells us that Eykean is connected to a user by IP address, and that user is part of a larger ring because their total is greater than 5. For phone, password, and destination the value is 5, and we know that “Eykean” has a total of 5, so “Eykean” is not connected to anyone else via those fields.
I then use the “where” conditions (line 13) to keep every user with a total greater than 5, plus only those “binary looking” users who belong to larger groups. These “binary looking” users must have a total of 5 (they have only one link) AND link to someone with a total greater than 5.
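To make the filter logic concrete, here is a small Python model of the two approaches: the naive “greater than 5” cut, and the version that uses the max total per field value to rescue single-link users attached to a larger group. This is a sketch of the logic only, not Splunk code. The usernames and values (`ring_a`, `hanger`, and so on) are hypothetical, and `hanger` plays the role Eykean plays in the real data.

```python
from collections import Counter

LINK_FIELDS = ["phone", "ip_address", "password", "destination"]

# Hypothetical events: (username, phone, ip_address, password, destination).
ROWS = [
    ("ring_a", "P1", "I1", "W1", "D1"),
    ("ring_b", "P1", "I2", "W1", "D2"),  # shares phone and password with ring_a
    ("hanger", "P3", "I1", "W3", "D3"),  # shares only an IP with ring_a
    ("pair_x", "P4", "I4", "W4", "D4"),
    ("pair_y", "P5", "I4", "W5", "D5"),  # true binary link with pair_x
    ("loner", "P6", "I6", "W6", "D6"),   # no links at all
]
EVENTS = [dict(zip(["username"] + LINK_FIELDS, row)) for row in ROWS]


def with_totals(rows):
    """Model of the four `eventstats count ... by <field>` steps plus the eval."""
    counts = {f: Counter(r[f] for r in rows) for f in LINK_FIELDS}
    return [{**r, "total": sum(counts[f][r[f]] for f in LINK_FIELDS)} for r in rows]


def naive_filter(rows):
    """Model of `where total > 5`: drops every single-link user."""
    return [r["username"] for r in with_totals(rows) if r["total"] > 5]


def refined_filter(rows):
    """Model of `where total > 4`, then max(total) per field value, then the final where."""
    linked = [r for r in with_totals(rows) if r["total"] > 4]
    # Highest total seen for each value of each field among linked events.
    max_by = {f: {} for f in LINK_FIELDS}
    for r in linked:
        for f in LINK_FIELDS:
            max_by[f][r[f]] = max(max_by[f].get(r[f], 0), r["total"])
    kept = []
    for r in linked:
        touches_larger_group = any(max_by[f][r[f]] > 5 for f in LINK_FIELDS)
        if r["total"] > 5 or (r["total"] == 5 and touches_larger_group):
            kept.append(r["username"])
    return kept


naive = naive_filter(EVENTS)
refined = refined_filter(EVENTS)
print(naive)    # ['ring_a', 'ring_b']
print(refined)  # ['ring_a', 'ring_b', 'hanger']
```

The true binary pair (`pair_x`, `pair_y`) is removed in both cases, but only the refined version keeps `hanger`, whose single IP link leads into the larger group.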
Combining this SPL with the SPL to create the link diagrams we get this:
Notice we have no binary groups:
Here is the same SPL code as above to make it easier to copy and paste for your environment – data reduction and visualization all in one:
index="biglinks4"
| eventstats count as dupphone by phone
| eventstats count as dupip by ip_address
| eventstats count as duppass by password
| eventstats count as dupaccount by destination
| eval total = dupphone+dupip+duppass+dupaccount
| where total > 4
| fields total, username, phone, password, ip_address, destination, total, dupphone, dupip, duppass, dupaccount
| eventstats max(total) AS summary_ip_total BY ip_address
| eventstats max(total) AS summary_phone_total BY phone
| eventstats max(total) AS summary_password_total by password
| eventstats max(total) AS summary_dest_total by destination
| where (summary_ip_total >5 AND total=5) or (summary_phone_total >5 AND total=5) or (summary_password_total > 5 AND total =5 ) or (summary_dest_total>5 and total=5) or total >5
| eval from=username, to=ip_address
| eval value=from, type="user"
| appendpipe [| eval from=to, value=to, to=NULL, type="laptop", color="blue"]
| appendpipe [ | where isnotnull(to) | eval from = from, to=phone
    | appendpipe [| eval from=to, value=to, to=NULL, type="phone-square", color="yellow"]
    | appendpipe [| where isnotnull(to) | eval from = from, to=password
        | appendpipe [| eval from=to, value=to, to=NULL, type="passport",color="red"]
        | appendpipe [| where isnotnull (to) | eval from = from, to=destination
            | appendpipe [| eval from=to, value=to, to=NULL, type="dollar-sign", color="green"]]]]
| table username, phone, password, ip_address, destination, total, dupphone, dupip, duppass, dupaccount, color, from, to, value, type
When we zoom in, we can see that our relationship now contains “Eykean” and we didn’t have to change our initial threshold of greater than 4 to get this “non-binary” attached user.
As with everything in Splunk, there are other ways to do link analysis. Another solution, mentioned in the .conf20 session I referenced in a previous post, is Sigbay Link Analysis. If you don’t want to reduce your data, and you don’t care about icons or colors, then it may be a good fit for you. Look for more to come on Sigbay Link Analysis from my colleague Gleb Esman. But here is a quick teaser screenshot:
If you made it through all three of these link analysis blogs, then you truly are a glutton for punishment. Hopefully this has been useful – and enjoyable.
As always, keep on Splunkin’!
----------------------------------------------------
Thanks!
Andrew Morris