Do you love URLs? I do! They are a great way to gain insight into behaviors, catch malware, and help classify what is going on in a network.
I also have a secret: I collect them. The more I have, the happier I am! So what's better than Splunk to analyze them?
This is the first in a series of posts on what one can do with URLs and Splunk. Please share your war stories in the comments, or anything you are doing with Splunk and URLs, so I can enrich the upcoming posts.
First, you need to grab the Alexa list, which contains the top 1 million URLs in a CSV file you can download.
We add the new data source to Splunk:
Splunk automagically discovers the CSV type, and we can start searching for our URLs right away.
Now we need an app to parse our URLs properly. Fortunately, Splunkbase has many:
If we start looking at our data, we can run a search such as
source="top-1m.csv.zip:./top-1m.csv" | rex field=_raw "\d+,(?<url>\S+)"
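To try the regex outside Splunk, here is a minimal Python sketch of the same extraction. It assumes each raw line has the form rank,domain, as the pattern implies. The sample rows and the function name extract_url are made up for illustration, and Python spells the named group (?P<url>...) where rex uses (?<url>...).

```python
import re

# Same pattern as the rex command, in Python's named-group syntax.
ROW_PATTERN = re.compile(r"\d+,(?P<url>\S+)")

def extract_url(raw_line):
    """Return the text after 'rank,' or None if the line does not match."""
    match = ROW_PATTERN.search(raw_line)
    return match.group("url") if match else None

# Hypothetical sample rows in the assumed rank,domain layout.
sample_rows = ["1,google.com", "2,facebook.com", "3,youtube.com"]
urls = [extract_url(row) for row in sample_rows]
print(urls)
```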
We create a field url using our regex, and then use the lookup to parse those URLs and extract useful new fields:
We can now look at the top count for domains without the attached TLD:
In this case, Google comes out on top with a count of 145. That means Google appears in the top 1 million most visited URLs multiple times under various TLDs, such as:
google.com, google.om, google.li, google.co.ls, google.so, google.co.uk, etc.
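The idea behind that count can be sketched in a few lines of Python. This is not the Splunk app's code: the SUFFIXES set is a tiny hand-picked stand-in for a real suffix list, and split_domain is a hypothetical helper. It strips the longest known suffix and counts what is left.

```python
from collections import Counter

# Tiny stand-in for a real suffix list (assumption, for illustration only).
SUFFIXES = {"com", "li", "so", "co.uk", "co.ls"}

def split_domain(host):
    """Split a host into (domain without TLD, TLD) using the longest known suffix."""
    labels = host.lower().split(".")
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SUFFIXES:
            return labels[i - 1], suffix
    # Fallback when no suffix is known: treat the last label as the TLD.
    if len(labels) >= 2:
        return labels[-2], labels[-1]
    return labels[0], ""

hosts = ["google.com", "google.li", "google.so",
         "google.co.uk", "google.co.ls", "facebook.com"]

domain_counts = Counter(split_domain(h)[0] for h in hosts)
tld_counts = Counter(split_domain(h)[1] for h in hosts)
print(domain_counts.most_common(1))
print(tld_counts.most_common(1))
```

Trying the longest suffix first matters: without it, google.co.uk would be counted under the domain co rather than google.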
If we now look at the top TLDs, it is easy to see com as a top TLD:
Among the extracted fields is "url_url_type", which can take values such as ipv6, ipv4, no_tld, unknown_tld, and mozilla_tld.
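To make those labels concrete, here is a rough Python sketch of how such a classification could work. It is my own guess at the logic, not the app's implementation: classify_host is a hypothetical name, KNOWN_SUFFIXES is a tiny stand-in for the real suffix list, and the rule for each label is an assumption based on the label names.

```python
import ipaddress

# Tiny stand-in for the real suffix list (assumption, for illustration only).
KNOWN_SUFFIXES = {"com", "co.uk", "li", "so"}

def classify_host(host):
    """Guess a url_type-style label for a host string."""
    # IP literals first; brackets are stripped so "[::1]" also parses.
    try:
        ip = ipaddress.ip_address(host.strip("[]"))
        return "ipv4" if ip.version == 4 else "ipv6"
    except ValueError:
        pass
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return "no_tld"
    # Any trailing run of labels found in the suffix list counts as known.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return "mozilla_tld"
    return "unknown_tld"

for host in ["192.168.0.1", "::1", "localhost", "google.com", "example.рф"]:
    print(host, classify_host(host))
```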
The mozilla_tld value only indicates that the TLD is present in the Public Suffix List. So whenever an entry is flagged as "unknown_tld" and is also in Alexa's top 1 million URLs, it starts to get interesting:
This is actually a TLD Romanized as rf, according to Wikipedia, and it does appear in the Mozilla Public Suffix List as follows:
// xn--p1ai (“rf”, Russian-Cyrillic) : RU
// http://www.cctld.ru/en/docs/rulesrf.php
рф
However, it does not have the same encoding there, which points to an improvement that could be made in the lookup. Adding a Unicode to Punycode conversion, perhaps?
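As a sketch of what such a conversion might look like, Python's built-in idna codec can translate between the two forms. The helper names to_punycode and from_punycode are mine, and this is only an illustration, not a patch for the lookup.

```python
def to_punycode(hostname):
    """Convert a Unicode hostname to its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")

def from_punycode(hostname):
    """Convert an ASCII (Punycode) hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")

print(to_punycode("рф"))          # the TLD as listed in the Public Suffix List
print(from_punycode("xn--p1ai"))  # the ASCII form from the comment above it
```

Normalizing both the suffix list and the incoming URLs to one form before comparing them would let the two encodings match.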
Splunk offers a variety of apps, some of which can help analysts get more out of the insight URLs provide. Happy Splunking!
----------------------------------------------------
Thanks!
Sebastien Tricaud