Most of the time, logs contain data that humans can easily recognize as fully or semi-structured information. Extracting that structure from log data is a necessary first step toward further, more interesting analysis. While it would be great to extract the structure from all log data automatically, Splunk cannot rival the brain’s performance at this time. It can, however, tap into your brain for help. Read on ...
Extract structured information (in field=value form) from unstructured or semi-structured log data. Note: in this post, “key” and “field” are used interchangeably to denote a variable name.
Splunk debug message (humans: easy, machine: easy)
12-03-2007 13:51:55.114 DEBUG SearchPipelinePerformance - processor=save queryid=_1196718714_619358 executetime=0.014secs
ideal structured information to extract:
processor=save
queryid=_1196718714_619358
executetime=0.014secs
Splunk tries to make it easy for itself to parse its own log files (in most cases).
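To make the "machine: easy" rating concrete, here is a minimal Python sketch of pulling the pairs out of the debug message above. It is not how Splunk does it internally; the pattern and the helper name `extract_kv` are assumptions made for illustration, and the pattern assumes that field names are made of word characters and that values never contain whitespace.

```python
import re

# Assumption: a field name is a run of word characters, and a value is
# everything after "=" up to the next whitespace character.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")


def extract_kv(line):
    """Return a dict of every name=value token found in the line."""
    return dict(KV_PATTERN.findall(line))


line = (
    "12-03-2007 13:51:55.114 DEBUG SearchPipelinePerformance - "
    "processor=save queryid=_1196718714_619358 executetime=0.014secs"
)

fields = extract_kv(line)
for name, value in fields.items():
    print(f"{name}={value}")
# processor=save
# queryid=_1196718714_619358
# executetime=0.014secs
```

The timestamp, log level and component name are skipped simply because they contain no equals sign.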
Output of the ping command (humans: easy, machine: medium)
64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=2.522 ms
ideal structured information to extract:
bytes=64
from=192.168.1.1
icmp_seq=0
ttl=64
time=2.522 ms
An interesting pattern to note here is that there is neither a consistent field-value delimiter nor a consistent field-value order. For the “from” field the authors chose a space as the delimiter, while for “icmp_seq”, “ttl” and “time” they chose the equals sign. For the “bytes” field they placed the name after the value (yes, they might also have intended it to mean bytes, the data unit), while for the rest they chose field name followed by field value. Admittedly, some might think the current format is prettier than the following consistent log line, which could easily be parsed by machines. (Who thought log files were optimized for prettiness!?)
bytes=64, from=192.168.1.1, icmp_seq=0, ttl=64, time=2.522 ms
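The difference between the two formats shows up clearly in code. The sketch below is only an illustration in Python: the pattern, the field list and the function names `parse_ping` and `parse_consistent` are all made up for this example. The real ping line needs a pattern written specifically for it, while the consistent line needs nothing more than two splits.

```python
import re

# Hand-written for this one ping format: each field needs its own rule
# because the delimiter and the name/value order change along the line.
PING_PATTERN = re.compile(
    r"(\d+) bytes from ([\d.]+): icmp_seq=(\d+) ttl=(\d+) time=([\d.]+ ms)"
)
PING_FIELDS = ("bytes", "from", "icmp_seq", "ttl", "time")


def parse_ping(line):
    """Parse the original ping output; return {} if the line does not match."""
    match = PING_PATTERN.search(line)
    if match is None:
        return {}
    return dict(zip(PING_FIELDS, match.groups()))


def parse_consistent(line):
    """Parse the consistent form: pairs separated by ', ', name and value by '='."""
    return dict(pair.split("=", 1) for pair in line.split(", "))


ping_line = "64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=2.522 ms"
consistent_line = "bytes=64, from=192.168.1.1, icmp_seq=0, ttl=64, time=2.522 ms"

print(parse_ping(ping_line))
print(parse_consistent(consistent_line) == parse_ping(ping_line))  # True
```

Both calls produce the same five fields, but only the second parser would keep working if a new field were added to the line. A generic pattern that looks only for name=value tokens would find just icmp_seq, ttl and time in the original line, and would lose the “ms” unit.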
NetScreen log (humans: medium, machine: hard)
%MD% %DD% 13:41:25 45.2.0.1 NOC-FWa: NetScreen device_id=NOC-FWa [Root]system-notification-00257(traffic): start_time="2006-05-11 13:40:23" duration=62 policy_id=41 service=Network Time proto=17 src zone=noc-mgt dst zone=noc-svcs ......
ideal structured information to extract:
device_id=NOC-FWa
start_time=2006-05-11 13:40:23
duration=62
policy_id=41
service=Network Time
proto=17
src zone=noc-mgt
dst zone=noc-svcs
This part of the NetScreen log line, …service=Network Time proto=17 src zone=noc-mgt dst zone=noc…, is a good example of the ambiguity that sometimes exists in log data. What is the correct value of service: “Network” or “Network Time”? What about the name of the next field: is it “Time proto” or just “proto”? We can come up with an easy rule for this case. Let’s call it Rule-1: “Field names should NOT contain spaces”. Fair/good enough!
Let’s move on to the next field. What is its correct name: “src zone” or just “zone”? A human can recognize that “src zone” is the correct field name, so we have just violated our Rule-1. We could continue this cycle of adding, violating, modifying, and removing rules, only to recognize that the cycle never ends. Put simply, there is no single solution or rule set that can extract structure from ALL unstructured data; there will always be a degenerate case that violates the rules.
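Here is what Rule-1 does when applied mechanically. This is a rough Python sketch under two stated assumptions: a field name is a run of word characters with no spaces (Rule-1), and a value runs until the next thing that looks like a field name followed by an equals sign. The function name `apply_rule_1` is made up for this example.

```python
import re

# Rule-1: field names contain no spaces.
# Assumption: a value extends (lazily) up to the next " name=" or the end of the line.
RULE_1_PATTERN = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def apply_rule_1(line):
    """Return (name, value) pairs in order, keeping repeated names."""
    return RULE_1_PATTERN.findall(line)


fragment = "service=Network Time proto=17 src zone=noc-mgt dst zone=noc-svcs"

for name, value in apply_rule_1(fragment):
    print(f"{name}={value}")
# service=Network Time
# proto=17 src
# zone=noc-mgt dst
# zone=noc-svcs
```

The rule gets service right, but “src” and “dst” leak into the preceding values and the two zone fields become indistinguishable, which is exactly the violation described above.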
Stay tuned! Links in this section are coming soon….
– Delimiter based key-value pair extraction
– Delimiter based KV extraction – advanced
----------------------------------------------------
Thanks!
Ledion Bitincka