This blog post is part twenty-three of the "Hunting with Splunk: The Basics" series. That's right, it's a Brave New World of Cloud and we in Splunk Security are doubling down on it. We were thinking of releasing some ASMR podcasts on the subject, but apparently sleep teaching is actually prohibited in England (we assume something to do with GDPR). Once again, Matt Valites brings us some nuggets from the world of AWS. Hope you enjoy! – Ryan Kovar
In the previous post on AWS data, we looked at CloudTrail data and its ability to detail infrastructure-level activity within your AWS account. Similarly, this post will look at native AWS network telemetry—VPCFlows. We’ll explore what it is, how you can ingest it, and what value it provides from a security perspective.
Per the Netflow v9 RFC 3954, a flow “is defined as a unidirectional sequence of packets with some common properties that pass through a network device.” Flows lack information about packet contents but contain granular network metadata such as IP addresses, packet and byte counts, timestamps, application ports, input and output interfaces, etc. A frequently used analogy compares Netflow data to a phone record: from a network perspective, it shows who communicated with whom, when, where, and how much, but not what was said.
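The RFC definition can be made concrete with a minimal Python sketch. It groups packets that share the same 5-tuple (source and destination address, source and destination port, protocol) into unidirectional flows, each with packet and byte counters. The `Packet` type and the sample addresses are hypothetical. Real flow exporters, VPCFlows included, also use additional keys and time-based aggregation windows, which this sketch leaves out.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet header summary: metadata only, no payload."""
    src: str
    dst: str
    sport: int
    dport: int
    proto: int   # IP protocol number, e.g. 6 = TCP, 17 = UDP
    length: int  # bytes on the wire
    ts: int      # epoch seconds


FlowKey = Tuple[str, str, int, int, int]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, dict]:
    """Group packets sharing a 5-tuple into unidirectional flow records."""
    flows: Dict[FlowKey, dict] = {}
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        f = flows.get(key)
        if f is None:
            flows[key] = {"packets": 1, "bytes": p.length, "start": p.ts, "end": p.ts}
        else:
            f["packets"] += 1
            f["bytes"] += p.length
            f["start"] = min(f["start"], p.ts)
            f["end"] = max(f["end"], p.ts)
    return flows


# A request/response exchange: the reply travels in the opposite direction,
# so it lands in a separate flow record.
pkts = [
    Packet("10.0.0.1", "10.0.0.2", 5000, 53, 17, 60, 100),
    Packet("10.0.0.1", "10.0.0.2", 5000, 53, 17, 60, 101),
    Packet("10.0.0.2", "10.0.0.1", 53, 5000, 17, 120, 101),
]
for key, rec in aggregate_flows(pkts).items():
    print(key, rec)
```

The single exchange produces two flow records, one per direction. This is why requests and responses show up as separate entries.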
VPCFlows are AWS' version of network flows, available per interface, subnet (including interfaces), or VPC (including subnets and interfaces). If you’re already familiar with Netflow data, awesome! However, there are some caveats and small differences to be aware of with VPCFlows which we’ll discuss later.
VPCFlows can be ingested into Splunk using a couple of methods:
[Table: VPCFlow ingestion methods in Splunk. S = Supported, R = Recommended]
Leveraging Kinesis provides an ultra-scalable method of ingestion. However, the BOTS environment was small enough that, for simplicity of configuration, Frothly configured Splunk to ingest VPCFlow data via the supported CloudWatch Logs input.
Detailed ingestion instructions are beyond the scope of this blog post. However, for those interested, more information can be found in the links included in the Library below.
A raw VPCFlow record looks like the following, including field definitions:
<version> <account-id> <interface-id> <srcaddr> <dstaddr> <srcport> <dstport> <protocol> <packets> <bytes> <start> <end> <action> <log-status>
2 622676721278 eni-0536faba73134a9b7 13.125.33.130 172.16.0.127 40396 11211 17 1 50 1534778806 1534778866 ACCEPT OK
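Some readers may want to work with the raw format outside Splunk. The minimal Python sketch below splits the record above into its 14 named fields and converts the epoch start time to UTC. The field names mirror the header line, and the function name is our own. It assumes the default space-delimited format shown here, not a custom flow log format, and it treats "-" as a field with no data.

```python
from datetime import datetime, timezone

# Default VPCFlow field order, matching the header line above.
VPCFLOW_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
# Fields that hold integers; everything else stays a string
# (account_id included, so any leading zeros are preserved).
INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}


def parse_vpcflow(line: str) -> dict:
    """Split one space-delimited VPCFlow record into a dict of named fields."""
    values = line.split()
    if len(values) != len(VPCFLOW_FIELDS):
        raise ValueError(f"expected {len(VPCFLOW_FIELDS)} fields, got {len(values)}")
    record = {}
    for name, value in zip(VPCFLOW_FIELDS, values):
        # "-" marks a field with no data (e.g. NODATA records), so keep it as-is.
        record[name] = int(value) if name in INT_FIELDS and value != "-" else value
    return record


raw = ("2 622676721278 eni-0536faba73134a9b7 13.125.33.130 172.16.0.127 "
       "40396 11211 17 1 50 1534778806 1534778866 ACCEPT OK")
rec = parse_vpcflow(raw)
start = datetime.fromtimestamp(rec["start"], tz=timezone.utc)
print(start.isoformat())          # 2018-08-20T15:26:46+00:00
print(rec["end"] - rec["start"])  # 60 (seconds in the capture window)
```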
Let’s take a look at that data in Splunk and break it down a bit further:
Looking at this event, we are quickly able to see the following:
Who: Network communication happened between source host 13.125.33.130 on port 40396 and destination host 172.16.0.127 on port 11211.
What: A single UDP (protocol 17) packet carrying 50 bytes was accepted.
Where: The communication happened on interface eni-0536faba73134a9b7 in the us-west-1 region in account 622676721278.
When: The event occurred on 2018-08-20 at 15:26:46 (UTC).
Compared to the CloudTrail data that we examined in the last post, VPCFlow data is far more straightforward. There are 14 fields in total, and most of them are useful from a security perspective.
Note that a flow does not equate to a network session. Recall that a flow is unidirectional, meaning that a single flow represents a portion of a network session. Requests and responses will appear as separate flow entries. We will highlight this detail later on in the blog post.
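One way to reassemble sessions from unidirectional flows is to give each flow a direction-independent key, so that A-to-B and B-to-A flows land in the same bucket. The minimal Python sketch below assumes flows are already parsed into the hypothetical `Flow` type. It ignores timing, which a real implementation would also need to bound. The sample values are the port 40396 request and response discussed later in this post.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """Hypothetical parsed flow using Splunk-style field names."""
    src: str
    src_port: int
    dest: str
    dest_port: int
    protocol: int
    bytes: int
    packets: int


def conversation_key(f: Flow):
    """Direction-independent key: endpoints are sorted so A->B and B->A match."""
    a = (f.src, f.src_port)
    b = (f.dest, f.dest_port)
    return (f.protocol,) + tuple(sorted([a, b]))


def pair_flows(flows):
    """Group unidirectional flows into conversations, summing bytes per sender."""
    conversations = {}
    for f in flows:
        conv = conversations.setdefault(conversation_key(f), {})
        sender = (f.src, f.src_port)
        conv[sender] = conv.get(sender, 0) + f.bytes
    return conversations


flows = [
    Flow("13.125.33.130", 40396, "172.16.0.178", 11211, 17, 50, 1),
    Flow("172.16.0.178", 11211, "13.125.33.130", 40396, 17, 51327, 36),
]
for key, per_sender in pair_flows(flows).items():
    print(key, per_sender)
```

The two flows collapse into a single conversation, and the per-sender byte totals stay intact.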
AWS imposes certain limitations on VPCFlows, and the full list of caveats can be found in the Amazon VPC User Guide. Some of the more pertinent limitations include:
How can we use high-level network flow telemetry for security detection and investigation? From a monitoring perspective, there are some straightforward use cases, such as detecting traffic to never-before-seen IPs or ports, or detecting large-volume transfers. A slightly more clever use is detecting service amplification abuse, which is often used in Denial of Service (DoS) attacks. Amplification attacks, and the services they rely on, have been heavily documented elsewhere. In summary, the attack relies on an Internet-exposed, unsecured UDP service that lets an attacker send a small amount of data and trigger a much larger response (hence the ‘amplification’) directed at a spoofed victim.
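The two straightforward use cases above can be sketched in a few lines of Python. This is a minimal illustration, not a Splunk search. It assumes flows are dicts that use the src/dest/dest_port/bytes field names from the Splunk searches in this post. The baseline, the byte threshold, and the sample addresses (taken from the 203.0.113.0/24 documentation range) are all hypothetical.

```python
def new_destinations(baseline_flows, recent_flows):
    """(dest, dest_port) pairs in recent traffic that never appear in the baseline."""
    seen = {(f["dest"], f["dest_port"]) for f in baseline_flows}
    new = []
    for f in recent_flows:
        pair = (f["dest"], f["dest_port"])
        if pair not in seen:
            new.append(pair)
            seen.add(pair)  # report each never-before-seen pair only once
    return new


def large_transfers(flows, threshold_bytes):
    """Flows whose byte count meets or exceeds a (hypothetical) volume threshold."""
    return [f for f in flows if f["bytes"] >= threshold_bytes]


baseline = [{"src": "172.16.0.10", "dest": "203.0.113.5", "dest_port": 443, "bytes": 1200}]
recent = [
    {"src": "172.16.0.10", "dest": "203.0.113.5", "dest_port": 443, "bytes": 900},
    {"src": "172.16.0.10", "dest": "203.0.113.9", "dest_port": 8443, "bytes": 7_500_000},
]
print(new_destinations(baseline, recent))       # [('203.0.113.9', 8443)]
print(len(large_transfers(recent, 1_000_000)))  # 1
```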
2018 seemed to be the year of Memcached amplification attacks, which gave attackers amplification factors of 10,000 to 51,000! That means if an attacker sends 1 megabyte of spoofed payload, they can generate 10 to 51 gigabytes of attack traffic. To understand how Memcached can be leveraged for amplification attacks, we first need to understand how it works. According to Memcached’s website, the service “is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.” Its intended purpose is to speed up dynamic web applications. Memcached works at a high level as follows:
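Memcached is typically used in a cache-aside pattern. The application checks the cache first and queries the database only on a miss, then stores the result for later requests. A minimal Python sketch of that pattern follows. `FakeCache` is a hypothetical dict-backed stand-in for a real Memcached client, with no expiry, eviction, or networking. `get_user_profile` and `slow_db_lookup` are illustrative names.

```python
class FakeCache:
    """Hypothetical in-process stand-in for a Memcached client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)  # None signals a cache miss

    def set(self, key, value):
        self._store[key] = value


def get_user_profile(cache, db_lookup, user_id):
    """Cache-aside read: serve from cache, else query the database and cache it."""
    key = f"user:{user_id}"
    profile = cache.get(key)
    if profile is None:
        profile = db_lookup(user_id)  # slow path: hit the backing database
        cache.set(key, profile)       # later requests are served from memory
    return profile


db_calls = []


def slow_db_lookup(user_id):
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user{user_id}"}


cache = FakeCache()
get_user_profile(cache, slow_db_lookup, 42)
get_user_profile(cache, slow_db_lookup, 42)
print(db_calls)  # [42]: the second read never reached the database
```

In the intended design, only the application's frontend reads and writes the cache.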
Memcached’s default configuration is inherently weak, since it requires no authentication or authorization. Like many UDP protocols susceptible to amplification attacks, Memcached was not intended to be exposed to the Internet. However, if you build it, they will come. Once a server is exposed, attackers can test for its presence (often using the ‘stats’ command), circumvent the intended frontend service, and seed the cache with their own key (i.e., payload) before requesting that the payload be sent to a spoofed victim.
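The following minimal Python sketch shows how a Memcached command is framed for UDP, to make the "small request" side concrete. It follows the 8-byte UDP frame header described in Memcached's protocol documentation: request ID, sequence number, total datagrams, and a reserved field. The function name, the request ID, and the key `injected` are illustrative assumptions.

```python
import struct


def memcached_udp_request(command: bytes, request_id: int = 1) -> bytes:
    """Prefix a text-protocol command with Memcached's 8-byte UDP frame header.

    The header is four 16-bit big-endian integers: request ID, sequence
    number (0 for a request), total datagrams in the message (1), reserved (0).
    """
    header = struct.pack("!HHHH", request_id, 0, 1, 0)
    return header + command


# A presence probe and a retrieval of an illustrative key; both are tiny.
probe = memcached_udp_request(b"stats\r\n")
fetch = memcached_udp_request(b"get injected\r\n")
print(len(probe), len(fetch))  # 15 22
```

Both requests are only a couple of dozen bytes of UDP payload. The server's response, however, can be as large as whatever value the attacker has cached.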
Understanding the above, how can we use VPCFlow data to help detect amplification attacks? Let’s look at flows for one bidirectional communication between a client and Memcached server:
sourcetype=aws:cloudwatchlogs:vpcflow (src_port=11211 OR dest_port=11211) | head 4 | table _time, duration, account_id, region, interface_id, src, src_port, dest, dest_port, bytes, protocol, packets, vpcflow_action
Here we can see four flows between the external host 13.125.33.130 and the internal RFC1918 host 172.16.0.178, which is running Memcached. As we mentioned earlier in the post, flow IP addresses do not necessarily reflect the addresses in the network packet. An external host cannot communicate directly with an AWS-hosted RFC1918 address, so the actual network communications must have come through a public IP address, such as an Elastic IP or a load balancer.
Starting from the bottom of the table, the first connection is from 13.125.33.130 over UDP from port 22222 to the default memcached service port of 11211. The connection lasted 59 seconds over which 5072550 bytes were sent in 3422 packets. In response, host 172.16.0.178 sent 513727 bytes in 137 packets back to host 13.125.33.130.
Four seconds later, we see another connection between the same two hosts, this time between the Memcached service port and UDP port 40396. The external host sends 50 bytes in a single packet, and the internal host responds with 51327 bytes over 36 packets.
What’s going on here? To help explain these flows, we’ll pull in the corresponding wire data from Splunk Stream:
sourcetype=stream:udp (src_port=40396 OR src_port=11211 OR src_port=22222) | head 2 | eval short_src_content=substr(src_content,1,75) | eval short_dest_content=substr(dest_content,1,75) | table _time, bytes_in, bytes_out, src, src_port, dest, dest_port, short_src_content, short_dest_content
Whereas flows are collections of unidirectional packets, Stream shows the complete network connection. That difference shows up as two events in Stream, compared with four events in VPCFlows. The source, destination, ports, and timestamps correspond directly to the flow events. The byte values in the two data sources are close but vary slightly due to IP overhead such as packet options. The additional telemetry that we see in the Stream data is the source and destination content. The first event shows:
The Memcached server returns a value of ‘STORED’, indicating the data has been successfully cached. This single event shows an attacker loading the exposed Memcached server with a payload. The second Stream event issues a ‘get’ for the same key that was just cached. In an actual attack, the source address of the ‘get’ command would be spoofed to the victim's IP address. Since the requestor address is identical in both events, the activity is likely an attacker testing capabilities before launching an actual attack.
To detect UDP amplification, what we're most interested in is the byte counts. The flow records corresponding to the attack command seen in Stream (i.e., ‘get injected’) show 50 bytes sent and 51327 bytes received. That is an amplification factor of nearly 1027! In this example, the attacker limited the cached payload to 50000 bytes. Imagine the attack traffic if that payload were doubled or even tripled. This shows how abusing a misconfigured UDP service such as Memcached can be a highly efficient way for an attacker to send a small amount of data yet generate a relatively large amount of attack traffic.
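The arithmetic can be checked with a short Python sketch. Assume a 20-byte IPv4 header with no options, an 8-byte UDP header, and Memcached's 8-byte UDP frame header. Under those assumptions, a single ‘get injected’ request works out to exactly the 50 bytes in the flow record. The constants and function names are our own, and the sketch assumes the flow byte count covers the full IP packet.

```python
IPV4_HEADER = 20          # bytes, assuming no IP options
UDP_HEADER = 8            # bytes
MEMCACHED_UDP_FRAME = 8   # request ID, sequence, total datagrams, reserved


def ip_packet_size(command: bytes) -> int:
    """On-the-wire IPv4 size of a single-datagram Memcached UDP request."""
    return IPV4_HEADER + UDP_HEADER + MEMCACHED_UDP_FRAME + len(command)


def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    """Bytes sent back toward the (possibly spoofed) source per byte received."""
    return response_bytes / request_bytes


request = ip_packet_size(b"get injected\r\n")
print(request)                                         # 50
print(round(amplification_factor(request, 51327), 1))  # 1026.5
```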
The above example shows how VPCFlows can be used to investigate network activity, but they are equally effective at detecting it. Were this an actual investigation, we could use the knowledge gained to set up proactive monitoring via saved searches that detect future signs of UDP amplification attacks, such as:
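Below is a minimal Python sketch of the kind of logic such a saved search could encode. For each client/server pair, it sums the UDP bytes sent to a service port and the bytes the service sent back. It then flags pairs where the response dwarfs the request. The port set, the 10x threshold, and the function names are hypothetical and would need tuning. The sketch assumes flows carry the numeric protocol value (17 = UDP) from the raw record. In a real attack, the "client" address would be the spoofed victim.

```python
UDP = 17
AMPLIFICATION_PORTS = {11211}  # hypothetical: Memcached only; extend as needed
RATIO_THRESHOLD = 10.0         # hypothetical: response bytes per request byte


def amplification_suspects(flows, ports=AMPLIFICATION_PORTS, threshold=RATIO_THRESHOLD):
    """Flag UDP client/server pairs where the service replied with far more
    bytes than it received. Returns (client, client_port, server, server_port, ratio)."""
    sent = {}     # bytes client -> service, keyed by (client, cport, server, sport)
    replied = {}  # bytes service -> client, same key orientation
    for f in flows:
        if f["protocol"] != UDP:
            continue
        if f["dest_port"] in ports:
            key = (f["src"], f["src_port"], f["dest"], f["dest_port"])
            sent[key] = sent.get(key, 0) + f["bytes"]
        elif f["src_port"] in ports:
            key = (f["dest"], f["dest_port"], f["src"], f["src_port"])
            replied[key] = replied.get(key, 0) + f["bytes"]
    suspects = []
    for key, out_bytes in replied.items():
        in_bytes = sent.get(key, 0)
        if in_bytes and out_bytes / in_bytes >= threshold:
            suspects.append(key + (round(out_bytes / in_bytes, 1),))
    return suspects


def flow(src, src_port, dest, dest_port, nbytes):
    """Build a UDP flow dict with Splunk-style field names."""
    return {"src": src, "src_port": src_port, "dest": dest,
            "dest_port": dest_port, "protocol": UDP, "bytes": nbytes}


# The four flows from the Memcached example above.
ext, mc = "13.125.33.130", "172.16.0.178"
flows = [
    flow(ext, 22222, mc, 11211, 5072550),
    flow(mc, 11211, ext, 22222, 513727),
    flow(ext, 40396, mc, 11211, 50),
    flow(mc, 11211, ext, 40396, 51327),
]
print(amplification_suspects(flows))
# [('13.125.33.130', 40396, '172.16.0.178', 11211, 1026.5)]
```

Run against the four flows from this post, the sketch flags only the port 40396 exchange. The large upload on port 22222 received far fewer bytes back than it sent.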
----------------------------------------------------
Thanks!
Matthew Valites