So you've set up a Google Cloud Logging sink along with a Dataflow pipeline and are happily ingesting these events into your Splunk infrastructure — great! But now what? How do you start to get meaningful insights from this data? In this blog post, I'll share eight useful signals hiding within Google Cloud audit logs. You'll learn how to detect:

- Service account creation events
- Service account key export events
- An activity feed of compute instance create, update, and delete operations (including who deleted what)
- Live migration events
- Host errors and automatic restarts
- New instances launched with the default service account
- Create, update, and delete activity across all of your projects
- Project access granted to members outside your organization
Finally, we’ll wrap up with a simple dashboard that captures all these queries in one place. Let's get started!
Google Cloud audit logs provide extensive information about what's happening within your cloud projects. There are two general categories of these audit logs: "Admin Activity" and "Data Access."
An "Admin Activity" log is intended to capture events such as creating, updating, or modifying a cloud resource. For example, Cloud Spanner will log instance and database creation events using this audit log type. In general, only administrative activity by a user, service account, or Google robot account will cause "Admin Activity" logs to generate.
In contrast, "Data Access" logs are emitted when a user accesses data contained within a Google Cloud product. For example, when a user performs a SQL query against a Spanner database, both the access event and the query performed are captured in a "Data Access" audit log.
Google turns Data Access logs off by default, allowing customers to selectively enable them when needed. I encourage you to review Google's exhaustive list of cloud services that support each type of audit log and determine what's best for your use case.
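If you do decide to turn on Data Access logs for a particular service, that's typically done through the auditConfigs section of the project's IAM policy. The snippet below is only an illustrative sketch for Cloud Spanner reads and writes; adapt the service name and log types to your needs, then apply the updated policy with "gcloud projects set-iam-policy" (or use the Audit Logs page in the Cloud Console):

"auditConfigs": [
  {
    "service": "spanner.googleapis.com",
    "auditLogConfigs": [
      { "logType": "DATA_READ" },
      { "logType": "DATA_WRITE" }
    ]
  }
]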
Note that not all services support both audit log types. Admin Activity audit logs are always enabled for all Google Cloud services and can't be disabled or configured. For the purposes of this blog, we'll be focusing on these "Admin Activity" logs.
While there is a general JSON representation of a "LogEntry" object in Google's reference documentation, a specific schema structure for all audit log events doesn't exist. (Well, at least not that I've been able to find!) You can think of the documented structure as more of a general container for numerous embedded schema types.
The following is an example Compute Engine "Admin Activity" audit log:
{ "insertId": "by2p8sdcrmw", "logName": "projects/redacted-151018/logs/cloudaudit.googleapis.com%2Factivity", "operation": { "id": "operation-1609787710025-5b817e89eb0fa-8881f3ea-c2110d78", "last": true, "producer": "compute.googleapis.com" }, "protoPayload": { "@type": "type.googleapis.com/google.cloud.audit.AuditLog", "authenticationInfo": { "principalEmail": "redacted-automation@redacted-151018.iam.gserviceaccount.com" }, "methodName": "beta.compute.instances.insert", "request": { "@type": "type.googleapis.com/compute.instances.insert" }, "requestMetadata": { "callerIp": "redacted", "callerSuppliedUserAgent": "google-api-go-client/0.5 Terraform/ (+https://www.terraform.io) Terraform-Plugin-SDK/2.0.1 terraform-provider-google-beta/dev,gzip(gfe)" }, "resourceName": "projects/redacted-151018/zones/us-central1-b/instances/dev-splunk-fwd-0-fa1f685", "serviceName": "compute.googleapis.com" }, "receiveTimestamp": "2021-01-04T19:15:28.005258468Z", "resource": { "labels": { "instance_id": "5860031931766200273", "project_id": "redacted-151018", "zone": "us-central1-b" }, "type": "gce_instance" }, "severity": "NOTICE", "timestamp": "2021-01-04T19:15:26.961372Z" }
In the example above, the "protoPayload" section of the event carries its own "@type" field, and the nested "request.@type" field further specifies the schema of the embedded request object. Each type of audit operation can emit its own object structure into an event, making Splunk ideal for navigating and reporting on this type of data. Splunk's "schema on read" approach allows you to search, transform, and visualize cloud audit logs in ways you may not anticipate until you start exploring the data you have!
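Because each service and operation can bring its own embedded structure, a quick way to get a feel for what you're actually ingesting is to simply count audit events by service and method name. This is just a starting-point sketch (remember to swap in your own index name, as noted below):

index=main "protoPayload.@type"="type.googleapis.com/google.cloud.audit.AuditLog"
| stats count by protoPayload.serviceName, protoPayload.methodName
| sort - count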
The example SPL contained in this article assumes that Google Cloud events have been pushed to a Splunk HTTP Event Collector (HEC) using the Dataflow method described in Google's documentation. Please see the footnote section of this post if you are using the Pub/Sub input feature of the Splunk Add-on for Google Cloud Platform to ingest logs instead.
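For reference, that pipeline is typically a Dataflow job created from Google's Pub/Sub to Splunk template. The following is only an illustrative sketch; verify the template location and parameter names against Google's current documentation, and note that the job name, subscription, HEC URL, and token shown here are placeholders:

gcloud dataflow jobs run gcp-logs-to-splunk \
  --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk \
  --region=us-central1 \
  --parameters=inputSubscription=projects/my-project/subscriptions/my-log-sub,url=https://splunk-hec.example.com:8088,token=REDACTED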
Finally, regardless of ingestion method, please ensure you change the "index=main" references to the actual index name which contains your Google Cloud log events. If you aren't sure which index you are using, please refer to the "Data Inputs > HTTP Event Collector" section of the Splunk interface.
Service account events are some of the most important activities to watch for in Google Cloud audit logs. The following queries will help you keep track of both the creation of service accounts and the exporting of service account credentials.
What service accounts have been created and by whom? This SPL will generate a table of those events.
index=main resource.type="service_account" protoPayload.methodName="google.iam.admin.v1.CreateServiceAccount" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.response.email as "Service Account Email" | rename protoPayload.response.project_id as Project | table _time, "Principal Email", "Source IP", "User Agent", Project, "Service Account Email"
An exported service account key is a non-expiring, static credential whose generation and subsequent use can easily go undetected. Sounds scary, right? Use this SPL to keep a watchful eye out for key export events. You'll likely want to alert on these events so you can track down the principal immediately and determine whether their use case truly warrants the risk. In general, if you are running the workload inside Google Cloud, exporting a key shouldn't be necessary.
index=main resource.type="service_account" protoPayload.methodName="google.iam.admin.v1.CreateServiceAccountKey" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.response.name as "Key Name" | rename protoPayload.response.valid_after_time.seconds as "Valid After" | rename protoPayload.response.valid_before_time.seconds as "Valid Before" | eval "Valid After"=strftime('Valid After', "%F %T") | eval "Valid Before"=strftime('Valid Before', "%F %T") | eval "Private Key Type" = case('protoPayload.request.private_key_type' == 0, "Unspecified", 'protoPayload.request.private_key_type' == 1, "PKCS12", 'protoPayload.request.private_key_type' == 2, "Google JSON credential file") | table _time, "Principal Email", "Source IP", "User Agent", "Key Name", "Private Key Type", "Valid After", "Valid Before"
Please note that when using "strftime" in the "eval" command, field names that contain spaces must be surrounded with single quotes (') rather than double quotes ("). In SPL, single quotes reference a field's value, while double quotes denote a string literal or the name of the field being created.
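As a minimal, hypothetical illustration: in the eval below, "Days Until Expiry" in double quotes is the literal name of the new field being created, while 'Valid Before' in single quotes refers to the value of an existing field (this snippet assumes 'Valid Before' still holds epoch seconds rather than the formatted string produced above):

| eval "Days Until Expiry"=round(('Valid Before' - now())/86400, 0)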
For many organizations, compute is the bread-and-butter service they consume in Google Cloud. Here are a few queries you may find useful.
The following SPL is a great way to get an "activity feed" of create, update, and delete events in your compute environment. Did Sally just launch 1000 virtual machines using her favorite n1-standard-64 instance template? Maybe we should ask her to scale that back!
index=main resource.type="gce_instance" operation.first=true | regex protoPayload.methodName="^\w+?\.compute\.instances\.\w+?$" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.resourceName as "Resource Name" | table _time, "Principal Email", "Source IP", "User Agent", "API Method", "Resource Name"
Sometimes you just want to know who deleted stuff. Similar to the previous query, this SPL will help you answer the question of "which users deleted virtual machines in the last week?" Maybe send them a thank you note for cleaning up after themselves!
index=main resource.type="gce_instance" operation.first=true earliest=-7d latest=now | regex protoPayload.methodName="^\w+?\.compute\.instances\.delete$" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.resourceName as "Resource Name" | table _time, "Principal Email", "Source IP", "User Agent", "API Method", "Resource Name"
One unique feature that differentiates Google Compute Engine from its competitors is its live migration capability. Using this feature, Google is able to perform regular infrastructure maintenance and upgrades without causing impactful outages or requiring customers to relaunch workloads on new hosts. It's also how Google was able to patch its entire Compute Engine fleet, without impact on customers, prior to the announcement of Spectre and Meltdown. While mostly hitless, these live migrations do introduce a brief "blackout" on virtual machines that can last several seconds. It is useful to understand when these events occur, especially if you find yourself investigating moments of elevated response times further up your application stack. Use the following SPL to track down live migration events in your audit logs.
index=main resource.type="gce_instance" protoPayload.methodName="compute.instances.migrateOnHostMaintenance" | rename protoPayload.resourceName as Instance | table _time, Instance
Did you know you can simulate a live migration event through the Google Cloud API? See "gcloud help compute instances simulate-maintenance-event" for more information. This is a great way to trigger a live migration under controlled circumstances in order to observe its impact on your application.
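For example (the instance name and zone below are placeholders for your own):

gcloud compute instances simulate-maintenance-event my-instance --zone=us-central1-b

Once the simulated maintenance event completes, a corresponding migrateOnHostMaintenance entry should show up in the query above.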
Google tries to detect early warning signs of failing Compute Engine hardware and perform proactive live migrations to healthy hardware when possible. Unfortunately, even Google can't save your virtual machines from every failure scenario. Use the following SPL to track down host fault and automatic restart events along with your own manual reset events. This may help explain why your monitoring system detected a brief outage.
index=main resource.type="gce_instance" | regex protoPayload.methodName="^\w+?\.?compute\.instances\.(hostError|automaticRestart|reset)$" | rename protoPayload.resourceName as Instance | rename protoPayload.methodName as Event | table _time, Event, Instance
For those getting started in the cloud, Identity and Access Management (IAM) can be one of the most confusing and complex subject areas to understand. With its default Project Editor permissions, the Compute Engine default service account is a bit of a shortcut Google provides around understanding IAM for simple workload use cases. The default service account is enabled on all instances created by the gcloud command-line tool and the Cloud Console unless specifically overridden. Google cautions against its use but also acknowledges that deleting it "might cause any applications that depend on the service account's credentials to fail." This is especially important in a brownfield cloud environment where the default service account might already be widely used. It's also highly likely that if a machine is using the default service account, the actual IAM permissions it requires are not well understood. Further, even if you do manage to create a new service account with just the right IAM permissions, you still must shut down the instance and incur an outage in order to swap in the new dedicated service account.
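If you do take that outage, the swap itself is only a few commands. Here's a rough sketch using placeholder instance and service account names (see "gcloud help compute instances set-service-account" for the full set of flags):

gcloud compute instances stop my-instance --zone=us-central1-b
gcloud compute instances set-service-account my-instance --zone=us-central1-b --service-account=my-dedicated-sa@my-project.iam.gserviceaccount.com --scopes=cloud-platform
gcloud compute instances start my-instance --zone=us-central1-b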
So even if you can't fully purge your environment of the default service account, it sure would be nice to find new instances being launched with it. Use the following SPL to track down those events and help those users create a dedicated service account instead.
index=main resource.type="gce_instance" "protoPayload.request.serviceAccounts{}.email"="default" | regex protoPayload.methodName="^\w+?\.?compute\.instances\.insert" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.resourceName as "Resource Name" | rename "protoPayload.request.serviceAccounts{}.email" as "Service Account" | table _time, "Principal Email", "Source IP", "User Agent", "Resource Name", "Service Account"
A generic list of create, update, and delete events grouped by project is a great high-level way to review activity across your organization. This SPL will generate that report for you.
index=main "protoPayload.@type"="type.googleapis.com/google.cloud.audit.AuditLog" operation.first="true" | regex protoPayload.methodName="\S*\.([Ii]nsert|[Cc]reate|[Uu]pdate|[Dd]elete)\S*$" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename resource.labels.project_id as "Project" | rename protoPayload.resourceName as "Resource Name" | stats list("Principal Email"), list("API Method"), list("Resource Name") by Project
Google Cloud permits granting project access to users outside the project's own organization. It's always good to keep a watchful eye on these activities. This SPL will show you project-wide IAM bindings being granted to accounts that aren't within your own domain or within your project's service account domain. Make sure you replace splunk.com and the service account address with your own domains.
index=main "protoPayload.@type"="type.googleapis.com/google.cloud.audit.AuditLog" | spath "resource.type" | search "resource.type"="project" | spath protoPayload.serviceData.policyDelta.bindingDeltas{}.member output=Member | search Member=* | regex Member!="(.*@splunk\.com|.*@redacted\.iam\.gserviceaccount.com)" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | spath "resource.labels.project_id" | rename "resource.labels.project_id" as "Project" | table _time, "Principal Email", Project, Member
Let's wrap things up by capturing the aforementioned queries into a simple dashboard. Having these in one place should help you get started with your GCP auditing efforts.
<dashboard> <label>Matt's Super Cool Dashboard</label> <row> <panel> <title>Service Account Creation Events</title> <table> <search> <query>index=main resource.type="service_account" protoPayload.methodName="google.iam.admin.v1.CreateServiceAccount" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.response.email as "Service Account Email" | rename protoPayload.response.project_id as Project | table _time, "Principal Email", "Source IP", "User Agent", Project, "Service Account Email"</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>Service Account Key Creation Events</title> <table> <search> <query>index=main resource.type="service_account" protoPayload.methodName="google.iam.admin.v1.CreateServiceAccountKey" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.response.name as "Key Name" | rename protoPayload.response.valid_after_time.seconds as "Valid After" | rename protoPayload.response.valid_before_time.seconds as "Valid Before" | eval "Valid After"=strftime('Valid After', "%F %T") | eval "Valid Before"=strftime('Valid Before', "%F %T") | eval "Private Key Type" = case('protoPayload.request.private_key_type' == 0, "Unspecified", 'protoPayload.request.private_key_type' == 1, "PKCS12", 'protoPayload.request.private_key_type' == 2, "Google JSON credential file") | table _time, "Principal Email", "Source IP", "User Agent", "Key Name", "Private Key Type", "Valid After", "Valid Before"</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>Compute instance create, update, delete operations</title> <table> <search> <query>index=main resource.type="gce_instance" operation.first=true | regex protoPayload.methodName="^\w+?\.compute\.instances\.\w+?$$" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.resourceName as "Resource Name" | table _time, "Principal Email", "Source IP", "User Agent", "API Method", "Resource Name"</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option 
name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>Compute delete operations (last 7 days)</title> <table> <search> <query>index=main resource.type="gce_instance" operation.first=true earliest=-7d latest=now | regex protoPayload.methodName="^\w+?\.compute\.instances\.delete$$" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.resourceName as "Resource Name" | table _time, "Principal Email", "Source IP", "User Agent", "API Method", "Resource Name"</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>Live migrated hosts</title> <table> <search> <query>index=main resource.type="gce_instance" protoPayload.methodName="compute.instances.migrateOnHostMaintenance" | rename protoPayload.resourceName as Instance | table _time, Instance</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>Host errors</title> <table> <search> <query>index=main resource.type="gce_instance" | regex protoPayload.methodName="^\w+?\.?compute\.instances\.(hostError|automaticRestart|reset)$$" | rename protoPayload.resourceName as Instance | rename protoPayload.methodName as Event | table _time, Event, Instance</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>Virtual machines launched with the default service account</title> <table> <search> <query>index=main resource.type="gce_instance" "protoPayload.request.serviceAccounts{}.email"="default" | regex protoPayload.methodName="^\w+?\.?compute\.instances\.insert" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename protoPayload.requestMetadata.callerIp as "Source IP" | rename protoPayload.requestMetadata.callerSuppliedUserAgent as "User Agent" | rename protoPayload.resourceName as "Resource Name" | rename "protoPayload.request.serviceAccounts{}.email" as "Service Account" | table _time, "Principal Email", "Source IP", "User Agent", "Resource Name", "Service Account"</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> 
<option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>Resource create, delete, and update operations</title> <table> <search> <query>index=main "protoPayload.@type"="type.googleapis.com/google.cloud.audit.AuditLog" operation.first="true" | regex protoPayload.methodName="\S*\.([Ii]nsert|[Cc]reate|[Uu]pdate|[Dd]elete)\S*$$" | rename protoPayload.methodName as "API Method" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | rename resource.labels.project_id as "Project" | rename protoPayload.resourceName as "Resource Name" | stats list("Principal Email"), list("API Method"), list("Resource Name") by Project</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> <row> <panel> <title>External members granted access to organization</title> <table> <search> <query>index=main "protoPayload.@type"="type.googleapis.com/google.cloud.audit.AuditLog" | spath "resource.type" | search "resource.type"="project" | spath protoPayload.serviceData.policyDelta.bindingDeltas{}.member output=Member | search Member=* | regex Member!="(.*@splunk\.com|.*@redacted\.iam\.gserviceaccount.com)" | rename protoPayload.authenticationInfo.principalEmail as "Principal Email" | spath "resource.labels.project_id" | rename "resource.labels.project_id" as "Project" | table _time, "Principal Email", Project, Member</query> <earliest>-24h@h</earliest> <latest>now</latest> <sampleRatio>1</sampleRatio> </search> <option name="count">20</option> <option name="dataOverlayMode">none</option> <option name="drilldown">none</option> <option name="percentagesRow">false</option> <option name="rowNumbers">false</option> <option name="totalsRow">false</option> <option name="wrap">true</option> </table> </panel> </row> </dashboard>
Audit logs are one of my favorite things in Google Cloud. If you've ever wondered where to start with Google Cloud logs, definitely start here! You'll get a tremendous amount of insight into those otherwise opaque cloud projects. As you have probably noticed, there are endless possibilities as you dive into analyzing your audit logs. The examples in this blog post only scratch the surface.
I can't take credit for all the query ideas in this article. Some have been adapted from Cloud Logging and BigQuery examples found throughout Google Cloud documentation. As you browse through the links below, you'll likely see other queries you'll want to adapt to SPL. And of course when you do, please be sure to send them my way!
If you are using the Pub/Sub input feature of the Splunk Add-on for Google Cloud Platform rather than Dataflow to HEC, you will find that the log data structure is slightly different. In the case of a log message ingested using the add-on Pub/Sub input, you will need to add "data." to the start of key name references in your SPL. For example, if an SPL example references "protoPayload.requestMetadata.callerIp", it should be changed to read "data.protoPayload.requestMetadata.callerIp" instead.
The following example illustrates how to adapt a query.
Dataflow compatible (before):
index=main resource.type="gce_instance" | regex protoPayload.methodName="^\w+?\.?compute\.instances\.(hostError|automaticRestart|reset)$" | rename protoPayload.resourceName as Instance | rename protoPayload.methodName as Event | table _time, Event, Instance
Add-on compatible (after):
index=main resource.type="gce_instance" | regex data.protoPayload.methodName="^\w+?\.?compute\.instances\.(hostError|automaticRestart|reset)$" | rename data.protoPayload.resourceName as Instance | rename data.protoPayload.methodName as Event | table _time, Event, Instance