With Kinesis Firehose being Splunk’s preferred option for collecting logs at scale from AWS CloudWatch Logs, we’ve seen plenty of posts on setting it up, automating it, and transforming event content. But what about when things go wrong?
When Kinesis Firehose fails to write to Splunk via HEC (due to a connection timeout, HEC token issues, or other connectivity problems), it will eventually write the affected logs to a “splashback” S3 bucket to ensure no data is lost. However, if you wish to retry sending the contents of those logs back into Splunk, note that the log contents Firehose writes to the “splashback” bucket are wrapped in JSON containing additional information about the failure, with the original message base64 encoded.
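To illustrate, here is a minimal Python sketch of what one wrapped record looks like and how the original payload can be recovered. The field names shown (such as rawData and errorCode) are representative assumptions about the Firehose failure format and should be verified against the objects in your own bucket:

```python
import base64
import json

# A representative splashback record, as Firehose might write it (one JSON
# document per line). The field names here are illustrative assumptions.
wrapped = json.dumps({
    "attemptsMade": 4,
    "errorCode": "Splunk.ConnectionTimeout",
    "rawData": base64.b64encode(b'{"event": "original log line"}').decode("ascii"),
})

record = json.loads(wrapped)
original_event = base64.b64decode(record["rawData"]).decode("utf-8")
print(original_event)  # the payload originally sent to HEC
```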
This makes re-ingesting these “failed” logs a little more complex than simply using the Splunk Add-On for AWS, for instance, as the Add-On cannot decode the contents of the message directly into Splunk. Also note that Firehose cannot ingest directly from S3.
This blog describes two simple options for re-ingesting these logs using Lambda functions:
These solutions work with both Splunk Enterprise (on-premises or in your own cloud) and Splunk Cloud.
The main component of this solution is a simple Lambda function that makes ingestion via the Add-On possible. Once set up, the function is triggered when objects containing the failed logs from Firehose are written to the S3 bucket. The function reads the contents of each object, extracts and decodes the “raw content” that was originally sent via HEC, then writes the output back into the same bucket as an object prefixed with SplashbackRawFailed/.
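A rough sketch of such a function follows. This is not the author’s sample code from the repository; it assumes each splashback object contains one JSON record per line, with the original payload base64 encoded in a rawData field:

```python
import base64
import json
import urllib.parse


def decode_failed_object(body: str) -> str:
    """Extract and base64-decode the rawData field from each JSON record
    in a Firehose splashback object (assumed to be one JSON doc per line)."""
    decoded = []
    for line in body.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        decoded.append(base64.b64decode(record["rawData"]).decode("utf-8"))
    return "\n".join(decoded)


def lambda_handler(event, context):
    # Triggered by an S3 PUT notification on the splashback bucket.
    import boto3  # provided by the Lambda runtime
    s3 = boto3.client("s3")
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(rec["s3"]["object"]["key"])
        if key.startswith("SplashbackRawFailed/"):
            continue  # don't re-process our own output
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(
            Bucket=bucket,
            Key="SplashbackRawFailed/" + key,
            Body=decode_failed_object(body).encode("utf-8"),
        )
```

Writing the decoded output under a distinct prefix is what lets the Add-On’s S3 input be scoped to only the re-processed objects.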
These objects can then be ingested by the Splunk Add-On for AWS using its standard inputs and configuration for S3 ingest - we would recommend the SQS-based S3 input.
So the flow of data for a “failed” scenario, as shown in the diagram above, is as follows:
This solution is very similar to the previous method and uses a Lambda function to read from the S3 “splashback” bucket. However, rather than writing the output into S3, the function writes back into a Kinesis Firehose data stream. The advantage of this method over the first is that the data collection method into Splunk doesn't change, and no Add-On configuration is required.
For this method, although it would technically be possible to re-ingest back into the same Firehose, a separate dedicated “re-ingest” Firehose data stream is recommended. This has two advantages: it adds the option to send the events to a separate Splunk HEC token input (or even a separate instance), and it can provide a “generic” retry capability for any Firehose (note that the sample code takes this generic approach).
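A sketch of the re-ingest side, assuming the decoded events are sent to a dedicated stream via the Firehose put_record_batch API. The stream name splunk-reingest is a placeholder, and the limit of 500 records per call is the API’s documented maximum:

```python
MAX_BATCH = 500  # put_record_batch accepts at most 500 records per call


def batch_records(raw_events):
    """Wrap decoded event strings as Firehose record dicts and split them
    into batches no larger than the put_record_batch limit."""
    records = [{"Data": (event + "\n").encode("utf-8")} for event in raw_events]
    return [records[i:i + MAX_BATCH] for i in range(0, len(records), MAX_BATCH)]


def reingest(raw_events, stream_name="splunk-reingest"):  # placeholder name
    import boto3  # provided by the Lambda runtime
    firehose = boto3.client("firehose")
    for batch in batch_records(raw_events):
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        if response.get("FailedPutCount", 0):
            # A production function should retry just the failed entries.
            raise RuntimeError(
                f"{response['FailedPutCount']} records failed to re-ingest"
            )
```

Because put_record_batch can partially fail, checking FailedPutCount on every call matters; silently dropping those entries would defeat the purpose of the retry path.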
The flow of data for a “failed” scenario, as shown in the diagram above, is as follows:
This solution is the recommended option. Note, however, that if there is a very prolonged disconnect between Firehose and Splunk HEC, the volume of re-ingested data - and therefore the load on the retry Firehose - may be significant and beyond a single Firehose’s capacity. This is unlikely in most cases, as disconnects (especially to Splunk Cloud) rarely last long. The example function provides a “timeout” mechanism for looping retries (a maximum of 9 attempts, which could span up to 18 hours); this prevents a continuous loop where connectivity to Splunk is lost entirely. In the event of a full time-out, the events are eventually written (not encoded) to S3 in the same way as in the first option.
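One way to implement such a cap is sketched below. The reingest_attempts counter is a hypothetical field stamped onto each wrapped event by the re-ingest function - it is not a native Firehose field, and the actual sample code may track attempts differently:

```python
MAX_ATTEMPTS = 9  # combined with Firehose's own retry window, this can
                  # span many hours before the function gives up


def next_action(record: dict):
    """Decide whether a splashback record should be re-ingested or
    given up on. Assumes a hypothetical 'reingest_attempts' counter
    that the re-ingest function adds to each wrapped event."""
    attempts = record.get("reingest_attempts", 0)
    if attempts >= MAX_ATTEMPTS:
        # Final fallback: write the decoded event to S3, as in option one.
        return ("write_to_s3", record)
    updated = dict(record, reingest_attempts=attempts + 1)
    return ("reingest", updated)
```

The key property is that every trip around the loop increments the counter, so a total loss of connectivity converges on the S3 fallback instead of cycling forever.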
Full details of the setup instructions and the source code for the sample Lambda functions can be found here: https://github.com/pauld-splunk/aws-splunk-firehose-error-reingest
Happy Splunking!