Apache Pulsar’s tiered storage feature enables Pulsar to offload older messages on a topic to a long-term storage system, freeing up space in Apache BookKeeper and taking advantage of scalable low-cost storage options such as cloud storage.
Tiered storage is valuable for a topic for which you want to retain data for a very long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time so that if you create a new version of your recommendation model you can rerun it against the full user history.
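Tiered storage works hand in hand with Pulsar’s retention policies: data can only be offloaded and replayed later while it is still retained. As a hedged example (values are illustrative; adjust to your own namespace), the following command configures a namespace to retain data indefinitely, where -1 means unlimited size and time:

$ bin/pulsar-admin namespaces set-retention public/default --size -1 --time -1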
Apache Pulsar stores topics using what we call a segment-oriented architecture. A topic in Pulsar is persisted to a log, known as a managed ledger, stored in Apache BookKeeper. This log is composed of an ordered list of segments. Because a log is append-only, Pulsar only writes to the final segment of the log. All previous segments are sealed, and the data within the segment is immutable.
The tiered storage offloading mechanism takes advantage of this segment-oriented architecture. When a segment is offloaded to an external storage system, the segments of the log are copied, one-by-one, to that storage system. All segments of the log, apart from the segment currently being written to, can be offloaded.
Tiered storage illustration
Apache Pulsar currently supports multiple cloud storage systems for tiered storage. In this post we’ll walk through a simple example of configuring a standalone Pulsar cluster to use Amazon S3 to store the offloaded segments. The key steps that we’ll cover:

- Creating an Amazon S3 bucket to hold offloaded segments
- Configuring a standalone Pulsar cluster to offload to that bucket
- Producing messages to create topic segments
- Triggering an offload and verifying the result
There’s also a video recording of these steps at the end of this blog post.
Our first step is to create the S3 bucket that will be used as tiered storage. To do that, we first log in to the AWS console and choose the S3 service.
Using the AWS Console to create an Amazon S3 bucket
Then, create a bucket: click the “Create bucket” button, give the bucket a name, click “Next” through the configuration screens, and confirm the creation.
Creating an Amazon S3 bucket
After this, the new bucket should appear in your bucket list.
Successful creation of an Amazon S3 bucket
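If you prefer the command line, the same bucket can also be created with the AWS CLI (this assumes the AWS CLI is installed and configured; the bucket name matches the one we’ll reference later in this walkthrough):

$ aws s3 mb s3://offload-test-aws --region us-east-1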
Also make sure your AWS credentials are set correctly, for example in ~/.aws/credentials:
$ cat ~/.aws/credentials
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXX
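Alternatively, the offloader can pick up credentials from the standard AWS environment variables in the environment used to start Pulsar (the values below are placeholders):

$ export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXXXX
$ export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXX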
Now let’s configure Pulsar to use that S3 bucket as a cold storage tier.
To do that, first download the Pulsar binary release (apache-pulsar-x.x.x-bin.tar.gz) from http://pulsar.apache.org/en/download and un-tar it.
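For example (the exact file name depends on the version you download):

$ tar xvfz apache-pulsar-x.x.x-bin.tar.gz
$ cd apache-pulsar-x.x.x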
Then, from the un-tarred root directory, edit conf/standalone.conf and add the offload configuration settings at the end of the file:
managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadBucket=offload-test-aws
s3ManagedLedgerOffloadRegion=us-east-1
Also in conf/standalone.conf, lower the ledger size and rollover time limits so that topics roll over to new segments more quickly, which makes it easier to produce multiple segments for this test:
# Max number of entries to append to a ledger before triggering a rollover
# A ledger rollover is triggered on these conditions
#  * Either the max rollover time has been reached
#  * or max entries have been written to the ledger and at least min-time has passed
managedLedgerMaxEntriesPerLedger=1000

# Minimum time between ledger rollover for a topic
managedLedgerMinLedgerRolloverTimeMinutes=0
Then start Pulsar in standalone mode:
$ bin/pulsar standalone
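Once the standalone broker is up, you can sanity-check that it is reachable, for example by listing tenants (on a standalone cluster this should include the default public tenant):

$ bin/pulsar-admin tenants list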
Now let’s test our configuration by consuming and producing messages.
In a new terminal tab, run the consume command to create a subscription on the topic, so that topic data is not automatically dropped:
$ bin/pulsar-client consume -s "my-sub-name" my-topic-for-offload
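Note that by default pulsar-client consume exits after receiving a single message; if you want the consumer to stay running while you produce, you can pass -n 0, which the pulsar-client CLI treats as “consume forever”:

$ bin/pulsar-client consume -s "my-sub-name" my-topic-for-offload -n 0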
In a new terminal tab, run the produce command twice, so that enough entries are written to fill two segments (each segment rolls over after 1000 entries):
$ bin/pulsar-client produce my-topic-for-offload --messages "hello pulsar this is the content for each message" -n 1000
Now let’s manually trigger an offload using the Pulsar admin CLI. The --size-threshold argument specifies the maximum amount of data to keep in BookKeeper; complete segments beyond that threshold are copied to S3:
$ bin/pulsar-admin topics offload --size-threshold 10K public/default/my-topic-for-offload
Offload triggered for persistent://public/default/my-topic-for-offload for messages before 32:0:-1
Check the offload status with the CLI:
$ bin/pulsar-admin topics offload-status public/default/my-topic-for-offload
Offload was a success
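Offloading can take a while for larger topics. The offload-status command also accepts a -w/--wait-complete flag that blocks until the offload finishes (flag availability may vary by Pulsar version):

$ bin/pulsar-admin topics offload-status -w public/default/my-topic-for-offload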
Once the status is “success”, we can find the offloaded ledger in S3 using the AWS console.
Offloaded segment files stored in Amazon S3
If we use the Pulsar admin command to get the topic’s internal stats, we can see that ledger 31 is now marked "offloaded" : true.
$ bin/pulsar-admin topics stats-internal public/default/my-topic-for-offload
{
  "entriesAddedCounter" : 3200,
  "numberOfEntries" : 1200,
  "totalSize" : 111344,
  "currentLedgerEntries" : 200,
  "currentLedgerSize" : 18600,
  "lastLedgerCreatedTimestamp" : "2018-10-11T07:06:14.891+08:00",
  "waitingCursorsCount" : 0,
  "pendingAddEntriesCount" : 0,
  "lastConfirmedEntry" : "32:199",
  "state" : "LedgerOpened",
  "ledgers" : [ {
    "ledgerId" : 31,
    "entries" : 1000,
    "size" : 92744,
    "offloaded" : true
  }, {
    "ledgerId" : 32,
    "entries" : 0,
    "size" : 0,
    "offloaded" : false
  } ],
  "cursors" : { ... }
}
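In this walkthrough we triggered the offload manually, but Pulsar can also offload automatically once the topics in a namespace exceed a size threshold. A minimal sketch, assuming your Pulsar version includes the namespaces set-offload-threshold command (the 10M value is illustrative):

$ bin/pulsar-admin namespaces set-offload-threshold --size 10M public/default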
To see the step-by-step configuration process, this video walks through how to configure a full Pulsar cluster to use tiered storage in Amazon S3.
If you want more detail about how tiered storage works and how to configure it, please refer to our earlier blog post on the topic and the tiered storage documentation on the Pulsar website.
This post features contributions from Ivan Kelly and Jia Zhai.
----------------------------------------------------
Thanks!
Ivan Kelly