Apache Pulsar’s tiered storage feature enables Pulsar to offload older messages on a topic to a long-term storage system, freeing up space in Apache BookKeeper and taking advantage of scalable low-cost storage options such as cloud storage.
Tiered storage is valuable for a topic for which you want to retain data for a very long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time so that if you create a new version of your recommendation model you can rerun it against the full user history.
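Tiered storage works hand in hand with Pulsar’s retention policies: data can only be offloaded and replayed later while it is still retained. As a hedged example (values are illustrative; adjust to your own namespace), the following command configures a namespace to retain data indefinitely, where -1 means unlimited size and time:

$ bin/pulsar-admin namespaces set-retention public/default --size -1 --time -1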
Apache Pulsar stores topics using what we call a segment-oriented architecture. A topic in Pulsar is persisted to a log, known as a managed ledger, stored in Apache BookKeeper. This log is composed of an ordered list of segments. Because a log is append-only, Pulsar only writes to the final segment of the log. All previous segments are sealed, and the data within the segment is immutable.
The tiered storage offloading mechanism takes advantage of this segment-oriented architecture. When a segment is offloaded to an external storage system, the segments of the log are copied, one-by-one, to that storage system. All segments of the log, apart from the segment currently being written to, can be offloaded.
Tiered storage illustration
Apache Pulsar currently supports multiple cloud storage systems for tiered storage. In this post we’ll walk through a simple example of configuring a standalone Pulsar cluster to use Amazon S3 to store the offloaded segments. The key steps that we’ll cover:

- Creating an Amazon S3 bucket to hold offloaded segments
- Configuring a standalone Pulsar cluster to offload to that bucket
- Producing messages to create topic segments
- Triggering an offload and verifying the result
There’s also a video recording of these steps at the end of this blog post.
Our first step is to create the S3 bucket that will be used as tiered storage. To do that, we first log in to the AWS console and choose the S3 service.
Using the AWS Console to create an Amazon S3 bucket
Then, create a bucket: click the “Create bucket” button, give the bucket a name, click “Next” through the configuration screens, and confirm the creation.
Creating an Amazon S3 bucket
After this, the new bucket should appear in your bucket list.
Successful creation of an Amazon S3 bucket
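If you prefer the command line, the same bucket can also be created with the AWS CLI (this assumes the AWS CLI is installed and configured; the bucket name matches the one we’ll reference later in this walkthrough):

$ aws s3 mb s3://offload-test-aws --region us-east-1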
Also make sure your AWS credentials are set correctly, for example in ~/.aws/credentials:
$ cat ~/.aws/credentials
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXX
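Alternatively, the offloader can pick up credentials from the standard AWS environment variables in the environment used to start Pulsar (the values below are placeholders):

$ export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXXXX
$ export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXX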
Now let’s configure Pulsar to use that S3 bucket as a cold storage tier.
To do that, first download the Pulsar binary release (apache-pulsar-x.x.x-bin.tar.gz) from http://pulsar.apache.org/en/download and un-tar it.
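For example (the exact file name depends on the version you download):

$ tar xvfz apache-pulsar-x.x.x-bin.tar.gz
$ cd apache-pulsar-x.x.x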
Then, from the un-tarred root directory, edit conf/standalone.conf and add the offload configuration settings at the end of the file:
managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadBucket=offload-test-aws
s3ManagedLedgerOffloadRegion=us-east-1
Also in conf/standalone.conf, lower the ledger size and rollover time limits so that topics roll over to new segments more quickly, which makes it easier to produce multiple segments for this test:
# Max number of entries to append to a ledger before triggering a rollover
# A ledger rollover is triggered on these conditions
#  * Either the max rollover time has been reached
#  * or max entries have been written to the ledger and at least min-time has passed
managedLedgerMaxEntriesPerLedger=1000

# Minimum time between ledger rollover for a topic
managedLedgerMinLedgerRolloverTimeMinutes=0
Then start Pulsar in standalone mode:
$ bin/pulsar standalone
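Once the standalone broker is up, you can sanity-check that it is reachable, for example by listing tenants (on a standalone cluster this should include the default public tenant):

$ bin/pulsar-admin tenants list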
Now let’s test our configuration by consuming and producing messages.
In a new terminal tab, run the consume command to create a subscription on the topic, so that topic data is not automatically dropped:
$ bin/pulsar-client consume -s "my-sub-name" my-topic-for-offload
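Note that by default pulsar-client consume exits after receiving a single message; if you want the consumer to stay running while you produce, you can pass -n 0, which the pulsar-client CLI treats as “consume forever”:

$ bin/pulsar-client consume -s "my-sub-name" my-topic-for-offload -n 0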
In a new terminal tab, run the produce command twice, so that enough entries are written to fill two segments (each segment rolls over after 1000 entries):
$ bin/pulsar-client produce my-topic-for-offload --messages "hello pulsar this is the content for each message" -n 1000
Now let’s manually trigger an offload using the Pulsar admin CLI. The --size-threshold argument specifies the maximum amount of data to keep in BookKeeper; complete segments beyond that threshold are copied to S3:
$ bin/pulsar-admin topics offload --size-threshold 10K public/default/my-topic-for-offload
Offload triggered for persistent://public/default/my-topic-for-offload for messages before 32:0:-1
Check the offload status with the CLI:
$ bin/pulsar-admin topics offload-status public/default/my-topic-for-offload
Offload was a success
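Offloading can take a while for larger topics. The offload-status command also accepts a -w/--wait-complete flag that blocks until the offload finishes (flag availability may vary by Pulsar version):

$ bin/pulsar-admin topics offload-status -w public/default/my-topic-for-offload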
Once the status is “success”, we can find the offloaded ledger in S3 using the AWS console.
Offloaded segment files stored in Amazon S3
If we use the Pulsar admin command to get the topic’s internal stats, we can see that ledger 31 is now marked "offloaded" : true.
$ bin/pulsar-admin topics stats-internal public/default/my-topic-for-offload
{
  "entriesAddedCounter" : 3200,
  "numberOfEntries" : 1200,
  "totalSize" : 111344,
  "currentLedgerEntries" : 200,
  "currentLedgerSize" : 18600,
  "lastLedgerCreatedTimestamp" : "2018-10-11T07:06:14.891+08:00",
  "waitingCursorsCount" : 0,
  "pendingAddEntriesCount" : 0,
  "lastConfirmedEntry" : "32:199",
  "state" : "LedgerOpened",
  "ledgers" : [ {
    "ledgerId" : 31,
    "entries" : 1000,
    "size" : 92744,
    "offloaded" : true
  }, {
    "ledgerId" : 32,
    "entries" : 0,
    "size" : 0,
    "offloaded" : false
  } ],
  "cursors" : { ... }
}
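In this walkthrough we triggered the offload manually, but Pulsar can also offload automatically once the topics in a namespace exceed a size threshold. A minimal sketch, assuming your Pulsar version includes the namespaces set-offload-threshold command (the 10M value is illustrative):

$ bin/pulsar-admin namespaces set-offload-threshold --size 10M public/default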
To see the step-by-step configuration process, this video walks through how to configure a full Pulsar cluster to use tiered storage in Amazon S3.
If you want more detail about how tiered storage works and how to configure it, please refer to our earlier blog post on the topic and the tiered storage documentation on the Pulsar website.
This post features contributions from Ivan Kelly and Jia Zhai.
----------------------------------------------------
Thanks!
Ivan Kelly