Facebook recently announced that it has released LogDevice to the world as open source. Given the similarity between the target use cases it lists and those of Apache Pulsar, people have naturally been asking how the two systems compare. In this blog post, we will provide some answers.
Comparing LogDevice and Pulsar is not an apples-to-apples comparison—LogDevice operates at a lower level than Pulsar. In fact, it’s more similar to Twitter’s DistributedLog in that it’s only concerned with the log primitive and as a result does not include higher-level features like schema management, multi-tenancy, and cursor management. These would be left for the user to implement on top of LogDevice.
So rather than trying to compare the feature matrix of both systems, this post will focus on the fundamental primitive that LogDevice and Pulsar have in common: the distributed log. We have tried to be as unbiased as we can, though as we work on Pulsar, some bias cannot be avoided. :)
LogDevice presents a log primitive to users. Writing clients write to a sequencer node, which assigns a log sequence number (LSN) to each entry and then writes the entry to a subset of nodes (copyset) of a larger set that is assigned to the log (nodeset). The sequencer in LogDevice is analogous to the broker in Pulsar, as it is the broker which assigns message IDs to messages and sends them to Apache BookKeeper for storage.
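To make the analogy concrete, here is a minimal sketch using the Pulsar Java client (the service URL and topic name are placeholders): the MessageId returned by send() is assigned by the broker, much as a LogDevice sequencer assigns an LSN.

```java
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class ProducerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("my-topic")
                .create();

        // The broker assigns the message ID, analogous to a LogDevice
        // sequencer assigning an LSN to each entry it accepts.
        MessageId id = producer.send("hello".getBytes());
        System.out.println("Broker-assigned message id: " + id);

        producer.close();
        client.close();
    }
}
```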
In general, LogDevice’s architecture has a number of things in common with Pulsar. Both separate compute from storage. This comes with many benefits compared to the monolithic architecture present in some other systems.
These benefits were covered in a previous blog post about Pulsar’s segment-oriented architecture, and all these benefits apply as much to LogDevice as they do to Pulsar.
LogDevice and Pulsar differ in how reads are performed. In Pulsar, reading clients subscribe to topics on the broker, receiving any messages for the topic from the broker. In LogDevice, reading clients connect directly to the storage nodes.
Reading directly from the storage nodes, as LogDevice does, allows a greater degree of fanout in reads. That is, the system can support more readers for a single log, as they do not all need to hit the same nodes.
However, reading directly from the storage nodes comes with a latency penalty when you need consistency in your log. A reader should only be able to read an entry if the write of that entry has been acknowledged to the writer. In the case of reading from the storage nodes, the entry should not be readable until the storage node is somehow notified that the entry has been replicated to enough nodes to merit acknowledging the write to the writer.
Pulsar's approach, in which clients connect via the broker for both reads and writes, has benefits in latency and in functionality. Because the same node is handling reads and writes, the entry is readable at the same moment that the entry acknowledgement is sent to the writer. By controlling reads and writes at the broker, Pulsar can also support more complex subscription models, such as shared subscriptions and failover subscriptions.
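For example, with the Pulsar Java client the subscription model is chosen when a consumer subscribes (the topic and subscription names below are placeholders):

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SubscriptionExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Failover subscription: one consumer is active at a time;
        // another attached consumer takes over if the active one fails.
        Consumer<byte[]> failover = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("failover-sub")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Shared subscription: messages are spread across all consumers
        // attached to the subscription.
        Consumer<byte[]> shared = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("shared-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        Message<byte[]> msg = failover.receive();
        failover.acknowledge(msg);

        failover.close();
        shared.close();
        client.close();
    }
}
```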
LogDevice and Pulsar use similar techniques to achieve total order atomic broadcast (TOAB). The log is split into epochs, one node (the leader) gets to decide the sequence numbers for the entries in that epoch, and there is a mechanism to prevent the previous epochs from being written to.
In both LogDevice and Pulsar, ZooKeeper is used to decide who is the leader.
In LogDevice the leader is called the sequencer. Each sequencer for a log is assigned an ‘epoch’ number (from ZooKeeper), which is higher than the epoch number of all previous sequencers for that log. The sequencer decides the LSN for each entry, which is composed of the epoch plus a local monotonically increasing component, and forwards the entries to a set of storage nodes. It acknowledges the write of the entry to the client once enough storage nodes have acknowledged the entry. In the case of sequencer failure, a new sequencer gets a new epoch and can start serving writes immediately. An operation is kicked off in the background to ‘seal’ the previous epoch. Reads from the new epoch are blocked until the previous epoch is sealed. The ‘seal’ operation involves informing enough storage nodes from the nodeset that there is a new epoch, such that a write operation would not be able to get enough acknowledgments to acknowledge a write to the client.
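As an illustration only (the field widths here are our assumption, not LogDevice’s actual layout), an LSN of this shape can be packed into a single 64-bit value with the epoch in the high bits and the per-epoch counter in the low bits:

```java
// Illustrative sketch of an epoch-prefixed sequence number.
// Because the epoch occupies the high bits, any LSN from a later
// epoch compares greater than every LSN from an earlier epoch.
final class Lsn {
    static long make(int epoch, long offsetInEpoch) {
        return ((long) epoch << 32) | (offsetInEpoch & 0xFFFFFFFFL);
    }

    static int epoch(long lsn) {
        return (int) (lsn >>> 32);
    }

    static long offset(long lsn) {
        return lsn & 0xFFFFFFFFL;
    }
}
```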
For Pulsar, the epoch is a BookKeeper ledger. Each topic has a list of BookKeeper ledgers which constitute the whole log of the topic. When a Pulsar broker crashes, another broker has to take over the topic, so they ensure the previous owner’s ledger is sealed, create their own ledger, and add it to the list of ledgers for the topic. These last three operations involve ZooKeeper. Once the list of ledgers for the topic is updated, the broker can start serving reads and writes on the topic. All writes to the topic are persisted to the bookkeeper ledger, which is stored on a set of storage nodes, before being acknowledged to the writer and visible to readers.
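Under the hood, this takeover relies on BookKeeper’s fencing semantics. The sketch below is illustrative only (the ZooKeeper address, ledger ID, and quorum sizes are placeholders, and the broker’s management of the topic’s ledger list in ZooKeeper is omitted): opening the previous ledger with recovery seals it against its old writer, after which a fresh ledger can be created for new writes.

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class TopicTakeoverSketch {
    public static void main(String[] args) throws Exception {
        byte[] password = "topic-password".getBytes();
        BookKeeper bk = new BookKeeper("zk1:2181");

        // Opening the previous owner's ledger fences it and recovers the
        // last acknowledged entry, so the old writer can no longer add to it.
        long previousLedgerId = 42L; // placeholder
        LedgerHandle sealed = bk.openLedger(previousLedgerId,
                                            DigestType.CRC32, password);
        System.out.println("Sealed ledger " + previousLedgerId
                + " at entry " + sealed.getLastAddConfirmed());

        // Create a new ledger (ensemble 3, write quorum 2, ack quorum 2)
        // for the new owner; the broker would append its ID to the topic's
        // ledger list in ZooKeeper before serving reads and writes.
        LedgerHandle current = bk.createLedger(3, 2, 2,
                                               DigestType.CRC32, password);
        current.addEntry("next message".getBytes());

        current.close();
        sealed.close();
        bk.close();
    }
}
```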
Both LogDevice and Pulsar (via BookKeeper) only require that an entry hit a subset of nodes for it to be considered persisted, and so can maintain low latency on writes in the presence of a slow or failed storage node.
On leader failure, LogDevice is available for writes very soon after the failure is detected, only requiring a couple of round-trips to ZooKeeper to select a new sequencer. Pulsar recovers the previous ledger before becoming available for writes again. This recovery involves talking to some of the storage nodes and another write to ZooKeeper. In Pulsar reading becomes available at the same time as writes, whereas LogDevice has to run a ‘seal’ operation, similar to ledger recovery, before reads can become available.
We suspect that LogDevice’s decision to allow writes before the previous epoch is sealed stems from the fact that reads are uncoordinated, rather than from performance concerns. Detecting that the sequencer has failed is going to dominate the recovery time whether you seal the previous epoch or not. Sealing before allowing writes would require that the sequencer coordinate with the readers, which would incur complexity for little benefit. In Pulsar, since it is the broker serving reads, recovering the previous ledger before allowing writes is very straightforward.
LogDevice storage nodes store entries in RocksDB. Entries are stored in a collection of time-ordered column families, keyed by a combination of the log ID and the LSN of the entry. In simpler terms, each storage node has many time-ordered RocksDB instances, of which only the latest is ever written to. LogDevice tries to ensure that compaction rarely happens, to avoid write amplification.
Pulsar storage nodes, i.e. BookKeeper bookies, have a journal, an entry log and an index. The journal has its own dedicated disk. When an entry is written to the bookie, it is written to the journal disk and acknowledged to the writer. It’s then put in a staging area for the entry log. When enough entries have entered this staging area, they are sorted by ledger ID and entry ID, flushed to the entry log, and the location of each entry in the entry log is written to the index, which, in the configuration recommended by Streamlio, is a RocksDB instance.
The designs of the storage layers of both LogDevice and Pulsar optimize for low-latency writes with many concurrently active logs. By interleaving the entries of many logs in a small number of files, random writes are minimized. This has a larger impact on rotational disks, where writing to many files means the disk head has to physically move many times, but even on solid state disks preferring sequential over random writes has a hugely beneficial impact on performance.
However, there’s no such thing as a free lunch. Interleaving writes means that reads need to do more work.
Reads in a logging system generally fall into two categories: tailing reads and catch-up reads. For tailing reads, neither LogDevice nor Pulsar is likely to ever hit disk, as the required data should still be in a memory cache at some level. Catch-up reads, however, will end up hitting disk. In general, throughput is more important than latency for catch-up reads, and both systems design for this.
LogDevice needs to read many SST files to perform a catch-up read, though most of the read should be sequential, as RocksDB sorts entries by key before flushing to disk. It is unclear whether reads and writes are served from the same disk. If so, catch-up reads could have an impact on the write performance of the system.
RocksDB does allow multiple paths to be configured, with older SST files stored apart from newer ones, so it wouldn’t surprise us if Facebook were doing this in production.
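For reference, here is what configuring multiple paths looks like with the RocksDB Java API; the paths and target sizes are illustrative only, and we have no knowledge of what Facebook actually configures:

```java
import java.nio.file.Paths;
import java.util.Arrays;
import org.rocksdb.DbPath;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class MultiPathExample {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        Options options = new Options().setCreateIfMissing(true);
        // Newer SST files fill the first path up to its target size;
        // older data spills over to the second, larger path.
        options.setDbPaths(Arrays.asList(
                new DbPath(Paths.get("/mnt/fast-ssd/db"),
                           100L * 1024 * 1024 * 1024),   // ~100 GB on fast disk
                new DbPath(Paths.get("/mnt/bulk-hdd/db"),
                           1024L * 1024 * 1024 * 1024))); // ~1 TB on bulk disk
        try (RocksDB db = RocksDB.open(options, "/mnt/fast-ssd/db")) {
            db.put("key".getBytes(), "value".getBytes());
        }
        options.close();
    }
}
```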
As Pulsar keeps the critical path for writes on a separate disk, read operations are entirely independent. They are also mostly sequential, as data in the entry log is sorted by ledger and entry ID before it is flushed to disk.
LogDevice tries to avoid compaction, and thus write amplification. This makes sense for a logging system, as most data that is written will never need to be read. However, it has a side effect on data retention. Individual logs cannot be deleted, so all logs in the system must be governed by retention time. You cannot have some logs which live forever and some which only live for a few hours within the same cluster.
In Pulsar, deleting a log from a storage node means first deleting it from the index. The index does therefore compact every so often, but as the entry data itself is not in the index, this is not a problem. The storage node monitors what percentage of each entry log is referenced by the index. Once an entry log goes below a certain threshold, the live data is copied to a new “compacted” entry log, the index is updated, and the original entry log is deleted.
LogDevice is an interesting addition to the distributed logging space. Although it is not directly comparable to Pulsar because Pulsar is designed to be a complete messaging platform rather than solely a distributed log, it’s gratifying to see that the LogDevice team made a lot of similar architectural decisions to those made for Pulsar. Now that LogDevice is available as open source, we look forward to seeing how it is put to use.
----------------------------------------------------
Thanks!
Ivan Kelly