Lessons From RocksDB in Large-scale Distributed System Applications

RocksDB is a high-performance persistent key-value storage engine created by Facebook in 2012, based on Google's LevelDB code. It is LSM-based, optimized for SSDs, and serves as the storage engine for multiple databases. It is used in stream processing, log and queue services, index services, and SSD caching.

Engine Architecture

MemTable and WAL

RocksDB uses an LSM tree (Log-Structured Merge-tree) as its primary storage data structure. Whenever data is written to RocksDB, it is added to the MemTable (an in-memory write buffer) and appended to a Write-Ahead Log (WAL) on disk. Because the MemTable is volatile, the WAL is what protects it: after a crash, the MemTable is rebuilt by replaying the log.

RocksDB offers three data structures for the MemTable: skiplist (the default), hash-skiplist, and hash-linklist. A skiplist keeps data ordered as it is inserted and supports binary search and range queries, with O(log n) insertion and lookup cost.

Once a MemTable reaches a configured size, it and its WAL become immutable, and new data is written to a fresh MemTable and WAL.
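
A minimal sketch of this write path using the public C++ API; the database path and the 64 MB buffer size are illustrative choices, not recommendations from this article:

```cpp
#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Flush a MemTable once it reaches 64 MB; sizes here are illustrative.
  options.write_buffer_size = 64 << 20;
  // Allow immutable MemTables to queue up for flushing.
  options.max_write_buffer_number = 3;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
  assert(s.ok());

  // Every Put is applied to the MemTable and appended to the WAL,
  // so an acknowledged write survives a process crash.
  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  assert(s.ok());
  delete db;
}
```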

SSTable

SSTable is a data structure used to flush MemTable to disk when it reaches a certain limit. These SSTables are placed in Level 0 (L0), and corresponding WAL space is recycled. When L0 reaches its size limit, L0 SSTables go through compaction and move to L1. This process continues for higher levels.

Binary search is used at each level, and a Bloom filter helps eliminate unnecessary lookups in SSTable files.
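
As a sketch, a Bloom filter is attached through the block-based table options; the 10 bits per key shown is a conventional illustrative setting (roughly a 1% false-positive rate), not a recommendation from this article:

```cpp
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  // ~10 bits per key lets most point lookups skip SSTables
  // that cannot contain the key.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10.0));

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```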

Compaction

Compaction serves two primary purposes: removing obsolete data and keeping data optimally sorted. Obsolete data is generated by updates, including overwrite and delete operations. Compaction reads and writes entire files in batches, which is highly efficient and can be parallelized.

Common compaction strategies include:

  • Leveled: As in LevelDB, the size of each level grows exponentially. This strategy incurs higher write amplification but lower space and read amplification.
  • Tiered/Universal: Used by Cassandra and HBase, this strategy leads to higher space and read amplification but minimal write amplification.
  • FIFO: Deletes the oldest files once a size limit is reached and performs only lightweight compaction, making it suitable for in-memory cache applications.
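
For concreteness, a sketch of how each strategy is selected through the engine's options; the 1 GB FIFO budget is an arbitrary example value:

```cpp
#include "rocksdb/options.h"

// Pick one style per column family; values shown are illustrative.
void ConfigureCompaction(rocksdb::Options& options, int choice) {
  switch (choice) {
    case 0:  // Leveled (the default): lower space amplification.
      options.compaction_style = rocksdb::kCompactionStyleLevel;
      break;
    case 1:  // Tiered, called "universal" in RocksDB: lower write amplification.
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      break;
    case 2:  // FIFO: drop the oldest files once a size budget is exceeded.
      options.compaction_style = rocksdb::kCompactionStyleFIFO;
      options.compaction_options_fifo.max_table_files_size = 1 << 30;  // 1 GB
      break;
  }
}
```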

It's worth noting that no compaction strategy simultaneously minimizes write amplification, read amplification, and space amplification. As in other areas of the computing world, databases have no silver bullets.

Resource Optimization

To address these trade-offs, many teams have optimized RocksDB's resource management in various ways. Here, we discuss two aspects of resource optimization: write amplification and space amplification.

Write Amplification

Write amplification arises from several factors:

  • Solid-state drives (SSDs) themselves introduce write amplification ranging from 1.1 to 3.
  • Writing small pages (4KB/8KB/16KB) with fewer than 100 bytes of changes can lead to a 100x write amplification.
  • Different compaction strategies have different write amplification ratios.

To mitigate write amplification, consider the following optimizations (the BlobDB option is sketched after the list):

  • Throttle compaction while write rates are high, accepting temporarily higher space usage.
  • Compact proactively while write rates are low.
  • Use BlobDB to separate keys from values when values are large.
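
A minimal sketch of the last suggestion, assuming a recent RocksDB release with the integrated BlobDB options; the 4 KB threshold is illustrative:

```cpp
#include "rocksdb/options.h"

// Integrated BlobDB: store large values in separate blob files so that
// compaction rewrites only keys and small values, cutting write amplification.
void EnableKeyValueSeparation(rocksdb::Options& options) {
  options.enable_blob_files = true;
  // Values at or above this size go to blob files; the threshold is illustrative.
  options.min_blob_size = 4096;
  // Reclaim space from blob files that are mostly garbage.
  options.enable_blob_garbage_collection = true;
}
```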

Space Amplification

When using SSDs, space utilization is usually more critical than write amplification, because in practice an SSD's capacity tends to be exhausted long before its write cycles, and write overhead is low. To optimize space amplification, consider dynamic leveled compaction, where each level's target size is adjusted based on the actual size of the last level, rather than having fixed sizes for each level, allowing more efficient space utilization.
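
In recent RocksDB versions this corresponds to a single option; a sketch:

```cpp
#include "rocksdb/options.h"

void OptimizeSpaceAmplification(rocksdb::Options& options) {
  // Size each level off the actual size of the last level instead of
  // fixed targets. With the default level size multiplier of 10, most
  // data lives in the bottom level, bounding space overhead to roughly 10%.
  options.level_compaction_dynamic_level_bytes = true;
}
```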

Experience in Large-scale Systems

RocksDB is used in building large-scale distributed systems for various requirements. Below, we discuss related experiences and improvement ideas.

Resource Management

Large-scale distributed systems divide data into shards distributed across multiple server nodes. Additionally, a single host may run multiple instances of RocksDB, requiring resource management at both global (per host) and local (per instance) levels.

Managed resources include:

  • Memory: Write buffer, Block cache
  • I/O bandwidth: Compaction
  • Threads: Compaction (preferably using thread pools)
  • Disk usage
  • File deletion frequency

The goal is twofold: ensure that no instance overuses any resource, and allow resource-usage priorities to be set among different instances. A sketch of shared, host-wide budgets follows.
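
The sketch below assumes the standard shared block cache, write buffer manager, and rate limiter facilities; every size, rate, and thread count here is an illustrative placeholder:

```cpp
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"
#include "rocksdb/table.h"
#include "rocksdb/write_buffer_manager.h"

// Apply host-wide budgets shared by every RocksDB instance on the machine.
void ApplySharedBudgets(rocksdb::Options& options) {
  // One block cache, MemTable budget, and I/O rate limiter for the host.
  static std::shared_ptr<rocksdb::Cache> block_cache =
      rocksdb::NewLRUCache(8ull << 30);  // 8 GB across all instances
  static auto write_buffer_manager =
      std::make_shared<rocksdb::WriteBufferManager>(2ull << 30);  // 2 GB
  static std::shared_ptr<rocksdb::RateLimiter> rate_limiter(
      rocksdb::NewGenericRateLimiter(200ll << 20));  // 200 MB/s background I/O

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = block_cache;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  options.write_buffer_manager = write_buffer_manager;  // caps total MemTables
  options.rate_limiter = rate_limiter;  // throttles flush/compaction bandwidth
  options.max_background_jobs = 4;      // thread pool for flush and compaction
}
```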

WAL

In traditional databases, synchronous WAL writes are preferred to ensure durability. In distributed scenarios, however, data has replicas and consistency checks are performed; if a protocol such as Paxos already synchronizes data between replicas, the WAL becomes unnecessary. Three WAL strategies can therefore be considered, as sketched below: synchronous write, buffered asynchronous write, and no write.
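
A sketch of how the three strategies map onto per-write options; the WriteWithWalStrategy helper and its integer switch are purely illustrative:

```cpp
#include "rocksdb/db.h"

// The three WAL strategies, expressed through per-write options.
void WriteWithWalStrategy(rocksdb::DB* db, int strategy) {
  rocksdb::WriteOptions wo;
  if (strategy == 0) {
    wo.sync = true;  // synchronous write: fsync the WAL before acknowledging
  } else if (strategy == 1) {
    // Defaults give a buffered asynchronous write: the WAL entry may sit in
    // OS buffers briefly, trading a small durability window for throughput.
  } else {
    wo.disableWAL = true;  // no write: rely on replication (e.g. Paxos) instead
  }
  db->Put(wo, "key", "value");
}
```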

Limiting File Deletion

Some file systems, such as XFS, are aware of SSDs: when files are deleted, the file system sends TRIM commands to the drive, which typically improves SSD performance and durability. However, TRIM can also trigger the SSD's internal garbage collection and increase I/O latency. File deletion rates therefore need to be limited so that compaction does not delete many files simultaneously, as in the sketch below.
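
A sketch using the engine's SstFileManager, which can throttle deletions; the 64 MB/s rate is an illustrative placeholder:

```cpp
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_manager.h"

// Cap how fast obsolete SSTables are unlinked, so that one large
// compaction does not flood the SSD with TRIMs.
void LimitFileDeletion(rocksdb::Options& options) {
  options.sst_file_manager.reset(rocksdb::NewSstFileManager(
      rocksdb::Env::Default(), /*info_log=*/nullptr,
      /*trash_dir=*/"", /*rate_bytes_per_sec=*/64 << 20));
}
```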

Data Format Compatibility

Large-scale distributed systems must provide online rolling upgrades and rollbacks. Therefore, data format design at the lowest level must ensure both forward and backward compatibility. Strategies include:

  • All versions must understand all previously written disk formats (backward compatibility).
  • Data written by future versions should also be understandable (forward compatibility).
  • Consider borrowing techniques from Protocol Buffers and Thrift, such as tagged fields that readers can skip when unrecognized (a sketch follows this list).
  • Design configuration file settings for forward and backward compatibility as well.
  • Perform version compatibility testing.
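
To make the idea concrete, here is a deliberately simplified, hypothetical tag-length-value decoder; the Record struct, tag numbers, and layout are invented for illustration and are not RocksDB's actual format:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical tag-length-value record layout in the spirit of Protocol
// Buffers/Thrift. Readers skip tags they do not recognize, so an older
// binary can read data written by a newer one.
struct Record {
  uint32_t known_field = 0;  // tag 1 in this made-up schema
};

bool Decode(const std::string& buf, Record* out) {
  size_t pos = 0;
  while (pos + 2 <= buf.size()) {
    uint8_t tag = static_cast<uint8_t>(buf[pos]);
    uint8_t len = static_cast<uint8_t>(buf[pos + 1]);
    pos += 2;
    if (pos + len > buf.size()) return false;  // truncated record
    if (tag == 1 && len == sizeof(out->known_field)) {
      std::memcpy(&out->known_field, buf.data() + pos, len);
    }
    // Any other tag is an unknown, newer field: skip it instead of failing.
    pos += len;
  }
  return true;
}
```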

Replication and Backup

RocksDB is a single-node storage engine, so replication and backup must be built on top of it. Two methods can support replication:

  • Logical replication: all key-value pairs are read from the source and written to the data replicas.
      ◦ Full data scans are supported while minimizing the impact on normal read and write operations.
      ◦ The query results of logical replication are not cached.
      ◦ Batch loading is supported on the target side, with appropriate optimizations.
  • Physical replication: SSTables and other files are copied directly.
      ◦ The interface for physical replication should be provided by the engine itself, rather than letting users or other applications manipulate data files directly.

Logical replication goes through the engine's normal read path, which makes it flexible but relatively slow when creating new replicas; physical replication, which copies the existing data files, provides a faster option for full replication. Both are sketched below.
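
A sketch of both paths, assuming standard engine facilities; the replica-shipping step is left as a comment and the snapshot directory is illustrative:

```cpp
#include <memory>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/utilities/checkpoint.h"

// Logical replication: stream key-value pairs through the normal read path,
// bypassing the block cache so a full scan does not evict hot data.
void LogicalScan(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.fill_cache = false;  // scan results are not cached
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // ship it->key() / it->value() to the new replica here
  }
}

// Physical replication: the Checkpoint API hard-links live SSTables into a
// consistent snapshot directory, an engine-provided file-level copy.
void PhysicalSnapshot(rocksdb::DB* db, const std::string& dir) {
  rocksdb::Checkpoint* cp = nullptr;
  if (rocksdb::Checkpoint::Create(db, &cp).ok()) {
    cp->CreateCheckpoint(dir);  // dir must not already exist
    delete cp;
  }
}
```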

The replication mechanism also relates to consistency. Backup likewise comes in logical and physical forms; the difference from replication is that applications typically need to retain multiple backups. Overall, the methods used for backup and replication are similar, but for usability the storage engine should be able to manage multiple backups itself, as in the sketch below.
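
A sketch of engine-managed backups; the directory and retention count are illustrative, and older RocksDB releases expose the same facility under BackupableDBOptions:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/utilities/backup_engine.h"

// Create an incremental backup and keep only a bounded history.
void BackupAndPrune(rocksdb::DB* db) {
  rocksdb::BackupEngine* be = nullptr;
  rocksdb::Status s = rocksdb::BackupEngine::Open(
      rocksdb::Env::Default(),
      rocksdb::BackupEngineOptions("/backups/mydb"), &be);
  if (!s.ok()) return;
  be->CreateNewBackup(db);  // incremental: only new SSTables are copied
  be->PurgeOldBackups(5);   // retain the five most recent backups
  delete be;
}
```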

Lessons from Failures

For performance reasons, users may not enable SSD end-to-end data protection features (DIF/DIX), so detection of storage-media damage must be done through block checksum verification in the storage engine. Additionally, data corruption may be introduced during replication, backup, and data recovery.

Data corruption should be detected as early as possible to reduce downtime and data loss. Each piece of data should be replicated on multiple hosts. When a checksum mismatch is detected, the damaged replica is disconnected, and a correct replica takes its place. At least one correct replica should be ensured. Multiple levels of verification include:

  • Block: Each SSTable block and WAL block has an appended checksum generated when the data is created. Validation occurs every time data is read to prevent corrupted data from being exposed to users.
  • File: Each SSTable file has its checksum, recorded in the SSTable's metadata.
  • Handoff: A checksum passed down with the data when writing to the file system and verified by the file system's own mechanisms. Not all file systems support this kind of verification, so plan for it in advance. (Block- and file-level checks are sketched below.)
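
A sketch of the block-level checks from the application's side; the key and the scrubbing policy are illustrative:

```cpp
#include <string>

#include "rocksdb/db.h"

// Block-level checks on every read, plus an explicit full-database sweep.
void VerifyData(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.verify_checksums = true;  // the default: validate each block as it is read

  std::string value;
  db->Get(ro, "key", &value);

  // Proactively re-verify every block of every SSTable, e.g. from a
  // background scrubber, to catch media corruption before a user read does.
  rocksdb::Status s = db->VerifyChecksum();
  // On a mismatch, disconnect this replica and rebuild from a correct one.
  (void)s;
}
```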

Conclusion

Through this article, we hope to provide better insights into RocksDB and how it can be applied in various large-scale distributed systems.

References

RocksDB Github Wiki: https://github.com/facebook/rocksdb/wiki

Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience: https://www.usenix.org/conference/fast21/presentation/dong

RocksDB Principles and Applications: https://zhuanlan.zhihu.com/p/409334218