LSM-tree — laranevans.com

An LSM-tree turns random writes into sequential ones. It buffers incoming writes in memory, flushes them to immutable files on disk, and merges those files in the background. O'Neil, Cheng, Gawlick, and O'Neil introduced it in 1996 as a structure for files taking a high rate of inserts and deletes. It became the engine under write-heavy stores like Apache Cassandra, RocksDB, and ScyllaDB. For a workload dominated by writes or high-volume ingest, an LSM-backed store absorbs the load that drives a B-tree toward its weakest case.

An LSM-tree buffers writes in memory and flushes them to immutable files

A write lands in two places: an in-memory table, the memtable, and a sequential write-ahead log for durability. When the memtable fills, the engine flushes it to disk as one immutable sorted file, an SSTable, in a single sequential write, then opens a fresh memtable. Cassandra's storage engine follows this shape: a memtable, a commit log, and immutable SSTables. Because every disk write appends sorted data and no file changes after it lands, the structure defers and batches changes in memory before cascading them to disk, replacing random page writes with sequential ones.

Compaction merges the files in the background, and it sets the cost

Immutable files pile up, so a background process called compaction merges them, keeps the newest value for each key, and discards deleted ones. The merge strategy decides the tradeoff. Leveled compaction holds each level near ten times the size of the one above it. ScyllaDB's measurements show levels of 10, then 100, then 1000 files, with each new file compacted against the roughly ten overlapping files one level down. That repeated rewriting is the LSM-tree's write amplification: ScyllaDB recorded 8.8 GB of new data producing about 111 GB written to disk, near 13 times, against roughly 5 times for size-tiered compaction on the same load. Leveled compaction spends that write cost to hold space amplification near 1.1 times. Size-tiered compaction lowers the write cost and lets space grow toward 8 times in the worst case. Compaction also competes with live traffic for disk and CPU, which surfaces as write stalls under sustained load.

Reads check memory and several files, and bloom filters keep that cheap

A read checks the memtable first, then the on-disk files from newest to oldest, because a key's current value sits in the most recent file that holds it. Touching several files per read is the LSM-tree's read amplification. A bloom filter per file cuts it down: the filter reports whether a file might hold a key or definitely does not, so the engine skips files that cannot hold the key and reads only the ones that might. RocksDB spends about 9.9 bits per key for a 1% false-positive rate, and offers a Ribbon filter that saves roughly 30% of that space for more CPU. Point reads stay cheap. A range scan costs more than on a B-tree, because the engine merges overlapping ranges across every level rather than walking one sorted leaf chain.

LSM-trees back write-heavy and high-ingest stores

The structure turns up wherever write and ingest volume dominate. Cassandra and ScyllaDB use it for high-write distributed tables. RocksDB embeds it as a write-optimized key-value library, and MyRocks puts RocksDB under MySQL, where the project reports about 10 times less write amplification and 2 times better compression than InnoDB. LevelDB, HBase, and Google's Bigtable build on the same memtable-and-SSTable pattern. The common factor is a workload that writes faster than an in-place structure absorbs.

Choosing an LSM-backed store

Reach for an LSM engine when writes or ingest dominate, when the data is append-heavy like event logs, metrics, or time series, and when sequential writes and compression matter for flash endurance. The MyRocks write-amplification and compression numbers point at why LSM-trees suit SSDs: fewer physical writes per logical write extend the drive's life and lift throughput. The cost you accept is read and space overhead and the operational work of tuning compaction. When reads, range scans, and steady latency carry the workload instead, the B-tree makes the opposite trade. Both structures answer the same read, write, and space tradeoff: an access method optimizes two of the three and pays on the third, and the LSM-tree spends read and space to win on writes.

The read, write, and space spine

Every access method optimizes two of read, write, and space and pays on the third. The LSM-tree spends read and space to win on writes, and the compaction strategy sets where on the read/write/space spine the engine lands.

Dimension	What it measures	LSM-tree position	Leveled compaction	Size-tiered compaction
Write	Physical writes per logical write (write amplification)	The structure's strength: random writes become sequential appends	About 13 times on ScyllaDB's load	About 5 times on the same load
Read	Files touched per read (read amplification)	The cost the structure accepts: a read checks the memtable, then files newest to oldest, with bloom filters cutting the misses	Tighter reads	Looser reads
Space	Bytes on disk per logical byte (space amplification)	The cost the structure accepts: stale and deleted keys linger until compaction	Near 1.1 times	Toward 8 times in the worst case