laranevans.com
Topics / Databases / B-tree

A B-tree is the index structure underneath most relational databases. It keeps keys sorted in a shallow, high-fanout tree and updates pages in place, which keeps point lookups and ordered range scans fast and predictable. Bayer and McCreight described it in 1972, and it became the default index in PostgreSQL, MySQL's InnoDB, and SQLite. For a workload that reads and range-scans about as often as it writes, a B-tree-backed store is the conservative choice.

A B-tree keeps keys sorted in a shallow, high-fanout tree

Each node is a disk page that holds a run of keys in sorted order with child pointers between them. High fanout, a large number of keys per page, keeps the tree shallow, so a lookup from the root reaches any key in a few page reads. Bayer and McCreight's design bounds retrieval, insertion, and deletion to a height that grows with the logarithm of the index size, base k, where k is set by how many keys fit on a page. The variant databases use is the B+tree, which stores values only in the leaves and links the leaves in order, so a range scan walks the leaf chain end to end without climbing back into the interior.

Updates happen in place, and a write-ahead log makes them durable

A write finds the target leaf page and changes it where it sits. When a page fills, it splits in two and the split propagates upward. Because the change lands in the page's final location, the database guards it with the write-ahead log: it records the change in a sequential log and flushes that log before acknowledging the commit, so a crash partway through recovers to a consistent state.

This in-place design sets the B-tree's write cost. A single-row update rewrites the row's whole page and appends a log record, so the bytes written to storage run ahead of the bytes the application changed. Updates scattered across the key space become random page writes. Under heavy write volume, that write amplification and the random I/O turn into the limit.

Reads and range scans are the B-tree's strength

A point lookup descends from the root to one leaf, a few page reads whatever the table's size. A range scan seeks to the start key once and then follows the linked leaves in sorted order, returning rows without a second descent. The work to locate a row stays close to flat as the data grows, so read latency holds steady. That steadiness, with native ordered access, is why B-trees back secondary indexes and queries that sort, group, or range over keys.

B-trees back most transactional relational stores

PostgreSQL creates a B-tree by default and indexes any type with a defined linear order with one. MySQL's InnoDB stores every non-spatial index as a B-tree and clusters each table around the B-tree of its primary key, so the table is itself an index. SQLite stores its tables and indexes as B-trees rooted at the schema page. The pattern holds across traditional relational engines: the primary store is a B-tree, and secondary indexes are more B-trees pointing into it.

Choosing a B-tree-backed store

Reach for a B-tree engine when reads and range scans carry the workload, when queries lean on sorted or clustered access, and when you want latency that stays level as the data grows. A transactional application with secondary indexes, foreign keys, and reporting queries fits the B-tree's shape. The cost you accept is write amplification: every update rewrites a page, and a write-saturated, random-insert workload drives the structure toward its weakest case. When writes and ingest volume dominate instead, the LSM-tree trades read and space overhead to make writes sequential. The two structures trade off differently against the same three measures, read, write, and space amplification, and the LSM-tree page works the choice out from the write-optimized side.

Measure B-tree LSM-tree
Read amplification Low. A point lookup is a few page reads whatever the table's size, and a range scan walks the linked leaves Higher. The LSM-tree accepts read overhead as part of making writes sequential
Write amplification High. Every update rewrites the row's whole page, so bytes written run ahead of bytes changed Lower. Sequential writes are the structure's optimized case
Space amplification Low. Updates land in place, so the index holds one copy of each key Higher. The LSM-tree accepts space overhead as part of making writes sequential