Write Batch With Index
The WBWI (Write Batch With Index) encapsulates a WriteBatch and an Index into that WriteBatch. The index in use is a Skip List. The purpose of the WBWI is to sit above the DB and offer the same basic operations as the DB: writes (Put, Delete, and Merge) and reads (Get and NewIterator).
Write operations on the WBWI are serialized into the WriteBatch (of the WBWI) rather than acting directly on the DB. The WriteBatch can later be written atomically to the DB by calling db.write(wbwi).
Read operations can either be solely against the WriteBatch (e.g. GetFromBatch), or they can be read-through operations. A read-through operation (e.g. GetFromBatchAndDB) first tries to read from the WriteBatch; if there is no updated entry for the key in the WriteBatch, it then reads from the DB.
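The write buffering and read-through behaviour described above can be sketched with a toy model. This is a minimal illustration, not RocksDB code: `ToyDB`, `ToyBatch`, `ToyPending`, `ToyLookup`, `ToyGetFromBatch`, `ToyGetFromBatchAndDB`, and `ToyWrite` are hypothetical names, the "database" is a plain `std::map`, and the batch keeps only the latest pending operation per key.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for the database (assumption: an ordered in-memory map).
typedef std::map<std::string, std::string> ToyDB;

// Latest pending operation for a key: either a Put of `value` or a Delete.
struct ToyPending {
  bool is_delete;
  std::string value;
};

// Toy stand-in for the WBWI: writes are buffered here, not applied to the DB.
struct ToyBatch {
  std::map<std::string, ToyPending> pending;

  void Put(const std::string& k, const std::string& v) {
    pending[k] = ToyPending{false, v};
  }
  void Delete(const std::string& k) { pending[k] = ToyPending{true, ""}; }
};

enum class ToyLookup { kFound, kDeleted, kNotFound };

// Batch-only read: consults the buffered writes and nothing else.
ToyLookup ToyGetFromBatch(const ToyBatch& b, const std::string& k,
                          std::string* value) {
  assert(value != nullptr);
  std::map<std::string, ToyPending>::const_iterator it = b.pending.find(k);
  if (it == b.pending.end()) return ToyLookup::kNotFound;
  if (it->second.is_delete) return ToyLookup::kDeleted;
  *value = it->second.value;
  return ToyLookup::kFound;
}

// Read-through: the batch wins if it has an entry; otherwise fall back to
// the database.
ToyLookup ToyGetFromBatchAndDB(const ToyBatch& b, const ToyDB& db,
                               const std::string& k, std::string* value) {
  ToyLookup r = ToyGetFromBatch(b, k, value);
  if (r != ToyLookup::kNotFound) return r;
  ToyDB::const_iterator it = db.find(k);
  if (it == db.end()) return ToyLookup::kNotFound;
  *value = it->second;
  return ToyLookup::kFound;
}

// Applies every buffered operation to the DB in one step, then clears the
// batch. In this single-threaded toy, "atomic" just means "all at once".
void ToyWrite(ToyDB& db, ToyBatch& b) {
  for (const auto& kv : b.pending) {
    if (kv.second.is_delete) {
      db.erase(kv.first);
    } else {
      db[kv.first] = kv.second.value;
    }
  }
  b.pending.clear();
}
```

Until `ToyWrite` is called, the toy database is unchanged, while read-through lookups already reflect the buffered Put and Delete operations.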
The WriteBatch itself is a thin wrapper around a std::string field called rep. The WriteBatch can be thought of as append-only: each new write operation simply appends to the WriteBatch rep; see https://github.com/facebook/rocksdb/blob/main/db/write_batch.cc#L10
The purpose of the Index within the WBWI is to track which keys have been written to the WriteBatch. The index is a mapping from the key to the offset of the serialized operation within the WriteBatch (i.e. the position within the rep string).
This index is used by read operations against the WriteBatchWithIndex to determine whether a write operation has previously been recorded in the WriteBatch for the key that is being read. If there is an entry in the Index, the offset into the WriteBatch is used to read the value; if not, the database can be consulted. The Index allows us to avoid having to scan the WriteBatch when performing Get or Seek (for iterator) operations.
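The relationship between the append-only rep string and the index can be illustrated with a small toy model. This is a sketch under stated assumptions, not the RocksDB record format: `ToyIndexedBatch` is a hypothetical type, records use a made-up encoding (4-byte little-endian length prefixes), only Put is modelled, and a `std::map` stands in for the Skip List.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

struct ToyIndexedBatch {
  std::string rep;                           // append-only serialized records
  std::map<std::string, std::size_t> index;  // key -> offset of latest record

  // Appends a 32-bit length as 4 little-endian bytes (toy encoding).
  static void AppendLen(std::string* dst, std::uint32_t n) {
    for (int i = 0; i < 4; ++i) {
      dst->push_back(static_cast<char>((n >> (8 * i)) & 0xff));
    }
  }

  // Decodes a 4-byte little-endian length starting at `pos`.
  static std::uint32_t ReadLen(const std::string& src, std::size_t pos) {
    assert(pos + 4 <= src.size());
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i) {
      n |= static_cast<std::uint32_t>(
               static_cast<unsigned char>(src[pos + i]))
           << (8 * i);
    }
    return n;
  }

  // Appends [klen][key][vlen][value] to rep and records where it starts.
  // Older records for the same key stay in rep; only the index entry moves.
  void Put(const std::string& key, const std::string& value) {
    const std::size_t offset = rep.size();
    AppendLen(&rep, static_cast<std::uint32_t>(key.size()));
    rep += key;
    AppendLen(&rep, static_cast<std::uint32_t>(value.size()));
    rep += value;
    index[key] = offset;
  }

  // Jumps straight to the indexed offset and decodes one record; rep is
  // never scanned from the beginning.
  bool Get(const std::string& key, std::string* value) const {
    assert(value != nullptr);
    std::map<std::string, std::size_t>::const_iterator it = index.find(key);
    if (it == index.end()) return false;  // caller may consult the DB instead
    std::size_t pos = it->second;
    const std::uint32_t klen = ReadLen(rep, pos);
    pos += 4 + klen;  // skip the key-length prefix and the key bytes
    const std::uint32_t vlen = ReadLen(rep, pos);
    pos += 4;
    value->assign(rep, pos, vlen);
    return true;
  }
};
```

A `false` return from `Get` corresponds to the case in which no index entry exists and the database would be consulted.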
The WBWI can be used as a component if one wishes to build Transaction Semantics atop RocksDB. The WBWI by itself isolates the Write Path to a local in-memory store and allows you to RYOW (Read-Your-Own-Writes) before data is atomically written to the database.
It is a key component in RocksDB's Pessimistic and Optimistic Transaction utility classes.
The WBWI has two modes of operation: 1) where all write operations to the same key can be retrieved by iteration, and 2) where only the latest write operation for a key can be retrieved by iteration. Regardless of the mode, all write operations themselves are preserved in the append-only WriteBatch; the mode only controls what is retrievable from the WriteBatch when reading your own writes.
The mode of operation is controlled by the WriteBatchWithIndex constructor argument named overwrite_key. The default mode of operation is to allow all write operations to be retrieved (i.e. overwrite_key=false).
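The difference between the two modes can be sketched with a toy model. `ToyModeBatch` is a hypothetical class, not the RocksDB implementation: the batch is a vector of operations rather than a serialized string, the index is a `std::map` from key to positions in that vector, and entries for the same key are assumed to be iterated in write order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class ToyModeBatch {
 public:
  typedef std::pair<std::string, std::string> Entry;

  explicit ToyModeBatch(bool overwrite_key = false)
      : overwrite_key_(overwrite_key) {}

  void Put(const std::string& key, const std::string& value) {
    // The operation is always appended, whatever the mode.
    ops_.push_back(Entry(key, value));
    std::vector<std::size_t>& slots = index_[key];
    // With overwrite_key=true the index keeps only the latest position.
    if (overwrite_key_) slots.clear();
    slots.push_back(ops_.size() - 1);
  }

  // Entries visible to iteration: sorted by key, then by write order.
  std::vector<Entry> Iterate() const {
    std::vector<Entry> out;
    std::map<std::string, std::vector<std::size_t> >::const_iterator it;
    for (it = index_.begin(); it != index_.end(); ++it) {
      for (std::size_t i = 0; i < it->second.size(); ++i) {
        assert(it->second[i] < ops_.size());
        out.push_back(ops_[it->second[i]]);
      }
    }
    return out;
  }

  // Number of operations preserved in the append-only batch.
  std::size_t OpCount() const { return ops_.size(); }

 private:
  bool overwrite_key_;
  std::vector<Entry> ops_;
  std::map<std::string, std::vector<std::size_t> > index_;
};
```

In this sketch, two Puts to the same key yield two iterated entries when `overwrite_key` is false and one when it is true, while `OpCount()` is the same in both cases.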
Not all WBWI functions are supported in both modes, see the table below:
WBWI Function | overwrite_key=false | overwrite_key=true |
---|---|---|
Put | Yes | Yes |
Delete | Yes | Yes |
DeleteRange | No | No |
Merge | Yes | Yes |
GetFromBatch | Yes | Yes |
GetFromBatchAndDB | Yes | Yes |
NewIterator | One-or-more entries per Key in the batch | One entry per Key in the batch |
NewIteratorWithBase | Yes | Yes |
For GetFromBatch and GetFromBatchAndDB, if the transaction contains a Merge, they return Status::MergeInProgress if no base value is found, and return the merged result if a base value is found.
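The Merge rule above can be sketched for the batch-only case with a toy model. Everything here is hypothetical and for illustration only: `ToyStatus`, `ToyOp`, and `ToyMergeLookup` are invented names, the batch is a plain vector of operations, and the merge operator is assumed to append operands with a comma.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class ToyStatus { kOk, kNotFound, kMergeInProgress };

struct ToyOp {
  enum Type { kPut, kMerge };
  Type type;
  std::string key;
  std::string operand;
};

// Walks the batch in write order for one key. A Put supplies a base value;
// a Merge is folded into the base if one exists, otherwise it stays
// unresolved and the lookup reports kMergeInProgress.
ToyStatus ToyMergeLookup(const std::vector<ToyOp>& ops,
                         const std::string& key, std::string* value) {
  assert(value != nullptr);
  bool saw_key = false;
  bool have_base = false;
  std::string result;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (ops[i].key != key) continue;
    saw_key = true;
    if (ops[i].type == ToyOp::kPut) {
      have_base = true;
      result = ops[i].operand;  // a Put replaces whatever came before
    } else if (have_base) {
      result += "," + ops[i].operand;  // assumed merge operator: append
    }
  }
  if (!saw_key) return ToyStatus::kNotFound;
  if (!have_base) return ToyStatus::kMergeInProgress;
  *value = result;
  return ToyStatus::kOk;
}
```

A key with a Put followed by a Merge resolves to a merged value, whereas a key with only Merge operations has no base value in the batch and yields `kMergeInProgress`.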
TODO
TODO