Skip to content

C++ Project that Parses UTXO data from the Bitcoin chainstate Database. Dumps the entire UTXO set to stdout or executes lookup of UTXO based on outpoint.

Notifications You must be signed in to change notification settings

csknk/parse-chainstate

Repository files navigation

Access Bitcoin UTXO Set With the Bitcoin Core chainstate Database

C++ project to access Bitcoin UTXO data by parsing the chainstate database.

Build

  • Clone this project
  • Run make in project root to build

Usage

./bin/main <option(s)> <SOURCE DATABASE>

Options:

  • -h,--help: Show this help message
  • -m,--mode <mode>: dump_all to dump all UTXOs, single for a single txid. Default is single.
  • -t, --txid <txid>: Hexstring representation of UTXO to lookup. If mode is t and no txid is provided, user will be prompted to enter one.
  • -o <vout>: Output index of the required outpoint

Examples:

  • Run ./bin/main <path to chainstate database> to lookup an individual UTXO: programme prompts for txid and vout on stdin.
  • Run ./bin/main -m dump_all <path to chainstate database> to dump all UTXOs into CSV format on stdout (this will be a lengthy process)
  • As above, redirect output to a file: ./bin/main -m dump_all <path to chainstate database> > path/results.csv

The chainstate database should be a copy, not currently being accessed by Bitcoin Core.

UTXOs

The Unspent Transaction Output (UTXO) set is a subset of Bitcoin transaction outputs that are not yet spent.

For a new transaction to be valid, it must have access to UTXOs that can be used as inputs - the creator of the transaction must be able to meet the spending conditions of the transaction input UTXOs. Transactions consume UTXOs as inputs and create new UTXOs as outputs - with spending conditions locked such that the intended recipient can unlock the new UTXOs.

The UTXO set contains all unspent outputs - it therefore contains all the data necessary to validate new transactions.

UTXOs consist of two parts:

  1. The amount transferred to the output
  2. The locking script scriptPubKey that specifies the spending conditions for the output

Outpoints

Each transaction (with the exception of coinbase transactions) spends a UTXO from a previous transaction. In the context of a transaction, UTXOs are referenced by "outpoints".

A single transaction can have multiple outputs - so each UTXO outpoint consists of their transaction ID (txid) and output index number (often referred to as vout).

Together, the txid and output index are known as the UTXO outpoint.

Local Database: chainstate

Bitcoin Core full nodes store UTXO data in the chainstate LevelDB database. Data is stored in a per-output model - each entry in the chainstate database represents a single UTXO.

UTXO data is stored in this way so that transactions can be validated and new transactions created without the necessity of checking the entire blockchain.

At the time of writing, the chainstate database is approximately 4GB in size.

Data Storage Format

LevelDB is a simple on-disk key-value store. Keys and values are stored as strings in arbitrary byte arrays, sorted by key.

Indexing is not supported.

Though LevelDB keys & values are strings, they are not C-style null-terminated strings - this is because LevelDB keys & values may contain null bytes.

Chainstate Keys

Keys in the chainstate database consist of a little-endian representation of the txid prepended with the single byte 0x43 ('C') and appended with a Varint (see following section) representation of the vout:

0x43<txid, little endian><vout>

Chainstate Values

Values in the chainstate database contain the following data:

  • Block height
  • Whether or not the UTXO is a coinbase transaction
  • Amount (in Satoshis)
  • nSize - an indication of the type/size of locking script
  • Locking script - hash 160 for P2PKH & P2SH, public key for P2PK, otherwise full script

Value Obfuscation

Values in the chainstate database are obfuscated - this was a change added to the database in order to prevent false positives triggered in Windows anti-virus software.

Obfuscation is a simple XOR operation against a repeated 8-byte obfuscation key. The obfuscation key is a random value unique to each node, with the obfuscation key stored in the chainstate database itself, under the key 0x0e00 concatenated with the raw bytes of the string "obfuscate_key", i.e. 0e006f62667573636174655f6b6579 (see here).

This means that values must be XORed against the obfuscation key after they have been retrieved.

TODO: If this tool is expanded to provide analytic data on UTXOs - possibly by building a relational database from UTXO data - consider building a custom version of Bitcoin Core without obfuscation and a rebuild chainstate database. If dumping all UTXOs, this will likely provide a performance boost. This might be achieved by removing the following code and re-building the chainstate database - see this note:

    // The base-case obfuscation key, which is a noop.
    obfuscate_key = std::vector<unsigned char>(OBFUSCATE_KEY_NUM_BYTES, '\000');

    bool key_exists = Read(OBFUSCATE_KEY_KEY, obfuscate_key);

    if (!key_exists && obfuscate && IsEmpty()) {
        // Initialize non-degenerate obfuscation if it won't upset
        // existing, non-obfuscated data.
        std::vector<unsigned char> new_key = CreateObfuscateKey();

        // Write `new_key` so we don't obfuscate the key with itself
        Write(OBFUSCATE_KEY_KEY, new_key);
        obfuscate_key = new_key;

        LogPrintf("Wrote new obfuscate key for %s: %s\n", path.string(), HexStr(obfuscate_key));
    }

    LogPrintf("Using obfuscation key for %s: %s\n", path.string(), HexStr(obfuscate_key));

See here.

Once the value has been de-obfuscated, data is held in the following format:

<Varint block height><Varint amount><nSize><locking script>

Varints in the context of the chainstate database are described below.

Variable Length Integers: Varints

Bitcoin uses Varints to transmit and store values where the minimum number of bytes required to store a value is not known.

For example, a block height that is less than or equal to 255 could be stored in a single byte (as a uint8_t or unsigned char data type) whereas the block height 649392 would require a minimum of three (unsigned) bytes:

Value Minimum Byte Representation
649392 09 E8 B0
(9 * 256²) + (232 * 256) + 176 = 649392

To efficiently allow for such variability, Bitcoin uses a system of variable-length integers such that a minimal amount of space is used to store integers, whilst allowing for integers to be as large or as small as necessary.

Varints serialize integers into one or more bytes, with smaller numbers requiring fewer bytes to be encoded.

Varints vs compactSize Integers

Bitcoin has multiple methods for encoding variable length integers, with different methods used in different parts of the codebase.

The raw transaction format and peer-to-peer network messages within Bitcoin use a type of variable length integer encoding known as "compactSize". This involves prepending integers with a byte that indicates integer length for numbers greater than 252.

Used in the transaction format, compactSize integers format part of the Bitcoin consensus rules.

This document and repo is concerned with Varints as this is the method which Bitcoin core uses to serialize data to disk in the LevelDB database.

Varints in the LevelDB chainstate Database

In the context of storing data in the levelDB chainstate database (which stores UTXO data), integers are stored as base 128 encoded numbers.

In this system, the last 7 bits in each byte are used to represent a digit, and the position of the byte represents the power of 128 to be multiplied.

This leaves the most significant bit (MSB) of each byte available to carry information regarding whether or not the integer is complete.

If the MSB of a byte is set, the next digit (byte) should be read as part of the integer. If the leading digit is not set, the byte represents the final digit in the encoded base 128 integer.

To ensure that each integer has a unique representation in the encoding system, 1 is subtracted from all bytes except for the byte representing the last digit.

Decimal Number Hexadecimal Representation Binary
128 0x80 0x00 1000 0000 0000 0000
256 0x81 0x00 1000 0001 0000 0000
65535 0x82 0xFE 0x7F 1000 0010 1111 1110 0111 1111

This system is compact:

  • Integers 0-127 are represented by 1 byte
  • 128-16511 require 2 bytes
  • 16512-2113663 require 3 bytes.

Each integer has a unique encoding, and the encoding is infinite in capacity - integers of any size can be represented.

Worked Manual Example

This example takes a value from the chainstate database of UTXOs from Bitcoin Core and decodes the value to provide:

  • Block height (First Varint, excluding least significant bit)
  • Coinbase status (Last bit of first Varint)
  • Amount (Second Varint)
  • Script type (Third Varint)
  • Unique script value (Remainder of the value)

This worked example is drawn from the README this GitHub repo for a Bitcoin chainstate parser in Ruby.

Start value: c0842680ed5900a38f35518de4487c108e3810e6794fb68b189d8b

First Varint: Block Height

Byte₀ Byte₁ Byte₂
Start, hexadecimal 0xC0 0x84 0x26
Start, binary 1100 0000 1000 0100 0010 0110
Last 7 bits of each byte 100 0000 000 0100 010 0110
Add 1 to each byte except last 100 0001 000 0101 010 0110

tmp array:

0x41 0x05 0x26
0100 0001 0000 0101 0010 0110

Remove last zero - flag showing coinbase status

Byte₀ Byte₁ Byte₂
Concatenate consecutive bits to get value 0000 1000 0010 0001 0101 0011
Result, hexadecimal 0x08 0x21 0x53
Result, decimal 8 33 83

In decimal: (8 * 256²) + (33 * 256) + 83 = 532819

Second Varint: Amount

Byte₀ Byte₁ Byte₂
0x80 0xED 0x59
1000 0000 1110 1101 0101 1001
Last 7 bits 000 0000 110 1101 101 1001
Add 1 to each byte except last 000 0001 110 1110 101 1001
Concatenate 0000 0000 0111 0111 0101 1001
Result, hexadecimal 0x00 0x77 0x59
Result, decimal 0 119 89

In decimal: (0 . 256²) + (119 * 256) + 89 = 30553

Amount Compression

To further save on space Bitcoin Core compresses numbers in the amount field of the UTXO. For this project I've used the Bitcoin Core DecompressAmount function: see here.

References

About

C++ Project that Parses UTXO data from the Bitcoin chainstate Database. Dumps the entire UTXO set to stdout or executes lookup of UTXO based on outpoint.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published