Step 1: Add Basic Fork-Support
This step involves adding the foundation to support forks: configuration (DataAccess, ChainConfiguration, etc.), loading of chains, and traversing blocks and transactions for both the main chain and forked chains.
The parser is not modified in this step, so that I can get familiar with the code without immediately touching the (more complex) parser code.
Even without parser changes, enabling fork support that allows iterating blocks and transactions should still be possible.
The intended procedure is as follows:
BTC and BCH are parsed individually to separate parser output directories.
The config file of the main chain (BTC) points to the data directory of the forked chain (BCH), and vice versa.
BlockSci is modified to load both chains and to support traversing blocks and transactions. The forked chain reuses the block/tx data from the parent chain.
This should work without changing the parser. As a consequence, some hash/script/address-related code won't work correctly yet, and following the transaction graph (getSpendingTx) won't work either, as all of that requires parser modifications.
In pseudocode, what should work is the following:

    # internally load and use the pre-fork data of BTC
    blockchain = blocksci.Blockchain("/path/to/bch")
    for block in blockchain:
        for tx in block:
            # do something with the transaction
            # data from >= fork block height is loaded from the /path/to/bch dir
            # data from <  fork block height is loaded from the /path/to/btc dir
[TODO] Chain and Fork Configuration
Every chain (whether it is a fork or a root/parent chain) has its own parser output directory and its own config file.
(Not all current directories, e.g. scripts/ and hashIndex/, will be used for all chains, though; see Step 2.)
When we spoke on the phone, I had the impression that you prefer the directories to be separated. Let me know if your opinion has changed.
Most relevant BlockSci classes have access to their chain's DataAccess object. Thus, it would make sense to add the fork configuration there.
Regarding the JSON config file, data can be added to chainConfig and its corresponding class ChainConfiguration, also accessible via DataAccess.
However, I have not yet come up with a clean solution for how and where to store the nested chain configs.
Assume a user has parsed BTC -> BCH -> BSV but just wants to load up BCH using new Blockchain("path/to/bch").
The intended result is a single Blockchain object, but the configs of (at least) all parent chains up to and including the root chain have to be loaded as well, in order to load the required parts of the parent chain data. Doing this results in a tree of chains (or chain configs).
I see various options, but have yet to figure out which is most appropriate. Any ideas?
For example, BCH's DataConfiguration object, held by the DataAccess object, could contain DataConfiguration.children as well as DataConfiguration.parent to reflect a tree of objects (a brief sketch follows below).
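A minimal sketch of that tree structure (the member names and the ownership via shared_ptr are my assumptions, not existing BlockSci code):

    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical shape of the chain-config tree discussed above.
    struct DataConfiguration {
        std::string chainDirectory;                                // parser output directory of this chain
        std::shared_ptr<DataConfiguration> parent;                 // nullptr for the root chain
        std::vector<std::shared_ptr<DataConfiguration>> children;  // direct forks of this chain
    };
    // Loading "path/to/bch" would at least populate bch.parent (the BTC config) and,
    // transitively, everything up to the root chain.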
Information that is required to load/parse a fork:
i. Block height at which the fork occurred
ii. Other needed information that can be derived from i. and existing blockchain data:
    Tx index of the first "new" (= not shared) transaction
    Input index of the first "new" transaction
Root (= Main/Parent) Chain
Added keys:
chainConfig.forkDataDirectories: Set the parser output directories that contain data of direct forks (see /storage/blocksci/bitcoin/config.json in the layout below).
Forked Chain
Added keys:
chainConfig.parentChainDirectory: Set the parser output directory of the parent chain.
chainConfig.forkBlockHeight: Set the block height at which the fork occurred.
chainConfig.forkDataDirectories: (Optionally) set the parser output directories of additional forks (see /storage/blocksci/bitcoin-cash/config.json in the layout below).
A sketch of how these keys could map onto ChainConfiguration follows the directory layouts below.
/storage/blocksci/bitcoin/
├── addressesDb # unchanged, as not considered in step 1
├── chain # chain data of BTC, in the current format of BlockSci
│ ├── block.dat
│ └── ...
├── config.json # contains "links" to the directories of all forks (forkDataDirectories)
│ # here: link to /storage/blocksci/bitcoin-cash, as shown below
├── hashIndex # unchanged, as not considered in step 1
├── mempool # unchanged, as not considered in step 1
├── parser # unchanged, as not considered in step 1
└── scripts # unchanged, as not considered in step 1
/storage/blocksci/bitcoin-cash/
├── addressesDb # unchanged, as not considered in step 1
├── chain # chain data of BCH, in the current format of BlockSci
│ └── ...
├── config.json # contains a "link" to the parent chain directory (parentChainDirectory)
│ # here: /storage/blocksci/bitcoin/
└── ...
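As a rough sketch (not a final API; member names are assumptions taken from the keys above), the new config.json keys could be mirrored in ChainConfiguration like this:

    #include <string>
    #include <vector>

    // Hypothetical step-1 additions to ChainConfiguration, filled from config.json.
    struct ChainConfiguration {
        std::string dataDirectory;                     // parser output directory of this chain

        // New keys:
        std::vector<std::string> forkDataDirectories;  // directories of direct forks (root chain and forks)
        std::string parentChainDirectory;              // parent chain directory; empty for the root chain
        int forkBlockHeight = -1;                      // height at which this chain forked; -1 for the root
    };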
Step 2: Adapt Parser to Support Forks
Step 2 involves adapting the parser to parse forks correctly.
The proposed process starts by parsing a user-specified root chain, e.g. BTC. After that, forks of BTC can be added to the configuration, either manually or via a command-line option. Parser update calls should then parse all configured chains, including the root (BTC) chain. Added forks can be parsed "on top" of the existing data. Already existing data of other chains (e.g. scripts) is reused; new data is added.
Forks have their own config file and parser output directory, but some of the parsed output data (e.g. the HashIndex) is only stored in the directory of the root chain. The corresponding directories of the fork may remain empty.
chain/ Directory (ChainAccess)
All files can contain data for the entire individual chain
Fields like Inout.toAddressNum are adjusted to match the data of the fork
Optimization option: Only store new (= after fork) blocks/transactions etc. on disk. Not a priority.
A fork uses the parent's files for data up to the fork block height, as already outlined in the first proposal.
The idea is: if blockHgt < forkBlockHgt: parent.block(blockHgt) else: fork.block(blockHgt)
The approach is similar for most ChainAccess methods
Open question: Is it better to change the current ChainAccess itself, or to create a derived class like ChainAccessFork? I guess the decision will (also) depend a lot on performance, as having a ChainAccess* pointer in DataAccess that can point to both ChainAccess and ChainAccessFork may introduce expensive indirection. For now, in the committed code I edited ChainAccess itself.
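To illustrate the dispatch (a sketch only; whether it lives in ChainAccess itself or in a derived class is exactly the open question above, and the method names here are assumptions):

    #include <cstdint>

    struct RawBlock;  // BlockSci's on-disk block record; details omitted in this sketch

    class ForkAwareChainAccess {
        const ForkAwareChainAccess *parent = nullptr;  // access to the parent chain's chain/ files; nullptr for the root
        uint32_t forkBlockHeight = 0;                  // first block height that is NOT shared with the parent

        const RawBlock *loadOwnBlock(uint32_t height) const;  // reads this chain's own block.dat (not shown)

    public:
        const RawBlock *getBlock(uint32_t height) const {
            if (parent != nullptr && height < forkBlockHeight) {
                return parent->getBlock(height);   // pre-fork data comes from the parent chain's files
            }
            return loadOwnBlock(height);           // post-fork data comes from this chain's files
        }
        // getTx(), getTxHash(), ... would dispatch the same way.
    };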
Issue: Output.getSpendingTx()
Problem:
Given that BTC's transactions and their inputs/outputs are stored in tx_data.dat, assume that the BCH instance should re-use this data (up to the fork block height).
Regarding Output.getSpendingTx(), re-using data from the parent chain works correctly for all outputs that are spent before the fork. However, for outputs that are unspent (UTXO) at the time of the fork, separate storage of the Inout.linkedTxNum for the forked chain is needed, as the output can be spent in both chains individually.
Proposed Solutions
The options below are listed in descending order of my preference. The additional storage requirements are calculated for a fork of BTC at the block height of the BCH fork.
1. Duplicate just the uint32_t Inout.linkedTxNum field for all outputs of the parent chain up to the fork block height (a sketch follows after this list).
   Add a new memory-mapped file that contains the Inout.linkedTxNum fields of all pre-fork transactions, e.g. chain/inoutLinkedTxNum.dat.
   The root chain can use the original Inout data in /rootchain/chain/tx_data.dat, and every fork has a separate inoutLinkedTxNum.dat file for each of its parent chains.
   Pros: less storage overhead than 3., namely 662,670,387 pre-fork outputs × 4 bytes = 2.7GB per chain *
   Cons: more complex to implement than 3. (and 2.), as the Inout data is "split", similar to how transaction data is stored in various locations and put together in TxData.
2. Duplicate only the affected Inout objects (i.e. outputs that are unspent at the fork height), or just their Inout.linkedTxNum, for the fork up to the fork block height. This can't be done with a memory-mapped file, as the data is scattered and can't be indexed by offset.
   An option would be to serialize the entire C++ object that holds this data, similar to the ChainIndex class with its std::unordered_map<blocksci::uint256, BlockType>.
   Another option is to save the data in RocksDB, but I guess that hurts performance (too much)?
3. Duplicate all Inouts for the fork up to the fork block height.
   E.g. a separate chain/inout.dat file, but without RawTransaction data.
   Pros: easy to implement, faster than 2.
   Cons: large storage and memory overhead and thus most likely not a good choice
   (662,670,387 × 16 bytes = 10.6GB *)
Do you have any ad-hoc thoughts or ideas on that issue and the proposed options?
* Calculated for the BCH fork block height of 478,558.
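For option 1, a sketch of the read path (indexing inoutLinkedTxNum.dat by the output's position in the parent chain's flat output ordering is my assumption, and all names are made up):

    #include <cstdint>

    // Sketch of option 1: the fork keeps its own copy of Inout.linkedTxNum for every pre-fork
    // output in a memory-mapped chain/inoutLinkedTxNum.dat (4 bytes per output).
    class ForkSpentIndex {
        const uint32_t *preForkLinkedTxNum = nullptr;  // mmap of the fork's inoutLinkedTxNum.dat
        uint64_t preForkOutputCount = 0;               // 662,670,387 for a fork at the BCH fork height

        uint32_t ownLinkedTxNum(uint64_t postForkOutputIndex) const;  // reads the fork's own tx_data.dat (not shown)

    public:
        // globalOutputIndex: position of the output in the combined (parent-then-fork) output ordering
        uint32_t getSpendingTxNum(uint64_t globalOutputIndex) const {
            if (globalOutputIndex < preForkOutputCount) {
                // Pre-fork output: it can be spent differently on each chain, so the fork consults
                // its own copy instead of the parent's tx_data.dat (0 presumably meaning "unspent here").
                return preForkLinkedTxNum[globalOutputIndex];
            }
            // Post-fork output: stored in the fork's own chain/ files as usual.
            return ownLinkedTxNum(globalOutputIndex - preForkOutputCount);
        }
    };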
scripts/ Directory (ScriptAccess)
The scripts in scripts/ are already de-duplicated in the current version of BlockSci.
A single scripts/ directory could be used for all related chains, stored in the directory of the root (top-most) chain, e.g. /storage/bitcoin/scripts.
Script data can be reused for all related chains
This is done by setting the Inout.toAddressNum to the correct scriptNum for all related chains during parsing
If a new script appears in any of the related chains, a new ID is assigned; otherwise the existing data is reused (see the sketch below).
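A toy sketch of that shared assignment, with a plain hash map standing in for the parser's actual script/address index (so the names here are placeholders, not the parser's API):

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Toy stand-in for the shared scripts/ index: one scriptNum space for all related chains.
    class SharedScriptIndex {
        std::unordered_map<std::string, uint32_t> scriptNumByScript;  // serialized script -> scriptNum
        uint32_t nextScriptNum = 1;                                   // 0 stays reserved for "no address"

    public:
        // Called while parsing outputs of any related chain (BTC, BCH, ...); the returned value
        // is what the parser writes into Inout.toAddressNum for that chain.
        uint32_t getOrAssignScriptNum(const std::string &serializedScript) {
            auto it = scriptNumByScript.find(serializedScript);
            if (it != scriptNumByScript.end()) {
                return it->second;            // already seen on some related chain: reuse the ID
            }
            uint32_t scriptNum = nextScriptNum++;
            scriptNumByScript.emplace(serializedScript, scriptNum);
            return scriptNum;                 // new script: assign a fresh ID shared by all chains
        }
    };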
Issue: txFirstSeen and txFirstSpent of ScriptDataBase
The uint32_t ScriptDataBase.txFirstSeen and uint32_t ScriptDataBase.txFirstSpent fields may be different for every fork.
I think this applies to uint32_t ScriptDataBase.typesSeen as well, right? I'm not sure what typesSeen is for. Probably which address types have been seen for the given public key?
Solution: Create separate file(s) for txFirstSeen and txFirstSpent for every chain and address type.
scripts/pubkey_script_firstSeenSpent.dat; or separate files for firstSeen and firstSpent
scripts/scripthash_script_firstSeenSpent.dat; or separate files for firstSeen and firstSpent
etc.
Do you prefer separate files for both fields, or combine them into one file per type?
Required additional storage: approx. +5GB needed for every assumed BTC fork at the current height (roughly: number of pubkey scripts (PubkeyData) × 4 bytes × 2 variables)
Storage requirement for other script types is minor
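A sketch of what a per-chain record could look like, whichever file split is chosen (the struct name is made up; the fields follow ScriptDataBase):

    #include <cstdint>

    // One fixed-size entry per scriptNum in e.g. scripts/pubkey_script_firstSeenSpent.dat,
    // stored once per chain so the values can differ between BTC and its forks.
    struct FirstSeenSpentRecord {
        uint32_t txFirstSeen;   // first tx on THIS chain in which the script appears
        uint32_t txFirstSpent;  // first tx on THIS chain that spends from the script
    };
    // 8 bytes per script and chain; for the pubkey scripts this corresponds to the ~5GB
    // per additional fork estimated above (4 bytes × 2 fields per script).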
hashIndex/ Directory (HashIndex)
All HashIndex data can be stored in the root (top-most) chain's data directory. Thus, there is just one HashIndex for all related chains. The forked chains either access the HashIndex object of the root chain or hold their own objects pointing to the same index.
The HashIndex currently contains / maps:
txHash -> txNum
Required changes: Add a chainId byte to the key as a suffix, such that the database becomes (txHash, chainId) -> txNum. This is needed to retrieve duplicate (replayed) transactions for all related chains. (Transaction.getDuplicates() and Transaction.hasDuplicates())
The chain ID is fixed for every chain, see include/blocksci/core/chain_ids.hpp, e.g. BTC=1, BTC_TEST=2, ... Is a 1-byte chain ID enough, or should I go with 2 bytes for future compatibility?
address hash -> scriptNum
Required changes: I think none, as address identifiers are equal across related chains and are represented by the same script and thus have the same scriptNum. E.g. the same public key can be used across all chains, stored as one scriptNum that is used for all related chains.
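A sketch of the proposed (txHash, chainId) key: with the chainId as a suffix, all replayed copies of a transaction share the same 32-byte prefix, so getDuplicates()/hasDuplicates() can be a prefix (range) scan over the index. The helper name and encoding are assumptions:

    #include <array>
    #include <cstdint>
    #include <cstring>

    // Proposed hashIndex key: 32-byte tx hash followed by a 1-byte chain ID.
    std::array<uint8_t, 33> makeTxHashKey(const std::array<uint8_t, 32> &txHash, uint8_t chainId) {
        std::array<uint8_t, 33> key{};
        std::memcpy(key.data(), txHash.data(), 32);  // shared prefix for all related chains
        key[32] = chainId;                           // e.g. BTC=1, BTC_TEST=2, ... (chain_ids.hpp)
        return key;                                  // the value stored under this key remains the txNum
    }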
addressesDb/ Directory (AddressIndex)
As with hashIndex/ and scripts/, I think one addressesDb/ directory can be used for all related chains.
The AddressIndex currently contains / maps:
Address -> [Outputs sent to this address]
Internally just the key (scriptNum, txNum, outputNumInTx), values are empty
Required changes: This seems tricky. The scriptNums are reused across all related chains, so for every scriptNum we need to distinguish for which chain we are adding an output. The chainId field could be added after scriptNum, resulting in the key (scriptNum, chainId, txNum, outputNumInTx); see the sketch below. However, that would essentially result in a lot of redundant data: when a new fork is added, the existing entries of the parent chain have to be duplicated with the chainId of the fork.
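For illustration, the key layout under discussion, packed as it might be laid out on disk (the field widths are assumptions):

    #include <cstdint>

    // Sketch of the discussed addressesDb key: placing chainId right after scriptNum keeps all
    // outputs sent to a given script grouped per chain; values stay empty as today.
    #pragma pack(push, 1)
    struct AddressOutputKey {
        uint32_t scriptNum;      // shared across all related chains
        uint8_t  chainId;        // chain on which this output exists
        uint32_t txNum;          // transaction containing the output (per-chain numbering)
        uint16_t outputNumInTx;  // index of the output within that transaction
    };
    #pragma pack(pop)
    // Downside noted above: adding a fork means re-inserting the parent's existing entries
    // with the fork's chainId, which duplicates a lot of data.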
Parser Output Files
How the parser output files are used, changed, or added in step 2. New files are marked as *NEW.
btc_parser_output/
├── addressesDb #
├── [ 59G] chain
│ ├── [ 46M] block.dat
│ ├── [ 26M] coinbases.dat
│ ├── [2.7G] firstInput.dat
│ ├── [2.7G] firstOutput.dat
│ ├── [1.7G] input_out_num.dat
│ ├── [3.4G] sequence.dat
│ ├── [ 33G] tx_data.dat
│ ├── [*NEW] inoutLinkedTxNum.dat
│ ├── [ 11G] tx_hashes.dat
│ ├── [2.7G] tx_index.dat
│ └── [1.3G] tx_version.dat
│
├── config.json # contains "links" to the directories of all forks, here just bch_parser_output/
├── hashIndex #
├── mempool #
├── parser #
└── [ 45G] scripts # files contain the scripts of _all_ related chains, not just of BTC
├── [1.5G] multisig_script_data.dat
├── [435M] multisig_script_index.dat
├── [ 52M] nonstandard_script_data.dat
├── [8.6M] nonstandard_script_index.dat
├── [385M] null_data_script_data.dat
├── [ 65M] null_data_script_index.dat
├── [ 39G] pubkey_script.dat
├── [*NEW] pubkey_firstSeenSpent.dat
├── [3.9G] scripthash_script.dat
└── [*NEW] scripthash_firstSeenSpent.dat
bch_parser_output/
├── addressesDb #
├── chain # chain data of BCH, all data remains as it currently is
│ ├── ... # all existing chain/ files, see above
│ └── [*NEW] inoutLinkedTxNum.dat # the Inout.linkedTxNum field for pre-fork in- and outputs
│
├── config.json # contains a link to the parent chain directory, here just btc_parser_output/
│
├── hashIndex # only the hashIndex of the root chain is used, so this directory is empty
└── scripts # only scripts/ of the root chain is used, so this directory is empty
An option to consider would be to also maintain separate hashIndex and addressesDb for every fork - as if the fork has been parsed individually without including related chains. That would allow loading only the forked chain without having the performance penalty of the fork extension.
Interface Changes
Proposed interface changes can be found on
ChainAccess
Block
uint8_t getChainId() const: 1-byte ID for every chain, e.g. BTC=1, BTC_TEST=2, LITECOIN=5, ...
This ID (byte) is used in the data structures, eg. in the hashIndex as a suffix to the tx hash
At the moment the IDs are stored in include/blocksci/core/chain_ids.hpp, I'm not sure if the current static-consts-inside-a-class approach is a good choice, but it's trivial to change it.
Transaction
std::vector<Transaction> getDuplicates() const;: get transactions with the same tx hash on another chain (forks without replay protection). Can be implemented as a range query on the HashIndex.
bool hasDuplicates() const;: check whether other transactions with the same tx hash exist. Can also be implemented as a range query on the HashIndex.
uint8_t getChainId() const;: see above (Block class)
Input
std::vector<std::pair<uint8_t, Transaction>> getSpentTxes() const;
std::vector<std::pair<uint8_t, Transaction>> getSpentTxes(std::vector<uint8_t> chainIds) const;
uint8_t getChainId() const;
Output
std::vector<uint32_t> getSpendingTxIndexes() const;
uint8_t getChainId() const;
std::vector<Transaction> getSpendingTxes() const;
std::vector<Input> getSpendingInputs() const;
std::vector<InputPointer> getSpendingInputPointers() const;
Address
std::vector<std::pair<uint8_t, int64_t>> calculateBalances(BlockHeight height) const;: get balances for all chains by chainId
Testing
I wondered what's the best way to test my extension (once I'm there). I think working with the original chain is cumbersome on a development machine (16GB RAM), as BCH only forks at block height 478,558 and thus, a lot of data has to be loaded/parsed after every code modification.
Do you have any input on that, what data do you use for testing and development? Local? On a server?
Maybe I can use the BTC and BCH testnets, if the fork is reflected there nicely; I'm not sure.
Malte's new testchain-generator also looks interesting. Maybe I can extend it and use it to generate two chains for testing purposes. What do you think?