Dshackle Archive is a tool that efficiently extracts blockchain data in JSON format and archives it into Avro files for scalable analysis using traditional Big Data tools. It supports Bitcoin and Ethereum-compatible blockchains and can operate in both batch mode for historical data and streaming mode for real-time data extraction.
Dshackle Archive copies JSON data from a blockchain into a plain-file archive (i.e., it performs the Extraction step of ETL). The archive consists of Avro files that contain blocks and transaction details fetched from blockchain nodes via their API. The Avro files contain the JSON responses as is.
Features:
- Extracts data from Bitcoin and Ethereum-compatible blockchains
- Produces data in Avro format, where one file can cover a range of blocks, or there can be a separate file per block
- Archival can be scaled across multiple blockchain nodes by using the Dshackle Load Balancer
- Runs in batch mode, when historical data is archived, or in streaming mode, when only fresh data is added to the archive
- Archives to a filesystem, Google Cloud Storage, or S3
- Produces notifications via Google Pub/Sub or Apache Pulsar
The general idea is to use Dshackle Archive to copy data from a blockchain to plain files, keeping the data structures as is, and then use traditional Big Data tools (Spark, Beam, etc.) to analyse the data in a scalable way.
Note: It takes several weeks to archive a whole blockchain. It's a very IO-intensive operation, but it can be scaled by using multiple nodes.
Note: Dshackle Archive uses the Dshackle protocol to connect to the blockchain, not plain HTTP/JSON RPC. So a running Dshackle instance is required.
usage: dshackle-archive [options] <command>
Copy blockchain data into files for further analysis

    --auth.aws.accessKey <arg>    AWS / S3 Access Key
    --auth.aws.secretKey <arg>    AWS / S3 Secret Key
    --auth.gcp <arg>              Path to GCP Authentication JSON
    --aws.endpoint <arg>          AWS / S3 endpoint url instead of the default one
    --aws.region <arg>            AWS / S3 region ID to use for requests
    --aws.s3.pathStyle            Enable S3 Path Style access (default is false)
    --aws.trustTls                Trust any TLS certificate for AWS / S3 (default is false)
 -b,--blockchain <arg>            Blockchain
    --back                        Apply the blocks range (--range option) back from the current
                                  blockchain height, i.e. process N..M counting from the current
                                  height instead of from zero
 -c,--connection <arg>            Connection (host:port)
    --compact.forks               Accept all blocks, including forks, in stream compaction
    --compact.ranges              Process range files as well in stream compaction
    --connection.notls            Disable TLS
    --connection.timeout <arg>    Timeout (in seconds) to get data from the blockchain before
                                  retrying. Default: 60 seconds
    --continue                    Continue from the last file if set
 -d,--dir <arg>                   Target directory
    --deduplicate                 Deduplicate transactions and blocks (could increase memory footprint)
    --dirBlocks <arg>             How many blocks to keep per subdirectory
    --dryRun                      Do not modify the storage
 -h,--help                        Show Help
 -i,--inputs <arg>                Input File(s). Accepts a glob pattern for the filename
    --include <arg>               Include optional details in JSON (trace, stateDiff)
    --notify.dir <arg>            Write notifications as JSON lines to the specified dir, in a file
                                  <dshackle-archive-%STARTTIME.jsonl>
    --notify.pubsub <arg>         Send notifications as JSON to the specified Google Pub/Sub topic
    --notify.pulsar.topic <arg>   Send notifications as JSON to the specified Pulsar topic
                                  (notify.pulsar.url must be specified)
    --notify.pulsar.url <arg>     Send notifications as JSON to Pulsar at the specified URL
                                  (notify.pulsar.topic must be specified)
    --parallel <arg>              How many blocks to request in parallel. Range: 1..512.
                                  Default: 8
    --prefix <arg>                File prefix
 -r,--range <arg>                 Blocks Range (N...M)
    --rangeChunk <arg>            Range chunk size (default 1000)
    --tail <arg>                  Last T blocks to ensure are archived in streaming mode

Available commands:
 archive - the main operation, copies data from a blockchain to the archive
 stream  - append fresh blocks one by one to the archive
 compact - merge individual block files into larger range files
 copy    - copy/recover from an existing archive by copying into a new one
 report  - show a summary of what is in the archive for the specified range
 fix     - fix the archive by making new archives for missing chunks
 verify  - verify that archive files contain the required data, and delete incomplete files
The main operation, copies historical data from a blockchain to archive.
The data is copied in ranges, and with the default range of 1000 blocks it produces two files per range: one for the blocks in that range, and another with all transactions in all blocks in that range.
See Archive Format.
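For illustration, a batch run over the first million blocks might look like the following sketch. The host:port, target directory, and the ETHEREUM blockchain identifier are placeholder assumptions; the flags themselves are from the usage reference above.

```shell
# Archive blocks 0..1,000,000 through a Dshackle instance,
# fetching 16 blocks in parallel and producing one pair of Avro files
# per 1000-block chunk (the default --rangeChunk)
dshackle-archive archive \
  -b ETHEREUM \
  -c dshackle-host:2449 \
  -d ./archive/eth \
  -r 0...1000000 \
  --parallel 16
```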
Continuously appends fresh blocks one by one to the archive. In addition to the copying, Dshackle Archive can be configured to notify an external system about new blocks in the archive.
Note that in streaming mode the archives are written on a per-block basis, i.e., each block produces a separate pair of files: one for the block itself, and another for all transactions in that block.
To merge the individual files into larger range files, use the compact command.
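As a sketch, a streaming run followed by a later compaction could look like this. The host, port, and directories are placeholders, and the exact flag combination is an assumption based on the usage reference above.

```shell
# Continuously append fresh blocks, one pair of files per block,
# writing a JSON-line notification for each archived file
dshackle-archive stream \
  -b ETHEREUM \
  -c dshackle-host:2449 \
  -d ./archive/eth \
  --notify.dir ./notifications

# Later (e.g. from a daily job): merge the per-block files into range files
dshackle-archive compact -b ETHEREUM -d ./archive/eth
```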
To notify an external system, there are three options:
- --notify.dir - write notifications as JSON lines to the specified dir, in a file <dshackle-archive-%STARTTIME.jsonl>
- --notify.pubsub - send notifications as JSON to the specified Google Pub/Sub topic
- --notify.pulsar.url + --notify.pulsar.topic - send notifications as JSON to the specified Apache Pulsar topic
See Notification format.
Copy from one archive to another.
Technically, you can copy the files as is, but the command is useful because it lets you change the range sizes for the target archive. It can also be used to recover from a corrupted archive, because it makes additional checks and skips the corrupted data.
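A hypothetical invocation, assuming the source files are selected with the -i/--inputs glob and the target range size is set with --rangeChunk (paths and the glob pattern are placeholders):

```shell
# Rewrite an existing archive into a new one with 10000-block range files;
# corrupted source data is detected and skipped along the way
dshackle-archive copy \
  -b ETHEREUM \
  -i './archive/eth/**/*.avro' \
  -d ./archive-10k/eth \
  --rangeChunk 10000
```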
Fixes the archive by checking whether any blocks are missing, and if so, creating new archives for the missing blocks.
Because Dshackle Archive copies and stores data as JSON responses from blockchain nodes, the resulting archive is much larger than the node database, which keeps data in a compact format. Snappy compression is used for the Avro files, which gives a good compression ratio, but the resulting archive is still large.
Average size of a 1000-block range (without expensive JSON such as stateDiff and trace):
- ~300Mb for Ethereum
- ~400Mb for Bitcoin

And the whole archive (without expensive JSON such as stateDiff and trace):
- ~2.5Tb for Ethereum
- ~1.9Tb for Bitcoin
- Avro structure and Java stubs: https://github.com/emeraldpay/dshackle-archive-avro
- Dshackle load balancer: https://github.com/emeraldpay/dshackle
- ✓ support AWS S3 as a storage
- ✓ support Pulsar as a notification system
- ❏ support Kafka as a notification system
- ❏ archive to Cassandra
- First you need to archive the historical data, which may take several weeks depending on how many nodes you have and how fast they are.
- After finishing the initial archive, you run in streaming mode, which appends new blocks to the archive as they are mined.
- Periodically (e.g. once a day) you run compaction to merge individual block files into larger range files.
- Also periodically (e.g. once a day) you run the pair of verify and fix commands to ensure the integrity of the archive.
Dshackle requires compatibility only on the JSON RPC level, so technically it can work with any blockchain that uses a similar API. I.e., it's compatible with all major blockchains, including Bitcoin, Ethereum, Binance Smart Chain, Polygon, etc.
Dshackle Archive uses the Dshackle protocol to connect to the blockchain, not plain HTTP/JSON RPC, so a running Dshackle instance is required.
Dshackle is a load balancer for blockchain APIs, and it can route requests to multiple nodes, which scales up the archival throughput.
Dshackle provides two commands to ensure the integrity of the archive:
- first you run the verify command, which checks the archive and deletes incomplete or corrupted files
- then you run the fix command, which copies the data again for the blocks deleted in the previous step
You can schedule the execution of these commands to run periodically, e.g. once a day.
To avoid scanning the whole archive every time, you can specify a range to check, e.g. --back --range 100...1100.
The options above specify that it should verify/fix only the last 1000 blocks, starting 100 blocks behind the current height.
I.e., it goes backward from the current head block.
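For example, a daily integrity check over such a window could be sketched as follows (the host:port, directory, and blockchain identifier are placeholder assumptions):

```shell
# Verify the last 1000 blocks, starting 100 blocks behind the current head;
# incomplete or corrupted files in that window are deleted...
dshackle-archive verify -b ETHEREUM -d ./archive/eth --back -r 100...1100

# ...then fix re-archives whatever verify has deleted
dshackle-archive fix -b ETHEREUM -c dshackle-host:2449 -d ./archive/eth --back -r 100...1100
```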
For the complete description, schema, and libraries to access the Avro files, please refer to https://github.com/emeraldpay/dshackle-archive-avro
- blockchainType - type of blockchain, as a definition of which fields to expect. One of ETHEREUM or BITCOIN
- blockchainId - actual blockchain id (ETH, BTC, etc.)
- archiveTimestamp - when the archive record was created. Milliseconds since epoch
- height - block height
- blockId - block hash
- timestamp - block timestamp. Milliseconds since epoch
- parentId - parent block hash
- json - JSON response for that block
- unclesCount - number of uncles for the current block
- uncle0Json - JSON for the first uncle (eth_getUncleByBlockHashAndIndex(0))
- uncle1Json - JSON for the second uncle (eth_getUncleByBlockHashAndIndex(1))
- none
- blockchainType - type of blockchain, as a definition of which fields to expect. One of ETHEREUM or BITCOIN
- blockchainId - blockchain id (ETH, BTC, etc.)
- archiveTimestamp - when the archive record was created. Milliseconds since epoch
- height - block height
- blockId - block hash
- timestamp - block timestamp. Milliseconds since epoch
- index - index of the transaction in the block
- txid - hash or transaction id of the transaction
- json - JSON response for that transaction
- raw - raw bytes of the transaction
- from - from address
- to - to address
- receiptJson - JSON response for eth_getTransactionReceipt
- traceJson - JSON response for trace_replayTransaction(trace)
- stateDiffJson - JSON response for trace_replayTransaction(stateDiff)
- none
{
"version":"https://schema.emrld.io/dshackle-archive/notify",
"ts":"2022-05-20T23:14:24.481327Z",
"blockchain":"ETH",
"type":"transactions",
"run":"stream",
"heightStart":14813875,
"heightEnd":14813875,
"location":"gs://my-bucket/blockchain-archive/eth/014000000/014813000/014813875.txes.avro"
}
- version - id of the current JSON format
- ts - timestamp of the archive event
- blockchain - blockchain
- type - type of file (transactions or blocks)
- run - mode in which Dshackle Archive is run (archive, stream, copy or compact)
- heightStart and heightEnd - range of blocks in the archived files
- location - a URL to the archived file
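For a downstream consumer, the location field is typically the piece that matters. Below is a minimal sketch that extracts it from a notification line with plain POSIX tools; a JSON-aware tool such as jq is preferable in production, since this grep/cut pipeline only handles flat, unescaped values. The sample line is the example above.

```shell
# One notification line, as emitted by e.g. --notify.dir (from the example above)
notification='{"version":"https://schema.emrld.io/dshackle-archive/notify","ts":"2022-05-20T23:14:24.481327Z","blockchain":"ETH","type":"transactions","run":"stream","heightStart":14813875,"heightEnd":14813875,"location":"gs://my-bucket/blockchain-archive/eth/014000000/014813000/014813875.txes.avro"}'

# Extract the archived file location: match the "location" key/value pair,
# then take the value between the 3rd and 4th double quotes
location=$(printf '%s' "$notification" | grep -o '"location":"[^"]*"' | cut -d'"' -f4)
echo "$location"
```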
Want to support the project, prioritize a specific feature, or get commercial help with using Dshackle in your project? Please contact [email protected] to discuss the possibility.
Copyright 2023 EmeraldPay, Inc
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.