Facilitate multiple harvesters #36

Open
tpmccallum opened this issue May 29, 2019 · 2 comments
Labels
enhancement: New feature or request
long term: Valuable enhancement for the future
reliability and performance: Tasks which relate to reliability of harvester and frontend

Comments

@tpmccallum
Contributor

The idea behind this is as follows:

  • Shard the workload so that each harvester processes a subset of the ABIs
  • Have an "in_progress" field in the "all" index (where the ABIs are stored) which indicates whether that unique ABI is currently being processed by an individual harvester
  • At the beginning of all harvesting, each of the harvesters can grab one unprocessed ABI (from the "all" index), sleep for a few seconds and then go back for seconds and thirds and so on (see the sketch after this list)
  • If there is only one harvester it will eventually end up with all of the ABIs (keeping in mind that it alone will process these via multi-threading, so this will still be quite efficient)
  • If there are many harvesters they will each take a percentage of the overall available unique ABIs to harvest
  • With regard to a harvester going offline, a simple watcher script (JS or Python) could compare the stored epoch against the current time and switch the "in_progress" field from true to false
  • This means, of course, that the harvester would have to periodically update the epoch field to ensure that the watcher does not set the "in_progress" field to false while it is still working
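A minimal sketch of that claim-and-heartbeat loop; the fetch_one_unclaimed, claim, heartbeat and process_abi helpers are hypothetical stand-ins for however the "all" index is actually accessed:

import time

def harvest_loop(fetch_one_unclaimed, claim, heartbeat, process_abi):
    # Keep grabbing unprocessed ABIs until none are left unclaimed.
    while True:
        record = fetch_one_unclaimed()   # a record whose in_progress field is false
        if record is None:
            break
        claim(record)                    # set in_progress to true in the "all" index
        heartbeat(record)                # stamp the epoch field; a long-running job would refresh it periodically
        process_abi(record)
        time.sleep(2)                    # brief pause before going back for seconds and thirds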
@tpmccallum added the enhancement, long term, and reliability and performance labels May 29, 2019
@tpmccallum
Contributor Author

tpmccallum commented May 31, 2019

Some notes from the most recent harvest_all.py code:

# indexingInProgress is a placeholder/lock for sharded harvesting in the future
outerData['indexingInProgress'] = "false"
# contractDestructed will be set by an external web3 script
outerData['contractDestructed'] = "false"
# epochOfLastUpdate will be updated by each sharded harvester/indexer; it is used to detect
# whether a sharded harvester/indexer has gone offline with the indexingInProgress flag still set to true.
# epochOfLastUpdate will be monitored by an external script whose purpose is to set
# indexingInProgress back to false when no recent activity is detected, i.e. when
# contractDestructed == "false" and indexingInProgress == "true" and (time.now - epochOfLastUpdate > 24 hours)
outerData['epochOfLastUpdate'] = block.timestamp
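A possible shape for the external watcher mentioned in those comments, as a sketch only; the scan_all_records and save_record helpers are placeholders for whatever the index actually exposes:

import time

STALE_AFTER_SECONDS = 24 * 60 * 60  # release locks that have seen no activity for 24 hours

def release_stale_locks(scan_all_records, save_record):
    now = int(time.time())
    for record in scan_all_records():
        if (record.get('contractDestructed') == "false"
                and record.get('indexingInProgress') == "true"
                and now - int(record.get('epochOfLastUpdate', 0)) > STALE_AFTER_SECONDS):
            record['indexingInProgress'] = "false"   # free the record for another harvester
            save_record(record)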

@tpmccallum
Contributor Author

tpmccallum commented Jun 3, 2019

Rather than setting up locks like "indexingInProgress" and needing watchers for elapsed-time fields like "epochOfLastUpdate", after much consideration I feel it would be more robust and efficient to assign each harvester its own subset of records. This way we are not constantly hitting the index with updates to "indexingInProgress" and "epochOfLastUpdate", and we do not have to maintain external watcher and clean-up scripts.

This new idea would be as follows.
Once an environment of independent harvesters/indexers is established (say 5 separate machines), an independent script would quickly loop through all of the records and assign the numbers 1 through 5 in rotation. At the end of this process each record would have a single number assigned to it. Each of the harvesters/indexers would then run using its number, i.e. indexer one would process all of the "1" records, indexer two would process all of the "2" records, and so on (see the sketch below).
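A sketch of that assignment script, assuming the records are reachable as dicts; the field name harvesterNumber and the scan_all_records/save_record helpers are hypothetical:

NUMBER_OF_HARVESTERS = 5

def assign_shards(scan_all_records, save_record):
    # Round-robin: record 0 -> 1, record 1 -> 2, ..., record 5 -> 1, and so on.
    for position, record in enumerate(scan_all_records()):
        record['harvesterNumber'] = (position % NUMBER_OF_HARVESTERS) + 1
        save_record(record)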

What makes this powerful is that each of the indexers is responsible for maintaining its own set of contract addresses (and the web3 contract instances associated with those contract addresses). Instantiating the web3 contract instances is one of the major bottlenecks, so randomly pulling addresses from a shared queue actually increases this overhead. It is better for individual harvesters to take ownership of their set of addresses and maintain a cache of the instances for quick and efficient access, along the lines sketched below.
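A sketch of the per-harvester cache, assuming web3.py and the hypothetical harvesterNumber field from the assignment step; the node URL and record field names are assumptions:

from web3 import Web3

web3 = Web3(Web3.HTTPProvider('http://localhost:8545'))
contract_cache = {}  # contractAddress -> instantiated web3 contract

def get_contract(address, abi):
    # Instantiate each contract once and reuse it on every subsequent harvest pass.
    if address not in contract_cache:
        contract_cache[address] = web3.eth.contract(
            address=Web3.toChecksumAddress(address), abi=abi)
    return contract_cache[address]

def harvest_my_shard(scan_all_records, my_number, process):
    for record in scan_all_records():
        if record.get('harvesterNumber') != my_number:
            continue  # this record belongs to another harvester
        contract = get_contract(record['contractAddress'], record['abi'])
        process(contract, record)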
