Facilitate multiple harvesters #36

Open
tpmccallum opened this issue May 29, 2019 · 2 comments
Labels
enhancement: New feature or request
long term: Valuable enhancement for the future
reliability and performance: Tasks which relate to reliability of harvester and frontend

Comments

@tpmccallum
Contributor

The idea behind this is as follows:

  • Shard the workload so that each harvester processes a subset of the ABIs
  • Have an "in_progress" field in the "all" index (where the ABIs are stored) which indicates whether that unique ABI is currently being processed by an individual harvester
  • At the beginning of all harvesting, each of the harvesters can grab one unprocessed ABI (from the "all" index), sleep for a few seconds and then go back for seconds and thirds and so on (see the sketch after this list)
  • If there is only one harvester it will eventually end up with all of the ABIs (keeping in mind that it alone will process these via multi-threading, so this will still be quite efficient)
  • If there are many harvesters they will each take a percentage of the overall available unique ABIs to harvest
  • With regard to a harvester going offline, a simple watcher script (JS or Python) could compare the stored epoch against the current time and switch the "in_progress" field from true to false
  • This means, of course, that the harvester would have to periodically update the epoch field to ensure that the watcher does not set the "in_progress" field to false while it is still working
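A minimal sketch of that claim-and-heartbeat loop; the fetch_one_unclaimed, claim, heartbeat and process_abi helpers are hypothetical stand-ins for however the "all" index is actually accessed:

import time

def harvest_loop(fetch_one_unclaimed, claim, heartbeat, process_abi):
    # Keep grabbing unprocessed ABIs until none are left unclaimed.
    while True:
        record = fetch_one_unclaimed()   # a record whose in_progress field is false
        if record is None:
            break
        claim(record)                    # set in_progress to true in the "all" index
        heartbeat(record)                # stamp the epoch field; a long-running job would refresh it periodically
        process_abi(record)
        time.sleep(2)                    # brief pause before going back for seconds and thirds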
@tpmccallum added the enhancement, long term, and reliability and performance labels May 29, 2019
@tpmccallum
Contributor Author

tpmccallum commented May 31, 2019

Some notes from the most recent harvest_all.py code:

# indexingInProgress is a placeholder/lock for sharded harvesting in the future
outerData['indexingInProgress'] = "false"
# contractDestructed will be set by an external web3 script
outerData['contractDestructed'] = "false"
# epochOfLastUpdate will be updated by each sharded harvester/indexer; it is used to detect
# whether a sharded harvester/indexer has gone offline with the indexingInProgress flag still set to true.
# epochOfLastUpdate will be monitored by an external script whose purpose is to set
# indexingInProgress back to false when no recent activity is detected, i.e. when
# contractDestructed == "false" and indexingInProgress == "true" and (time.now - epochOfLastUpdate > 24 hours)
outerData['epochOfLastUpdate'] = block.timestamp
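A possible shape for the external watcher mentioned in those comments, as a sketch only; the scan_all_records and save_record helpers are placeholders for whatever the index actually exposes:

import time

STALE_AFTER_SECONDS = 24 * 60 * 60  # release locks that have seen no activity for 24 hours

def release_stale_locks(scan_all_records, save_record):
    now = int(time.time())
    for record in scan_all_records():
        if (record.get('contractDestructed') == "false"
                and record.get('indexingInProgress') == "true"
                and now - int(record.get('epochOfLastUpdate', 0)) > STALE_AFTER_SECONDS):
            record['indexingInProgress'] = "false"   # free the record for another harvester
            save_record(record)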

@tpmccallum
Contributor Author

tpmccallum commented Jun 3, 2019

Rather than setting up locks like "indexingInProgress" and needing watchers for elapsed-time fields like "epochOfLastUpdate", after much consideration I feel it would be more robust and efficient to assign each harvester its own subset of records. This way we are not constantly hitting the index with updates to "indexingInProgress" and "epochOfLastUpdate", and we do not have to maintain external watcher and clean-up scripts.

This new idea would be as follows.
Once an environment of independent harvesters/indexers is established (say 5 separate machines), an independent script would quickly loop through all of the records and assign the numbers 1 through 5 in rotation. At the end of this process each record would have a single number assigned to it. Each of the harvesters/indexers would then run using its number, i.e. indexer one would process all of the "1" records, indexer two would process all of the "2" records, and so on (see the sketch below).
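A sketch of that assignment script, assuming the records are reachable as dicts; the field name harvesterNumber and the scan_all_records/save_record helpers are hypothetical:

NUMBER_OF_HARVESTERS = 5

def assign_shards(scan_all_records, save_record):
    # Round-robin: record 0 -> 1, record 1 -> 2, ..., record 5 -> 1, and so on.
    for position, record in enumerate(scan_all_records()):
        record['harvesterNumber'] = (position % NUMBER_OF_HARVESTERS) + 1
        save_record(record)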

What makes this powerful is that each of the indexers is responsible for maintaining its own set of contract addresses (and the web3 contract instances associated with those contract addresses). Instantiating the web3 contract instances is one of the major bottlenecks, so randomly pulling addresses from a shared queue actually increases this overhead. It is better for individual harvesters to take ownership of their set of addresses and maintain a cache of the instances for quick and efficient access, along the lines sketched below.
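A sketch of the per-harvester cache, assuming web3.py and the hypothetical harvesterNumber field from the assignment step; the node URL and record field names are assumptions:

from web3 import Web3

web3 = Web3(Web3.HTTPProvider('http://localhost:8545'))
contract_cache = {}  # contractAddress -> instantiated web3 contract

def get_contract(address, abi):
    # Instantiate each contract once and reuse it on every subsequent harvest pass.
    if address not in contract_cache:
        contract_cache[address] = web3.eth.contract(
            address=Web3.toChecksumAddress(address), abi=abi)
    return contract_cache[address]

def harvest_my_shard(scan_all_records, my_number, process):
    for record in scan_all_records():
        if record.get('harvesterNumber') != my_number:
            continue  # this record belongs to another harvester
        contract = get_contract(record['contractAddress'], record['abi'])
        process(contract, record)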
