Refactor batch building for better performance and concurrency #6

Closed
anomit opened this issue Aug 29, 2024 · 4 comments
@anomit
Member

anomit commented Aug 29, 2024

Describe the bug

For every epoch, the following happens with eligible submissions received within the deadline:

  • submissions are bundled into batches
  • merkle trees of submission IDs and finalized CIDs are built for each batch
  • batches are uploaded to IPFS
  • their proofs are anchored to the protocol state contract

These steps run into bottlenecks when the batch size threshold is low, when there are network issues with IPFS or the anchor contract calls, or when other CPU-bound work is on the critical path. The current implementation does not fully exploit goroutines to overlap this work.
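A minimal sketch of how the per-batch work could be overlapped with goroutines, assuming hypothetical Batch, uploadToIPFS and anchorProof stand-ins rather than the actual sequencer types; the point is only that one slow IPFS store or contract call should not serialize the whole epoch:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Batch is a hypothetical stand-in for a built batch of submissions.
type Batch struct {
	ID   int
	CIDs []string
}

// uploadToIPFS and anchorProof are placeholders for the real IPFS store
// call and the protocol state contract call.
func uploadToIPFS(ctx context.Context, b Batch) (string, error) { return "bafy...", nil }
func anchorProof(ctx context.Context, b Batch, cid string) error { return nil }

// processBatches fans the per-batch work out across a bounded number of
// goroutines so a slow IPFS upload or contract call only stalls its own batch.
func processBatches(ctx context.Context, batches []Batch) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(8) // bound concurrency instead of spawning one goroutine per batch

	for _, b := range batches {
		b := b
		g.Go(func() error {
			cid, err := uploadToIPFS(ctx, b)
			if err != nil {
				return fmt.Errorf("batch %d: ipfs store: %w", b.ID, err)
			}
			return anchorProof(ctx, b, cid)
		})
	}
	return g.Wait()
}

func main() {
	batches := []Batch{{ID: 1}, {ID: 2}, {ID: 3}}
	if err := processBatches(context.Background(), batches); err != nil {
		fmt.Println("epoch processing failed:", err)
	}
}
```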

To Reproduce

Affected versions:

Steps to reproduce the behavior:
As described above

Expected behavior

Batch building, along with updating of submission counts, can be greatly simplified by using worker groups and by setting a minimum permissible threshold for the number of submissions included in a batch.
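To make the "worker groups plus minimum threshold" idea concrete, here is a rough sketch; minBatchSize, maxBatchSize and numWorkers are illustrative values, not actual deployment parameters:

```go
package main

import (
	"fmt"
	"sync"
)

const (
	minBatchSize = 10 // hypothetical minimum permissible submissions per batch
	maxBatchSize = 50
	numWorkers   = 4
)

// buildBatch is a placeholder for merkle tree construction and IPFS upload.
func buildBatch(worker int, submissions []string) {
	fmt.Printf("worker %d building batch of %d submissions\n", worker, len(submissions))
}

func main() {
	submissionCh := make(chan string, 1024)
	batchCh := make(chan []string)

	// A single batcher goroutine groups submissions; a trailing group below
	// the minimum threshold is not turned into a batch.
	go func() {
		var pending []string
		for s := range submissionCh {
			pending = append(pending, s)
			if len(pending) >= maxBatchSize {
				batchCh <- pending
				pending = nil
			}
		}
		if len(pending) >= minBatchSize {
			batchCh <- pending
		}
		close(batchCh)
	}()

	// A worker group drains the batches concurrently.
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for batch := range batchCh {
				buildBatch(w, batch)
			}
		}(w)
	}

	// Feed some submissions and shut down.
	for i := 0; i < 120; i++ {
		submissionCh <- fmt.Sprintf("submission-%d", i)
	}
	close(submissionCh)
	wg.Wait()
}
```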

Proposed Solution
WIP.

Caveats
WIP. To be expanded.

Additional context
NA

@anomit anomit added the bug Something isn't working label Aug 29, 2024
@Sulejman
Contributor

Sulejman commented Sep 2, 2024

I've written a few tests based on a real-world scenario from the last time batching broke; they are contained here:

https://github.com/PowerLoom/submission-sequencer-batcher/compare/feat/batching-tests?expand=1

The cause of the issue is most probably increased IPFS store response times during certain periods. I am currently implementing changes to the batching logic that will help avoid this issue even when we have to wait longer for the IPFS response.

@anomit
Member Author

anomit commented Sep 22, 2024

After testing the changes pushed so far in the staging environment, the following issues have been observed.

The stress test was to fire, per epoch, 720 submissions from a full node plus 400 submissions from 200 lite nodes, i.e. 1120 snapshot submissions per epoch.

Nonces out of order and missing tx receipts

This has been observed to be mitigated by lowering the BATCH_SIZE deployment parameter from 250 to 50 and by assigning 5 signer accounts for batch submissions instead of 2.

Larger batch sizes produce batch submission transactions carrying larger arrays of project IDs and finalized CIDs, which can cause the tx to be dropped because of block space limits.
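One way to keep nonces ordered across several signer accounts is to track a locally managed nonce per account and only hand out free accounts to batch submissions. The sketch below is purely illustrative; the field names and pool logic are assumptions, not the txManager's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// signer tracks a locally managed nonce per signing account so that
// concurrent batch submissions from the same key cannot go out of order.
type signer struct {
	mu      sync.Mutex
	address string
	nonce   uint64
	busy    bool
}

// signerPool hands out free signer accounts, mirroring the idea of spreading
// batch submissions across 5 accounts instead of 2.
type signerPool struct {
	mu      sync.Mutex
	signers []*signer
}

func (p *signerPool) acquire() *signer {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, s := range p.signers {
		if !s.busy {
			s.busy = true
			return s
		}
	}
	return nil // caller queues or retries when every signer is in use
}

func (p *signerPool) release(s *signer) {
	p.mu.Lock()
	s.busy = false
	p.mu.Unlock()
}

// nextNonce hands out strictly increasing nonces for this account.
func (s *signer) nextNonce() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := s.nonce
	s.nonce++
	return n
}

func main() {
	pool := &signerPool{signers: []*signer{
		{address: "0xsigner1", nonce: 100},
		{address: "0xsigner2", nonce: 42},
	}}
	s := pool.acquire()
	fmt.Println(s.address, "submits batch with nonce", s.nextNonce())
	pool.release(s)
}
```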

The following logs show such a situation in progress. Full logs are attached.

time="2024-09-16T13:53:37Z" level=debug msg="Fetched 8 transactions for epoch 8496" func="collector/pkgs/helpers/prost.(*TxManager).EnsureBatchSubmissionSuccess" file="/app/pkgs/helpers/prost/txManager.go:503"
time="2024-09-16T13:53:38Z" level=error msg="Receipt not found in Redis for tx 0xf54950f63544918cb32e88f48a0b92000256e4412791976da2891a79769a4fea" func="collector/pkgs/helpers/prost.(*TxManager).GetTxReceipt.func1" file="/app/pkgs/helpers/prost/txManager.go:76"
time="2024-09-16T13:53:39Z" level=error msg="Receipt not found in Redis for tx 0xf54950f63544918cb32e88f48a0b92000256e4412791976da2891a79769a4fea" func="collector/pkgs/helpers/prost.(*TxManager).GetTxReceipt.func1" file="/app/pkgs/helpers/prost/txManager.go:76"
time="2024-09-16T13:53:40Z" level=error msg="Receipt not found in Redis for tx 0xf54950f63544918cb32e88f48a0b92000256e4412791976da2891a79769a4fea" func="collector/pkgs/helpers/prost.(*TxManager).GetTxReceipt.func1" file="/app/pkgs/helpers/prost/txManager.go:76"
time="2024-09-16T13:53:41Z" level=error msg="Receipt not found in Redis for tx 

nonce_errors.log

Batch submissions being retried for the same batch ID and epoch ID

With appropriately sized batches and enough signers, batch submissions appear to go through, yet the same batch submission gets retried multiple times, leading to failed txs.

Failed txs:
https://explorer-prost1m.powerloom.io/tx/0x926eff56670ae98670188c34d64d0f11bca7b20a4e883c2fbd0fdf55d2a3eef7
https://explorer-prost1m.powerloom.io/tx/0xe6fa8187b3985de801c62cf27507266ecdb3e31217c5793172fa10dffa17fbba
https://explorer-prost1m.powerloom.io/tx/0x83fecb98fb4438ee03b21814b3a1871e7d7116b6b205f19fc8cdb377d598a77f
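A guard keyed on (epochID, batchID) would avoid re-firing a submission that has already been confirmed. This is a minimal in-memory sketch; in the actual service this state would presumably live in Redis alongside the tx receipts:

```go
package main

import (
	"fmt"
	"sync"
)

// submissionTracker remembers which (epochID, batchID) pairs already have a
// confirmed on-chain submission so the retry loop does not fire duplicate
// transactions that are guaranteed to fail.
type submissionTracker struct {
	mu        sync.Mutex
	confirmed map[string]bool
}

func key(epochID, batchID uint64) string {
	return fmt.Sprintf("%d:%d", epochID, batchID)
}

func (t *submissionTracker) markConfirmed(epochID, batchID uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.confirmed[key(epochID, batchID)] = true
}

// shouldRetry returns false once a batch submission has been confirmed,
// short-circuiting further retries for the same batch ID and epoch ID.
func (t *submissionTracker) shouldRetry(epochID, batchID uint64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return !t.confirmed[key(epochID, batchID)]
}

func main() {
	t := &submissionTracker{confirmed: map[string]bool{}}
	fmt.Println(t.shouldRetry(1025, 3)) // true: not yet confirmed
	t.markConfirmed(1025, 3)
	fmt.Println(t.shouldRetry(1025, 3)) // false: skip the duplicate retry
}
```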

End batch submissions sent for an epoch without any batches actually being submitted

https://explorer-prost1m.powerloom.io/tx/0x83a5e34b95a4d52eb0e8086d493ad9eaccb7867272ae8d0c4f50bd3312dfeac7

[Screenshot 2024-09-22 at 9:00:43 PM]

IPFS nodes becoming unavailable meant batches were not uploaded, yet an end-of-batch submission was still indicated.

The following logs show the batch upload failures.

func=collector/pkgs/helpers/clients.SendFailureNotification file="/app/pkgs/helpers/clients/reporting.go:70"
time="2024-09-22T15:31:51Z" level=error msg="Error storing batch on IPFS: 1" func=collector/pkgs/helpers/merkle.BuildBatch file="/app/pkgs/helpers/merkle/merkle.go:267"
time="2024-09-22T15:31:51Z" level=debug msg="Reporting service response status:  200 OK" func=collector/pkgs/helpers/clients.SendFailureNotification file="/app/pkgs/helpers/clients/reporting.go:70"
time="2024-09-22T15:31:51Z" level=error msg="Error storing batch on IPFS: 1" func=collector/pkgs/helpers/merkle.BuildBatch file="/app/pkgs/helpers/merkle/merkle.go:267"
time="2024-09-22T15:31:51Z" level=debug msg="Reporting service response status:  200 OK" func=collector/pkgs/helpers/clients.SendFailureNotification file="/app/pkgs/helpers/clients/reporting.go:70"
time="2024-09-22T15:31:51Z" level=debug msg="Reporting service response status:  200 OK" func=collector/pkgs/helpers/clients.SendFailureNotification file="/app/pkgs/helpers/clients/reporting.go:70"
time="2024-09-22T15:31:51Z" level=error msg="Error storing the batch:  Post \"http://ip-172-31-20-190.us-east-2.compute.internal:5001/api/v0/add?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" func=collector/pkgs/helpers/merkle.finalizeBatches.func1 file="/app/pkgs/helpers/merkle/merkle.go:179"
time="2024-09-22T15:31:51Z" level=error msg="Error storing batch on IPFS: 1" func=collector/pkgs/helpers/merkle.BuildBatch file="/app/pkgs/helpers/merkle/merkle.go:267"

These failures are then followed by a single transaction that indicates the end of batch submissions.

time="2024-09-22T16:17:45Z" level=debug msg="Verifying all batch submissions" func=collector/pkgs/helpers/prost.triggerCollectionFlow file="/app/pkgs/helpers/prost/processor.go:241"
time="2024-09-22T16:17:45Z" level=debug msg="Processing block:  2489386" func=collector/pkgs/helpers/prost.StartFetchingBlocks file="/app/pkgs/helpers/prost/chain.go:77"
time="2024-09-22T16:17:45Z" level=debug msg="Locking account:  0x53873f58840Ea024ebE2E44f499b1195eF96310d" func="collector/pkgs/helpers/prost.(*AccountHandler).GetFreeAccount" file="/app/pkgs/helpers/prost/accountHandler.go:119"
time="2024-09-22T16:17:45Z" level=debug msg="No transactions remaining for epochID:  1025" func="collector/pkgs/helpers/prost.(*TxManager).EnsureBatchSubmissionSuccess" file="/app/pkgs/helpers/prost/txManager.go:481"
time="2024-09-22T16:17:45Z" level=debug msg="Releasing account:  0x53873f58840Ea024ebE2E44f499b1195eF96310d" func="collector/pkgs/helpers/prost.(*AccountHandler).ReleaseAccount" file="/app/pkgs/helpers/prost/accountHandler.go:142"
time="2024-09-22T16:17:45Z" level=debug msg="Locking account:  0x53873f58840Ea024ebE2E44f499b1195eF96310d" func="collector/pkgs/helpers/prost.(*AccountHandler).GetFreeAccount" file="/app/pkgs/helpers/prost/accountHandler.go:119"

Delay of ~2 epochs in batches being built

The following logs indicate that batch building for epoch 1024 began only 2 minutes after epoch 1025 was released. That is inconsistent with the expectation that the process for 1024 should already have begun by the time the epoch release for 1025 arrived, and it is effectively a delay of >=2 epochs in the batches being built and ultimately submitted.

time="2024-09-22T16:09:37Z" level=debug msg="Epoch Released at block 2489142: 1025\n" func=collector/pkgs/helpers/prost.ProcessEvents file="/app/pkgs/helpers/prost/processor.go:55"

[...]

file="/app/pkgs/helpers/merkle/merkle.go:45"
time="2024-09-22T16:11:43Z" level=debug msg="Fetched 392 keys for epoch 1024" func=collector/pkgs/helpers/merkle.BuildBatchSubmissions file="/app/pkgs/helpers/merkle/merkle.go:49"
time="2024-09-22T16:11:43Z" level=debug msg="Arranged keys in batches: " func=collector/pkgs/helpers/merkle.BuildBatchSubmissions file="/app/pkgs/helpers/merkle/merkle.go:67"
time="2024-09-22T16:11:43Z" level=debug msg="[[1024.pairContract_trade_volume:0xa2107fa5b38d9bbd2c461d6edf11b11a50f6b974:UNISWAPV2.401[...]

@anomit
Member Author

anomit commented Oct 3, 2024

The monolith is being refactored along the lines of https://github.com/PowerLoom/libp2p-submission-sequencer-listener and https://github.com/PowerLoom/sequencer-dequeuer to decouple and parallelize the following workloads (a rough sketch of the split follows the list):

  • CID finalization and building the merkle tree of each individual batch
  • committing the batches to an external transaction relay service
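A rough sketch of that split, assuming a hypothetical finalizedBatch message and a simple pairwise-hash merkle root (the real tree construction may differ): the merkle-building stage and the relay stage communicate only through a channel, so either side can be scaled or replaced independently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// finalizedBatch is an illustrative message passed from the merkle-building
// stage to the (hypothetical) transaction relay stage.
type finalizedBatch struct {
	epochID    uint64
	merkleRoot string
}

// merkleRoot computes a simple pairwise-hash root over submission IDs.
func merkleRoot(ids []string) string {
	level := make([][]byte, 0, len(ids))
	for _, id := range ids {
		h := sha256.Sum256([]byte(id))
		level = append(level, h[:])
	}
	for len(level) > 1 {
		var next [][]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i]) // carry an odd leaf up unchanged
				continue
			}
			h := sha256.Sum256(append(level[i], level[i+1]...))
			next = append(next, h[:])
		}
		level = next
	}
	return hex.EncodeToString(level[0])
}

func main() {
	relay := make(chan finalizedBatch)

	// Stage 1: finalize CIDs and build merkle trees, independent of relaying.
	go func() {
		ids := []string{"submission-1", "submission-2", "submission-3"}
		relay <- finalizedBatch{epochID: 1025, merkleRoot: merkleRoot(ids)}
		close(relay)
	}()

	// Stage 2: commit batches via the external transaction relay service
	// (stubbed as a print here).
	for b := range relay {
		fmt.Printf("relay batch for epoch %d, root %s\n", b.epochID, b.merkleRoot)
	}
}
```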

@anomit
Member Author

anomit commented Oct 7, 2024

The specific work of finalizing CIDs within batches and building their merkle trees is being refactored into a new component, which will be deployed in staging and put to the test over the next couple of days.

Also, as noted in this issue comment, a lot of the issues detailed in my last report on this thread will be eliminated once we move away from this component.

@anomit anomit closed this as completed Oct 16, 2024