[Living Ticket] Scalability related efforts #621

Open
1 of 11 tasks
okdas opened this issue Jun 18, 2024 · 4 comments

Comments

@okdas
Member

okdas commented Jun 18, 2024

Objective

Ensure that Shannon scales both on-chain & off-chain.

Origin Document

This issue is intended to be a living document to keep track of all related efforts.

Identified issues and points of investigation

  • RelayMiner RAM usage: [Relay Miner] Address high memory usage #551
  • AppGateServer (and Gateway) CPU #Infrastructure
    • Relays are not going through, and CPU utilization is at its limit. We need to capture pprof snapshots & evaluate them.
  • Validator scalability
    • Ensure the validator's resource usage (CPU, RAM, etc.) stays reasonable when the number of claims & proofs grows VERY LARGE. Note: This is why we need distribution of claims & proofs.
    • Probabilistic Proofs: make sure these parameters are adjusted properly for both: #Algorithmic
      • Validator scalability #Infrastructure
      • Discouraging adversarial actors #Algorithmic
  • Relay Mining #Algorithmic #Infrastructure
    • Ensure gateway consumption is reasonable #Infrastructure
    • Ensure relayminer footprint is as small as possible #Infrastructure

Things to investigate:

  • Replacing the KV store in the SMT (e.g. BadgerDB or other)
  • Keeping things in memory or flushing to disk
  • Changing parameters


Creator: @okdas
Co-Owners: @red-0ne @bryanchriswhite @Olshansk

@okdas okdas added this to the Shannon Beta TestNet Launch milestone Jun 18, 2024
@okdas okdas self-assigned this Jun 18, 2024
@Olshansk Olshansk changed the title [Scalability] Living ticket: tracking related efforts [Living Ticket] Scalability related efforts Jun 19, 2024
@Olshansk
Member

@okdas Made some changes, updates & improvements to this ticket. PTAL

@okdas
Member Author

okdas commented Aug 17, 2024

To investigate: ran into a panic. We are potentially not handling the error from the RPC gracefully:

Panic Error

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3b15c58]

Goroutine Stack Trace

goroutine 297 [running]:
github.com/pokt-network/poktroll/pkg/relayer/session.(*sessionTree).Delete(0x4003e33040)
    /Users/dk/pocket/poktroll/pkg/relayer/session/sessiontree.go:250 +0xc8

github.com/pokt-network/poktroll/pkg/relayer/session.(*relayerSessionsManager).deleteExpiredSessionTreesFn.func1({0x51e5e00, 0x4000c07830}, {0x4001675aa0, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/relayer/session/session.go:456 +0x278

github.com/pokt-network/poktroll/pkg/observable/channel.ForEach[...].func1({0x4001675aa0, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:103 +0x6c

github.com/pokt-network/poktroll/pkg/observable/channel.goMapTransformNotification[...]({0x51e5e00, 0x4000c07830}, {0x51df2b0, 0x400157b620}, 0x40012bd008, 0x40012bd050, 0x40012da480)
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:125 +0xc4

created by github.com/pokt-network/poktroll/pkg/observable/channel.Map[...] in goroutine 1
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:24 +0x318

Related Log Messages

2024-08-16 17:19:22.783    {"level":"debug","message":"deleting expired session"}

2024-08-16 17:19:22.781    {"level":"error","error":"with hash: a451156fe642c5f425af9bc1818ae423307789be0a4c581d26621f7fc698a419: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: tx (A451156FE642C5F425AF9BC1818AE423307789BE0A4C581D26621F7FC698A419) not found: error encountered while querying for tx","message":"failed to create claims"}

2024-08-16 17:19:22.783    {"level":"error","error":"with hash: a451156fe642c5f425af9bc1818ae423307789be0a4c581d26621f7fc698a419: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: tx (A451156FE642C5F425AF9BC1818AE423307789BE0A4C581D26621F7FC698A419) not found: error encountered while querying for tx"}

@okdas
Member Author

okdas commented Aug 17, 2024

To investigate. Given the nature of RelayMiner we need it to try to recover first.

RelayMiner stops on:
{"level":"error","work_name":"goPublishEvents","error":"eventsqueryclient connection closed","message":"on retry: 1"}

@Olshansk
Member

@okdas This is related to the observable, so I think we may be reaching a place where:

  1. A deadlock happens (or something mutex related)
  2. The observable is blocked on events (either empty or too many)

Do you mind creating a dedicated ticket for your comment here for @bryanchriswhite to tackle?

Projects
Status: 🏗 In progress
Development

No branches or pull requests

2 participants