[Bug] Off-chain actors become unresponsive and consume CPU #640

Open
7 tasks
okdas opened this issue Jun 28, 2024 · 2 comments

okdas commented Jun 28, 2024

Objective

Investigate and resolve intermittent high CPU usage in AppGate (including Gateway) and RelayMiner causing unresponsiveness and E2E test disruptions.

Origin Document

This issue has been observed during E2E testing, where AppGate and RelayMiner occasionally become unresponsive and consume 100% of their allocated CPU resources. This behavior frequently disrupts our E2E tests.

Related to #621

The pprof snapshots are included in the comment below.

Goals

  • Identify the root cause of the high CPU usage in AppGate and RelayMiner
  • Implement a solution to prevent or mitigate the unresponsiveness issue
  • Improve the stability and reliability of our E2E testing environment

Deliverables

  • Evaluate the pprof snapshots
  • If necessary, add more debug logging output and metrics to help catch the issue
  • Merge in a fix and monitor for this behavior in the future

General deliverables

  • Comments: Add/update TODOs and comments alongside the source code so it is easier to follow.
  • Testing: Add new tests (unit and/or E2E) to the test suite.
  • Makefile: Add new targets to the Makefile to make the new functionality easier to use.
  • Documentation: Update architectural or development READMEs; use mermaid diagrams where appropriate.

Creator: @okdas
Co-Owners: @red-0ne

@okdas okdas added the bug Something isn't working label Jun 28, 2024
@okdas okdas added this to the Shannon Beta TestNet Launch milestone Jun 28, 2024
okdas commented Jun 28, 2024

A zip file with pprof snapshots from an appgate that experienced this issue: appgate.zip

Each pprof file can be opened like this:

go tool pprof -http=:3333 block_profile.pprof

okdas commented Jul 12, 2024

I just observed a DevNet that was not handling relays but was also not using all of its available CPU. That suggests the two symptoms are unrelated. I'll keep this ticket for the CPU consumption and open a separate one for the hung relays.
