Some strange things while OOM #283

Open
wwng2333 opened this issue Nov 13, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@wwng2333
Contributor

Version of hub/agent: 0.8.0
1. My machine was OOM for about 1 hour 30 minutes, and more than 100 alert emails were sent during that time. Maybe we need a limit?
image

2. The reported disk I/O speed is abnormal while OOM, which may be caused by an incorrect kernel state.
image

wwng2333 changed the title from "Some strange thigns while OOM" to "Some strange things while OOM" on Nov 13, 2024
@henrygd
Owner

henrygd commented Nov 13, 2024

Was the machine running the hub OOM or was it a remote agent?

  1. This should not happen. If it was fully down for the whole period, only one status alert should be triggered. I'll look into it to see if there's a bug.

  2. I'll add a check for this. If values are that extreme then we can assume that something is wrong and reset the stats.
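
A minimal sketch of the kind of check described in point 2, assuming the agent derives disk I/O speed from cumulative byte counters. The class name `DiskIoTracker` and the 50 GiB/s ceiling are hypothetical, chosen only for illustration; this is not the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ceiling: anything above this is treated as a bogus reading.
MAX_PLAUSIBLE_BYTES_PER_SEC = 50 * 1024**3  # 50 GiB/s, an assumed threshold


@dataclass
class DiskIoTracker:
    """Turns cumulative byte counters into a per-second rate."""
    prev_bytes: Optional[int] = None
    prev_time: Optional[float] = None

    def update(self, total_bytes: int, now: float) -> Optional[float]:
        """Return bytes/sec since the last sample, or None if unusable."""
        prev_bytes, prev_time = self.prev_bytes, self.prev_time
        # Always store the new sample so the next call has a fresh baseline.
        self.prev_bytes, self.prev_time = total_bytes, now
        if prev_bytes is None or prev_time is None:
            return None  # first sample: nothing to compare against
        elapsed = now - prev_time
        delta = total_bytes - prev_bytes
        if elapsed <= 0 or delta < 0:
            return None  # clock went backwards or the counter was reset
        rate = delta / elapsed
        if rate > MAX_PLAUSIBLE_BYTES_PER_SEC:
            return None  # implausible spike: drop it instead of reporting it
        return rate
```

Because the new sample always replaces the stored baseline, a single bad reading is discarded and the next interval is measured from a clean starting point, which is one way to "reset the stats".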

@wwng2333
Contributor Author

Was the machine running the hub OOM or was it a remote agent?

The machine was only running the agent; the hub runs on another machine.

  1. This should not happen. If it was fully down for the whole period, only one status alert should be triggered. I'll look into it to see if there's a bug.

Yep, I got 120+ emails reporting that the machine was down, but none saying it was up. That's why it felt strange.
image

  2. I'll add a check for this. If values are that extreme then we can assume that something is wrong and reset the stats.

OK, thank you sir.

henrygd added the bug label on Nov 14, 2024
@henrygd
Owner

henrygd commented Nov 14, 2024

Thank you. Can you please tell me if the notifications were all sent at the same time? Or were they spaced out throughout the downtime?

@wwng2333
Contributor Author

Thank you. Can you please tell me if the notifications were all sent at the same time? Or were they spaced out throughout the downtime?

Here is the SMTP record:
image

@wwng2333
Contributor Author

wwng2333 commented Nov 17, 2024

I ran into the same situation (lots of emails about the machine being down) on another VM. The problem may be caused by the bad network connection between the US and China. The VM was running normally without any OOM condition, and the agent was running in Docker.
After I paused and unpaused the agent, it worked again.
I received about 13 emails for that.
image

Update: Issue resolved. The problem was caused by the network. I use a hostname to reach that host, and IPv6 was added to the host last night. I added the AAAA record for it, but I forgot to allow port 45876 through the firewall over IPv6; IPv4 works fine. After the DNS update, the hub tried to connect to my agent via IPv6 and was not able to connect. After I allowed the port in the firewall, it recovered automatically.
Question: If the hostname points to both an IPv4 and an IPv6 address, should the hub try connecting over both IPv4 and IPv6?

@henrygd
Owner

henrygd commented Nov 17, 2024

I think by default it should fall back to IPv4 if IPv6 doesn't work, but I need to look further into it. Maybe there's a conflict with how we're handling the errors.
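
For illustration only, the fallback behavior in question can be sketched as trying every address the resolver returns, in order, until one connects. This is a Python sketch under stated assumptions, not the hub's code: `resolve`, `tcp_connect`, and `connect_with_fallback` are hypothetical names, and the 5-second timeout is an arbitrary choice. A per-attempt timeout matters here because a firewall that silently drops packets causes a hang rather than an immediate refusal.

```python
import socket
from typing import Callable, List, Tuple

Address = Tuple[int, tuple]  # (address family, sockaddr)


def resolve(host: str, port: int) -> List[Address]:
    """Return every TCP address the resolver offers for host (A and AAAA)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(family, sockaddr) for family, _, _, _, sockaddr in infos]


def tcp_connect(address: Address, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection to a single resolved address."""
    family, sockaddr = address
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()  # do not leak the socket when the attempt fails
        raise
    return sock


def connect_with_fallback(addresses: List[Address],
                          connect: Callable = tcp_connect):
    """Try each address in order and return the first successful connection."""
    last_error = None
    for address in addresses:
        try:
            return connect(address)
        except OSError as err:
            last_error = err  # remember the failure and try the next address
    raise last_error or OSError("no addresses to try")
```

A call such as `connect_with_fallback(resolve("agent.example.com", 45876))` (hostname hypothetical) would move on to the IPv4 address when the IPv6 attempt times out. Python's standard `socket.create_connection` already iterates over the resolved addresses in a similar way.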

I'm very sorry about the emails. This definitely looks like a bug, but I haven't had time to work on it. I should have time in the next few days. I want to redo the status alerts anyway to allow specifying a time period like the other alerts.
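
As a rough sketch of what a status alert with a time period could look like, the following tracks state transitions so that each outage produces at most one "down" notice and one "up" notice. The class `StatusAlerter` and the `min_down_seconds` setting are hypothetical names for illustration; the actual redesign may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusAlerter:
    """Emits at most one 'down' and one 'up' notice per outage."""
    min_down_seconds: float          # hypothetical "time period" setting
    down_since: Optional[float] = None
    down_alert_sent: bool = False

    def observe(self, is_up: bool, now: float) -> Optional[str]:
        """Record one status check; return "down", "up", or None."""
        if is_up:
            was_alerted = self.down_alert_sent
            self.down_since = None
            self.down_alert_sent = False
            # Only announce recovery if the outage itself was announced.
            return "up" if was_alerted else None
        if self.down_since is None:
            self.down_since = now  # the outage starts here
        if self.down_alert_sent:
            return None  # already notified for this outage
        if now - self.down_since >= self.min_down_seconds:
            self.down_alert_sent = True
            return "down"
        return None
```

With this shape, repeated failed checks during a 90-minute outage yield a single "down" result, and short blips below the configured period yield nothing at all.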
