[BUG] OOM (out of memory) recurring every 8-9 days #579
Comments
Hi @interfan7, how many metrics (i.e. whisper files) does this instance serve?
PS: you can enable the pprof interface in the config, then you can take heap dumps and investigate them with the go pprof command.
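A minimal sketch of what that could look like, assuming go-carbon's `[pprof]` section and its example listen address (adjust to your setup):

```toml
# go-carbon.conf — enable the built-in pprof HTTP listener
# (listen address is an assumption; use whatever fits your host)
[pprof]
listen = "localhost:7007"
enabled = true
```

A heap dump could then be taken with something like `go tool pprof http://localhost:7007/debug/pprof/heap`.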
@deniszh I can count how many were updated or accessed in the last 24 hours, if that might help?
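A count like that could be taken with something along these lines (the whisper path is an assumption):

```sh
# Count whisper files modified in the last 24 hours
# (path is an assumption; point it at your actual whisper root)
find /var/lib/graphite/whisper -type f -name '*.wsp' -mtime -1 | wc -l
```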
@deniszh
That's not much. I can check our prod memory consumption to compare. OTOH we're using the trie index and trigram is disabled, iirc.
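For context, a sketch of the corresponding go-carbon `[carbonserver]` switches (illustrative values, not a recommendation):

```toml
# go-carbon.conf — index selection in the carbonserver section
# (illustrative values; check your version's defaults before changing)
[carbonserver]
trie-index = true      # use the trie-based file index
trigram-index = false  # disable the trigram index
```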
@deniszh Would you mind telling whether anything interesting/suspicious is observable from it? Once we've configured it to be the target of our whole prod, it takes
@interfan7: that's a memory snapshot, and one snapshot doesn't give you much info.
Defaults are less strict, but your numbers are unusually high.
How do you see that? There are various ways to measure a service's/process's memory occupation.
I think when we set up the node, the Grafana users complained that they lacked data or metrics in the results, so changing this value seemed to resolve it. However, we just set a very high value without gradual try-and-see cycles. I'll take heap profiles at 2 more points in time between the service's start and its "end" (i.e. somewhat before the OOM). I've read that pprof is capable of comparing profiles.
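Comparing two heap profiles could look roughly like this (file names are placeholders for saved profiles):

```sh
# Diff a later heap profile against an earlier baseline
# (file names are placeholders)
go tool pprof -base heap_day1.pb.gz heap_day7.pb.gz
```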
Hi @interfan7, |
@flucrezia I decreased the 2 params mentioned above about 2 days ago, and I want to see whether the memory will grow to 100GB+ again. If I conclude that reducing those params doesn't resolve the issue, at least not for a 128GB machine, then I may try your suggestion 🙏🏻
Should be fixed in v0.18.0 |
Describe the bug
When the service is killed by the OS due to OOM, systemd automatically starts it again.
Then, the memory consumption on the machine steadily increases for 8-9 days until the next OOM.
Logs
I've not noticed anything particularly notable in the logs. The OOM messages appear in the system logs (dmesg etc.).
I'll be happy to provide specific greps/messages; otherwise the log is huge.
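The kernel-side OOM evidence can usually be pulled with something like:

```sh
# Look for OOM-killer activity in the kernel log
dmesg -T | grep -i -E 'out of memory|oom-killer'
# or, via the journal
journalctl -k | grep -i oom
```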
Go-carbon Configuration:
go-carbon.conf:
storage-schemas.conf:
storage-aggregation.conf:
I wonder whether the fields `max-size`, `max-metrics-globbed` or `max-metrics-rendered` have to do with the issue.
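For reference, these settings live in different sections of go-carbon.conf; a sketch with placeholder values (not our actual config, and not a recommendation):

```toml
# go-carbon.conf — the knobs mentioned above, with placeholder values
[cache]
max-size = 1000000            # cap on datapoints held in the write cache

[carbonserver]
max-metrics-globbed = 30000   # cap on metrics matched by a single glob/find
max-metrics-rendered = 1000   # cap on metrics returned by a single render
```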
Additional context
A `carbonapi` service also runs on the same server. We have an identical dev server, but its `carbonapi` is almost never queried. Interestingly, we don't have that issue on the dev server, which suggests the issue has to do with queries.
Here is the memory usage graph for prod (left) and dev (right), side by side, for a period of 22 days:
In addition, the systemd status also indicates a considerable difference, although the prod service has been active for only about 1.5 days.
Dev:
Prod:
Although that should make sense, since there are almost zero queries on the dev server.
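For reference, the memory figure systemd reports can be checked with something like this (the unit name `go-carbon` is an assumption; adjust to the local service name):

```sh
# Show the unit status, including its current memory accounting
systemctl status go-carbon
systemctl show go-carbon -p MemoryCurrent
```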