[BUG] Investigate memory leak in single-binary deployment #3991
Comments
Look at the prometheus metrics. Educated guess from Dan: every time a task runs, prometheus metrics in propeller might be contributing to memory usage.
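One way to test this guess is to count how many series each metric family on the propeller metrics endpoint holds and watch whether the counts grow as tasks run. Below is a minimal Go sketch, assuming the process exposes the Prometheus text format at `localhost:10254/metrics`; the port, path, and output limit are assumptions, not from this issue.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Hypothetical endpoint: adjust to wherever your deployment exposes the
	// Prometheus text format (port and path are assumptions, not from the issue).
	resp, err := http.Get("http://localhost:10254/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Count how many series (label combinations) each family holds.
	type stat struct {
		name   string
		series int
	}
	stats := make([]stat, 0, len(families))
	for name, mf := range families {
		stats = append(stats, stat{name: name, series: len(mf.GetMetric())})
	}
	sort.Slice(stats, func(i, j int) bool { return stats[i].series > stats[j].series })

	// Families whose series count keeps growing with every workflow/task run
	// are the likely culprits (e.g. anything labelled by execution or task id).
	limit := 20
	if len(stats) < limit {
		limit = len(stats)
	}
	for _, s := range stats[:limit] {
		fmt.Printf("%6d  %s\n", s.series, s.name)
	}
}
```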
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
This looks a lot like what I'm seeing as well. See https://flyte-org.slack.com/archives/CP2HDHKE1/p1723819039330009 for more discussion. Is there a way to disable prometheus metrics to test this? I haven't found a configuration option yet. Would deploying via flyte-core help here, since this issue is specifically for the single-binary deployment?
After some investigation, it turns out that accumulating metrics data is the expected behavior of the prometheus golang client, as per prometheus/client_golang#920. We emit high-cardinality metrics in flytepropeller; as an example, you can see here the metrics emitted after running a few hundred workflow executions. Note how we maintain a lot of metrics per execution id. This is the symptom described in the prometheus client GitHub issue. One could argue that emitting metrics per execution id is the wrong granularity, but that's exactly what we do in the case of single-binary.
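To make the symptom concrete, here is a small, self-contained Go sketch of the behavior described in prometheus/client_golang#920. The metric name and label are invented for illustration and are not the actual flytepropeller metrics.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Hypothetical counter labelled by execution id, mirroring the shape of
	// the per-execution metrics described above (not an actual Flyte metric).
	perExec := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "demo_node_duration_seconds_total",
			Help: "Illustration of a counter labelled by execution id.",
		},
		[]string{"exec_id"},
	)
	reg.MustRegister(perExec)

	// Every distinct execution id creates a new child series. client_golang
	// keeps each child for the lifetime of the process (prometheus/client_golang#920),
	// so memory grows with the number of executions ever seen, not with the
	// number currently running.
	for i := 0; i < 100_000; i++ {
		perExec.WithLabelValues(fmt.Sprintf("exec-%d", i)).Inc()
	}

	mfs, _ := reg.Gather()
	for _, mf := range mfs {
		fmt.Printf("%s holds %d series\n", mf.GetName(), len(mf.GetMetric()))
	}
}
```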
Knowing what we know now, the set of default metrics emitted by flytepropeller in the default case does not cause the issue. A similar argument can be made for flyteadmin. In other words, this is a single-binary-only issue. Here's the PR to remove that label from single-binary metrics: #5704
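For reference, the two general ways to keep that cardinality bounded are sketched below: drop the per-execution label in favor of a coarse dimension, or explicitly delete a series once its execution terminates. This is a hedged sketch of the general mitigation, not the actual change in #5704; all metric and label names here are hypothetical.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	reg := prometheus.NewRegistry()

	// (a) Drop the high-cardinality label and aggregate on a bounded
	// dimension instead (project/domain here, purely as an example).
	perProject := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "demo_node_duration_seconds_total",
			Help: "Node duration aggregated per project and domain.",
		},
		[]string{"project", "domain"},
	)
	reg.MustRegister(perProject)
	perProject.WithLabelValues("flytesnacks", "development").Inc()

	// (b) If a per-execution label really is needed, delete the child series
	// once the execution reaches a terminal state so it does not live forever.
	perExec := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "demo_per_execution_seconds_total",
			Help: "Per-execution counter with explicit cleanup.",
		},
		[]string{"exec_id"},
	)
	reg.MustRegister(perExec)
	perExec.WithLabelValues("exec-42").Inc()
	// ... later, when exec-42 completes:
	perExec.DeleteLabelValues("exec-42")
}
```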
Related to #5606
This has solved the issue on a shorter time horizon. However, I still see the issue on a longer time horizon.
Describe the bug
A user in the OSS Slack reported that single-binary pods keep getting killed due to OOM errors. Here's a graph of memory consumption in this case (different colors refer to different pods):
They are running stock Flyte v1.8.1 release.
Expected behavior
Memory consumption should remain relatively constant.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?