[BUG] Investigate memory leak in single-binary deployment #3991
Comments
Look at the prometheus metrics. Educated guess from Dan: every time a task runs, prometheus metrics in propeller might be contributing to memory usage.
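One way to test this guess is to count how many series each metric family on the propeller metrics endpoint holds and watch whether the counts grow as tasks run. Below is a minimal Go sketch, assuming the process exposes the Prometheus text format at `localhost:10254/metrics`; the port, path, and output limit are assumptions, not from this issue.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Hypothetical endpoint: adjust to wherever your deployment exposes the
	// Prometheus text format (port and path are assumptions, not from the issue).
	resp, err := http.Get("http://localhost:10254/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Count how many series (label combinations) each family holds.
	type stat struct {
		name   string
		series int
	}
	stats := make([]stat, 0, len(families))
	for name, mf := range families {
		stats = append(stats, stat{name: name, series: len(mf.GetMetric())})
	}
	sort.Slice(stats, func(i, j int) bool { return stats[i].series > stats[j].series })

	// Families whose series count keeps growing with every workflow/task run
	// are the likely culprits (e.g. anything labelled by execution or task id).
	limit := 20
	if len(stats) < limit {
		limit = len(stats)
	}
	for _, s := range stats[:limit] {
		fmt.Printf("%6d  %s\n", s.series, s.name)
	}
}
```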
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
This looks a lot like what I'm seeing as well. See https://flyte-org.slack.com/archives/CP2HDHKE1/p1723819039330009 for more discussion. Is there a way to disable prometheus metrics to test this? I haven't found a configuration option yet. Would deploying via flyte-core help here, since this issue is specifically for the single-binary deployment?
After some investigation, it turns out that accumulating metrics data is the expected behavior of the prometheus golang client, as per prometheus/client_golang#920. We emit high-cardinality metrics in flytepropeller; as an example, you can see here the metrics emitted after running a few hundred workflow executions. Note how we maintain a lot of metrics per execution id. This is the symptom described in the prometheus client GitHub issue. One could argue that emitting metrics per execution id is the wrong granularity, but that's exactly what we do in the case of single-binary.
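To make the symptom concrete, here is a small, self-contained Go sketch of the behavior described in prometheus/client_golang#920. The metric name and label are invented for illustration and are not the actual flytepropeller metrics.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Hypothetical counter labelled by execution id, mirroring the shape of
	// the per-execution metrics described above (not an actual Flyte metric).
	perExec := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "demo_node_duration_seconds_total",
			Help: "Illustration of a counter labelled by execution id.",
		},
		[]string{"exec_id"},
	)
	reg.MustRegister(perExec)

	// Every distinct execution id creates a new child series. client_golang
	// keeps each child for the lifetime of the process (prometheus/client_golang#920),
	// so memory grows with the number of executions ever seen, not with the
	// number currently running.
	for i := 0; i < 100_000; i++ {
		perExec.WithLabelValues(fmt.Sprintf("exec-%d", i)).Inc()
	}

	mfs, _ := reg.Gather()
	for _, mf := range mfs {
		fmt.Printf("%s holds %d series\n", mf.GetName(), len(mf.GetMetric()))
	}
}
```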
Knowing what we know now, the set of default metrics emitted by flytepropeller in the default case does not cause the issue. A similar argument can be made for flyteadmin. In other words, this is a single-binary-only issue. Here's the PR to remove that label from single-binary metrics: #5704
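For reference, the two general ways to keep that cardinality bounded are sketched below: drop the per-execution label in favor of a coarse dimension, or explicitly delete a series once its execution terminates. This is a hedged sketch of the general mitigation, not the actual change in #5704; all metric and label names here are hypothetical.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	reg := prometheus.NewRegistry()

	// (a) Drop the high-cardinality label and aggregate on a bounded
	// dimension instead (project/domain here, purely as an example).
	perProject := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "demo_node_duration_seconds_total",
			Help: "Node duration aggregated per project and domain.",
		},
		[]string{"project", "domain"},
	)
	reg.MustRegister(perProject)
	perProject.WithLabelValues("flytesnacks", "development").Inc()

	// (b) If a per-execution label really is needed, delete the child series
	// once the execution reaches a terminal state so it does not live forever.
	perExec := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "demo_per_execution_seconds_total",
			Help: "Per-execution counter with explicit cleanup.",
		},
		[]string{"exec_id"},
	)
	reg.MustRegister(perExec)
	perExec.WithLabelValues("exec-42").Inc()
	// ... later, when exec-42 completes:
	perExec.DeleteLabelValues("exec-42")
}
```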
Related to #5606
This has solved the issue on a shorter time horizon. However, I still see the issue on a longer time horizon.
Describe the bug
A user in the OSS Slack reported that single-binary pods keep getting killed due to OOM errors. Here's a graph of memory consumption in this case (different colors refer to different pods):
They are running stock Flyte v1.8.1 release.
Expected behavior
Memory consumption should remain relatively constant.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?