[BUG] Performance degradation with datacatalog.tags table #4583

Open
2 tasks done
andrew-freenome opened this issue Dec 12, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@andrew-freenome

andrew-freenome commented Dec 12, 2023

Describe the bug

I am seeing a performance bottleneck in the Flyte database. With my workload, the query SELECT * FROM "tags" WHERE ("tags"."artifact_id","tags"."dataset_uuid") IN (($1,$2)) is executed very frequently against the datacatalog database: 380,000 times in the last day. The workload has ~380k tasks, so the number of queries makes sense. On average, each execution takes 2 seconds and returns 0 rows. I believe this query is issued as part of the task cache lookup; I have caching enabled, but I expect every lookup to be a cache miss. I am using Datacatalog v1.0.1 and FlytePropeller v1.1.47. The database has 32 vCPUs, 64 GB of memory, and 200 GB of storage, of which ~50 GB is used.
[Image: Screenshot 2023-12-05 at 12 36 55 PM]
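
As a rough back-of-the-envelope check on the figures reported above, the sketch below converts the query count and average latency into cumulative database time per day. It assumes the queries are spread evenly across the day and that the roughly 1000x speedup seen in local testing would carry over to production; neither assumption is confirmed in this thread, and the function name is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_query_load(queries_per_day, avg_seconds_per_query):
    """Summarize cumulative query time for one day of traffic.

    Illustrative helper, not part of Flyte or Datacatalog.
    """
    total_seconds = queries_per_day * avg_seconds_per_query
    return {
        # Total time the database spends on this query, in hours.
        "total_hours": total_seconds / 3600,
        # Average number of these queries in flight at any moment,
        # assuming an even spread over the day.
        "avg_concurrent_queries": total_seconds / SECONDS_PER_DAY,
    }


# Figures from the report: 380,000 executions/day at ~2 s each.
current = daily_query_load(380_000, 2.0)

# Hypothetical: the same traffic if each query were 1000x faster.
indexed = daily_query_load(380_000, 2.0 / 1000)

print(f"without index: {current['total_hours']:.1f} h/day, "
      f"~{current['avg_concurrent_queries']:.1f} queries in flight on average")
print(f"with index (assumed 1000x): {indexed['total_hours'] * 60:.1f} min/day")
```

Under those assumptions, a query that returns no rows accounts for roughly nine queries in flight at all times, which is consistent with the database showing up as the bottleneck.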

Expected behavior

I expect that the database would not be a performance bottleneck, and that the datacatalog.tags table is properly indexed in order to support the queries that are executed against it.

Additional context to reproduce

If I add an index (CREATE INDEX tags_dataset_uuid_artifact_id_idx ON tags (dataset_uuid, artifact_id);), the query gets significantly faster (1000x in my local testing). Without the index, the Postgres planner varies in how it executes the query. The attached screenshots are from a different DB instance under slightly lighter load; I'm including them to show the three different plans I've seen the planner choose. (All three are without the index.)
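
The effect of the proposed index on the planner can be illustrated with a small self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in, not PostgreSQL, so the plan text and timings will differ from what the real database reports. The table schema is a simplified, hypothetical version of tags (only the two lookup columns plus a tag_name key), and the lookup is written as two equality predicates, which is an assumed equivalent of the single-pair row-value IN used in the original query.

```python
import sqlite3


def query_plan(conn, sql, params):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema; the real datacatalog table has more columns.
conn.execute(
    """
    CREATE TABLE tags (
        tag_name     TEXT PRIMARY KEY,
        artifact_id  TEXT,
        dataset_uuid TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [(f"tag-{i}", f"artifact-{i}", f"dataset-{i % 10}") for i in range(1000)],
)

# Equality form of the cache lookup; the values below match no row,
# mirroring the all-cache-miss workload described in the report.
lookup = "SELECT * FROM tags WHERE artifact_id = ? AND dataset_uuid = ?"
miss = ("no-such-artifact", "no-such-dataset")

plan_before = query_plan(conn, lookup, miss)

# The index proposed in the report, with the same name and column order.
conn.execute(
    "CREATE INDEX tags_dataset_uuid_artifact_id_idx "
    "ON tags (dataset_uuid, artifact_id)"
)

plan_after = query_plan(conn, lookup, miss)
result = conn.execute(lookup, miss).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
print("rows returned:", len(result))
```

Before the index exists, the plan is a scan of the whole table even though the lookup returns nothing; afterwards the plan names the new index. On PostgreSQL, the equivalent check would be to run EXPLAIN on the original query before and after creating the index.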

Context: https://flyte-org.slack.com/archives/CP2HDHKE1/p1701797879836439

Screenshots

[Images: Screenshot 2023-12-06 at 2 03 33 PM, Screenshot 2023-12-06 at 2 03 45 PM, Screenshot 2023-12-06 at 2 03 57 PM]

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@andrew-freenome added the bug and untriaged labels Dec 12, 2023

welcome bot commented Dec 12, 2023

Thank you for opening your first issue here! 🛠


dosubot bot commented Dec 12, 2023

🤖

Hello @andrew-freenome! I'm Dosu, a friendly bot here to assist you with your issues, answer your queries and help you contribute to the Flyte community. While we're waiting for a human maintainer, I'll be working on your issue about the performance degradation with the datacatalog.tags table. I'll get back to you with a comprehensive answer shortly. Hang tight!

@eapolinario removed the untriaged label Dec 21, 2023
@eapolinario
Contributor

@andrew-freenome, we're investing in a much simpler and more performant implementation of datacatalog. It's expected to land in Q1 2024.

@andrew-freenome
Author

Would an MR that creates the missing index be welcome? Or, since the alternate implementation you mentioned will obviate the need for it, would it not get merged?

@annadcunningham

Is there any update on this issue? A fix was predicted for Q1 2024 but we haven't seen any updates in the release notes about datacatalog. Thanks!

@eapolinario
Contributor

@annadcunningham, unfortunately this project had to be de-prioritized.

@andrew-freenome, would you be willing to contribute the change that creates the missing index?
