Very slow /api/v1/jobs endpoint after upgrading to 0.50.0 #2987
Comments
Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!
Probably there is a missing index:
Query runs 4x faster than before, but it still takes 3 minutes for each job:
Is there any need to join dataset_facets, input & output versions only to show the list of run names and the timestamp + state of the latest run within each job?
We're experiencing the same slowness as well - our preliminary analysis seemed to second the above
We're facing the same slowness. The reason appears to be that we migrated from version 0.47.0 to 0.50.0, keeping the same database. We tested a testing deployment using a brand-new RDS, and the performance was significantly better.
I tried starting from a fresh database, but after ingesting the same amount of lineage data, the slowness appeared again. I had to go back to 0.49.0 on a fresh database to fix that (because I couldn't figure out how to downgrade the Flyway migrations).
We tested deploying a new RDS Postgres database with the latest Marquez release. It worked initially, but once it reached about 6k jobs it became slow again. All other queries not related to jobs are pretty fast, though. I wonder if it is the query format that is causing the performance issues.
When it comes to the JobResource.java class, I compared versions 0.47 and 0.50 and got some interesting findings. The new code always loads the jobs plus their runs (e.g., via findAllWithRun and the defaulting of lastRunStates). That extra join/query fetches a lot more data when you have thousands of jobs, as in my case. Maybe it would be faster if we could avoid loading runs inline, or make the run fetching optional or paginated: e.g., limit the run query, create indexes for filtering, or remove the extra joins unless explicitly requested.
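To make the "optional/paginated run fetching" idea concrete, here is a minimal sketch in Python with the standard-library sqlite3 module. It is not Marquez code: the `jobs`/`runs` tables, their columns, and the `list_jobs` helper are all hypothetical stand-ins, and SQLite is used only so the example runs anywhere. The point is that the job page is selected first, and the latest run is looked up only for the jobs on that page, and only when the caller asks for it.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (uuid TEXT PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE runs (
            uuid       TEXT PRIMARY KEY,
            job_uuid   TEXT NOT NULL REFERENCES jobs(uuid),
            state      TEXT NOT NULL,
            started_at TEXT NOT NULL
        );
        -- Lets the "latest run for this job" lookup avoid scanning all runs.
        CREATE INDEX runs_job_started ON runs (job_uuid, started_at DESC);
        """
    )
    return conn


def seed_demo_data(conn):
    """Insert three jobs; job_b deliberately has no runs."""
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?)",
        [("j1", "job_a"), ("j2", "job_b"), ("j3", "job_c")],
    )
    conn.executemany(
        "INSERT INTO runs VALUES (?, ?, ?, ?)",
        [
            ("r1", "j1", "COMPLETED", "2025-01-01T00:00:00Z"),
            ("r2", "j1", "FAILED", "2025-01-02T00:00:00Z"),
            ("r3", "j3", "RUNNING", "2025-01-03T00:00:00Z"),
        ],
    )


def list_jobs(conn, limit, offset, include_latest_run=False):
    """Return one page of jobs; fetch the latest run only on request."""
    rows = conn.execute(
        "SELECT uuid, name FROM jobs ORDER BY name LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    page = [{"uuid": uuid, "name": name} for uuid, name in rows]
    if not include_latest_run:
        return page
    # At most `limit` small indexed lookups, no joins against facet tables.
    for job in page:
        run = conn.execute(
            "SELECT state, started_at FROM runs "
            "WHERE job_uuid = ? ORDER BY started_at DESC LIMIT 1",
            (job["uuid"],),
        ).fetchone()
        job["latest_run"] = (
            {"state": run[0], "started_at": run[1]} if run else None
        )
    return page


if __name__ == "__main__":
    db = make_demo_db()
    seed_demo_data(db)
    for job in list_jobs(db, limit=2, offset=0, include_latest_run=True):
        print(job)
```

The per-job lookup is bounded by the page size, so the cost of the listing no longer grows with the total number of jobs or runs. Whether the real endpoint can be restructured this way depends on what the web UI actually needs from each job.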
Looking at the code again, I don't believe there's a need to let the |
Hi.
I've rolled out a new instance of the API + web + db containers on a host using ./docker/up.sh, and set up several Spark sessions & Airflow instances to send events to the API.
Opening web pages like /datasets or /events is fast enough (~1s to open), but /jobs and the list of job runs on the main page take tens of minutes to load.
This is caused by a slow SELECT query in the database, which takes about 7 minutes for each job:
Here is an EXPLAIN
See graphics representation.
The dataset_facets table is ~22 GB, so a sequential scan is very inefficient here.
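As a small illustration of why the sequential scan matters, the sketch below compares the query plan for a filtered lookup before and after adding an index. It uses SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3 module, not Postgres, so the plan text differs from the EXPLAIN output in this issue. The reduced `dataset_facets` schema, the `run_uuid` filter column, and the index name are assumptions made for the example, not the actual Marquez schema or the index the project needs.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable plan step is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


def plans_before_and_after_index():
    """Show how one index changes a full-table scan into an index search."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, heavily reduced version of the table.
    conn.execute(
        "CREATE TABLE dataset_facets ("
        " uuid TEXT PRIMARY KEY, run_uuid TEXT, name TEXT, facet TEXT)"
    )
    sql = "SELECT name, facet FROM dataset_facets WHERE run_uuid = ?"

    before = query_plan(conn, sql, ("some-run",))
    # Hypothetical index on the filter column.
    conn.execute(
        "CREATE INDEX idx_dataset_facets_run_uuid"
        " ON dataset_facets (run_uuid)"
    )
    after = query_plan(conn, sql, ("some-run",))
    return before, after


if __name__ == "__main__":
    before, after = plans_before_and_after_index()
    print("before:", before)
    print("after: ", after)
```

Without the index the plan reports a scan of the whole table; with it, the plan reports a search using `idx_dataset_facets_run_uuid`. On a 22 GB table the same difference in Postgres would show up as a Seq Scan versus an Index Scan node in EXPLAIN, assuming the planner picks the index.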