Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow /api/v1/jobs endpoint after upgrading to 0.50.0 #2987

Open
dolfinus opened this issue Dec 12, 2024 · 8 comments
Open

Very slow /api/v1/jobs endpoint after upgrading to 0.50.0 #2987

dolfinus opened this issue Dec 12, 2024 · 8 comments

Comments

@dolfinus
Copy link
Contributor

dolfinus commented Dec 12, 2024

Hi.

I've rolled out new instance of API + web + db containers on a host using ./docker/up.sh, and set up several Spark sessions & Airflow instances to send events to API.
Opening Web pages like /datasets or /events are fast enough (~1s to open), but the /jobs and list of job runs on the main page takes tens of minutes to load.

This is caused by slow SELECT query in the database, which takes about ~7 minutes for each job:

TRACE [2024-12-12 08:25:49,175] org.postgresql.core.v3.QueryExecutorImpl:   simple execute, handler=org.postgresql.jdbc.PgStatement$StatementResultHandler@444b38b1, maxRows=0, fetchSize=0, flags=17
TRACE [2024-12-12 08:25:49,175] org.postgresql.core.v3.QueryExecutorImpl:  FE=> Parse(stmt=null,query="/* RunDao.findByLatestJob */ SELECT r.*, ra.args, f.facets,
jv.version AS job_version,
ri.input_versions, ro.output_versions, df.dataset_facets
FROM runs_view AS r
LEFT OUTER JOIN
(
    SELECT rf.run_uuid, JSON_AGG(rf.facet ORDER BY rf.lineage_event_time ASC) AS facets
    FROM run_facets_view rf
    GROUP BY rf.run_uuid
) AS f ON r.uuid=f.run_uuid
LEFT OUTER JOIN run_args AS ra ON ra.uuid = r.run_args_uuid
LEFT OUTER JOIN job_versions jv ON jv.uuid=r.job_version_uuid
LEFT OUTER JOIN (
    SELECT im.run_uuid, JSON_AGG(json_build_object('namespace', dv.namespace_name,
        'name', dv.dataset_name,
        'version', dv.version,
        'dataset_version_uuid', uuid)) AS input_versions
    FROM runs_input_mapping im
    INNER JOIN dataset_versions dv on im.dataset_version_uuid = dv.uuid
    GROUP BY im.run_uuid
) ri ON ri.run_uuid=r.uuid
LEFT OUTER JOIN (
    SELECT run_uuid, JSON_AGG(json_build_object('namespace', namespace_name,
                                                'name', dataset_name,
                                                'version', version,
                                                'dataset_version_uuid', uuid
                                                )) AS output_versions
    FROM dataset_versions
    GROUP BY run_uuid
) ro ON ro.run_uuid=r.uuid
LEFT OUTER JOIN (
    SELECT
        run_uuid,
        JSON_AGG(json_build_object(
            'dataset_version_uuid', dataset_version_uuid,
            'name', name,
            'type', type,
            'facet', facet
        ) ORDER BY created_at ASC) as dataset_facets
    FROM dataset_facets_view
    WHERE (type ILIKE 'output' OR type ILIKE 'input')
    GROUP BY run_uuid
) AS df ON r.uuid = df.run_uuid
WHERE r.uuid IN (
  SELECT r.uuid FROM runs_view r
  INNER JOIN jobs_view j ON j.namespace_name=r.namespace_name AND j.name=r.job_name
  WHERE j.namespace_name=$1 AND (j.name=$2 OR j.name=ANY(j.aliases))
)
ORDER BY transitioned_at DESC, started_at DESC
LIMIT $3 OFFSET $4
",oids={1043,1043,23,23})

Here is an EXPLAIN

                                                                                                                                                                              QUERY PLAN                                                   
                                                                                                                           
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3775151.66..3775151.66 rows=1 width=652) (actual time=412291.322..412291.525 rows=1 loops=1)
   Buffers: shared hit=5346293 read=1675785, temp read=3523811 written=3523811
   I/O Timings: read=6257.602
   ->  Sort  (cost=3775151.66..3775151.69 rows=13 width=652) (actual time=410508.526..410508.728 rows=1 loops=1)
         Sort Key: r.transitioned_at DESC, r.started_at DESC
         Sort Method: top-N heapsort  Memory: 101kB
         Buffers: shared hit=5346293 read=1675785, temp read=3523811 written=3523811
         I/O Timings: read=6257.602
         ->  Hash Semi Join  (cost=3588224.19..3775151.60 rows=13 width=652) (actual time=145217.631..410501.532 rows=400 loops=1)
               Hash Cond: (r.uuid = r_1.uuid)
               Buffers: shared hit=5346293 read=1675785, temp read=3523811 written=3523811
               I/O Timings: read=6257.602
               ->  Hash Left Join  (cost=3586718.79..3767411.08 rows=2375226 width=822) (actual time=135998.815..410101.882 rows=2370295 loops=1)
                     Hash Cond: (r.uuid = df.run_uuid)
                     Buffers: shared hit=5345605 read=1675041, temp read=3523811 written=3523811
                     I/O Timings: read=6254.450
                     ->  Hash Left Join  (cost=2209892.58..2384349.89 rows=2375226 width=790) (actual time=127880.676..359439.191 rows=2370295 loops=1)
                           Hash Cond: (r.uuid = ro.run_uuid)
                           Buffers: shared hit=5345004 read=451236, temp read=2792364 written=2792364
                           I/O Timings: read=2700.512
                           ->  Hash Left Join  (cost=2095454.72..2263677.06 rows=2375226 width=758) (actual time=126016.827..288634.048 rows=2370295 loops=1)
                                 Hash Cond: (r.uuid = ri.run_uuid)
                                 Buffers: shared hit=5087798 read=451236, temp read=2066296 written=2066296
                                 I/O Timings: read=2700.512
                                 ->  Hash Left Join  (cost=2061253.44..2223240.80 rows=2375226 width=726) (actual time=125028.052..285765.204 rows=2370295 loops=1)
                                       Hash Cond: (r.job_version_uuid = jv.uuid)
                                       Buffers: shared hit=4928932 read=451236, temp read=2066296 written=2066296
                                       I/O Timings: read=2700.512
                                       ->  Hash Left Join  (cost=2060896.77..2216645.00 rows=2375226 width=710) (actual time=125025.614..284599.062 rows=2370295 loops=1)
                                             Hash Cond: (r.run_args_uuid = ra.uuid)
                                             Buffers: shared hit=4928716 read=451236, temp read=2066296 written=2066296
                                             I/O Timings: read=2700.512
                                             ->  Hash Left Join  (cost=2057304.83..2206817.65 rows=2375226 width=528) (actual time=124998.131..282991.226 rows=2370295 loops=1)
                                                   Hash Cond: (r.uuid = f.run_uuid)
                                                   Buffers: shared hit=4926464 read=451236, temp read=2066296 written=2066296
                                                   I/O Timings: read=2700.512
                                                   ->  Hash Left Join  (cost=942.55..144220.38 rows=2375226 width=496) (actual time=4.340..4013.423 rows=2370295 loops=1)
                                                         Hash Cond: (j.symlink_target_uuid = j_1.uuid)
                                                         Buffers: shared hit=731 read=107006
                                                         I/O Timings: read=688.355
                                                         ->  Hash Join  (cost=471.27..137514.14 rows=2375226 width=395) (actual time=2.249..3042.805 rows=2370295 loops=1)
                                                               Hash Cond: (r.job_uuid = j.uuid)
                                                               Buffers: shared hit=388 read=107006
                                                               I/O Timings: read=688.355
                                                               ->  Seq Scan on runs r  (cost=0.00..130803.26 rows=2375226 width=262) (actual time=0.007..1434.867 rows=2370295 loops=1)
                                                                     Buffers: shared hit=45 read=107006
                                                                     I/O Timings: read=688.355
                                                               ->  Hash  (cost=400.01..400.01 rows=5701 width=149) (actual time=2.207..2.212 rows=5103 loops=1)
                                                                     Buckets: 8192  Batches: 1  Memory Usage: 889kB
                                                                     Buffers: shared hit=343
                                                                     ->  Seq Scan on jobs j  (cost=0.00..400.01 rows=5701 width=149) (actual time=0.011..1.052 rows=5103 loops=1)
                                                                           Buffers: shared hit=343
                                                         ->  Hash  (cost=400.01..400.01 rows=5701 width=149) (actual time=2.064..2.067 rows=5103 loops=1)
                                                               Buckets: 8192  Batches: 1  Memory Usage: 958kB
                                                               Buffers: shared hit=343
                                                               ->  Seq Scan on jobs j_1  (cost=0.00..400.01 rows=5701 width=149) (actual time=0.010..0.973 rows=5103 loops=1)
                                                                     Filter: ((is_hidden IS FALSE) AND (symlink_target_uuid IS NULL))
                                                                     Buffers: shared hit=343
                                                   ->  Hash  (cost=2055071.46..2055071.46 rows=103266 width=48) (actual time=124990.710..124990.715 rows=2344707 loops=1)
                                                         Buckets: 131072 (originally 131072)  Batches: 256 (originally 1)  Memory Usage: 64541kB
                                                         Buffers: shared hit=4925733 read=344230, temp written=1282476
                                                         I/O Timings: read=2012.158
                                                         ->  Subquery Scan on f  (cost=0.44..2055071.46 rows=103266 width=48) (actual time=0.132..90842.612 rows=2344707 loops=1)
                                                               Buffers: shared hit=4925733 read=344230
                                                               I/O Timings: read=2012.158
                                                               ->  GroupAggregate  (cost=0.44..2054038.80 rows=103266 width=48) (actual time=0.129..90393.606 rows=2344707 loops=1)
                                                                     Group Key: run_facets.run_uuid
                                                                     Buffers: shared hit=4925733 read=344230
                                                                     I/O Timings: read=2012.158
                                                                     ->  Index Scan using run_facets_run_uuid_idx on run_facets  (cost=0.44..1948491.64 rows=20851267 width=490) (actual time=0.041..11344.015 rows=20986865 loops=1)
                                                                           Buffers: shared hit=4924823 read=344130
                                                                           I/O Timings: read=2011.548
                                             ->  Hash  (cost=2847.53..2847.53 rows=59553 width=198) (actual time=27.315..27.316 rows=59518 loops=1)
                                                   Buckets: 65536  Batches: 1  Memory Usage: 13933kB
                                                   Buffers: shared hit=2252
                                                   ->  Seq Scan on run_args ra  (cost=0.00..2847.53 rows=59553 width=198) (actual time=0.016..12.934 rows=59518 loops=1)
                                                         Buffers: shared hit=2252
                                       ->  Hash  (cost=278.52..278.52 rows=6252 width=32) (actual time=2.393..2.394 rows=6194 loops=1)
                                             Buckets: 8192  Batches: 1  Memory Usage: 452kB
                                             Buffers: shared hit=216
                                             ->  Seq Scan on job_versions jv  (cost=0.00..278.52 rows=6252 width=32) (actual time=0.017..1.286 rows=6194 loops=1)
                                                   Buffers: shared hit=216
                                 ->  Hash  (cost=33749.22..33749.22 rows=36165 width=48) (actual time=988.630..988.642 rows=58795 loops=1)
                                       Buckets: 65536  Batches: 1  Memory Usage: 52475kB
                                       Buffers: shared hit=158866
                                       ->  Subquery Scan on ri  (cost=0.85..33749.22 rows=36165 width=48) (actual time=0.206..938.661 rows=58795 loops=1)
                                             Buffers: shared hit=158866
                                             ->  GroupAggregate  (cost=0.85..33387.57 rows=36165 width=48) (actual time=0.201..932.194 rows=58795 loops=1)
                                                   Group Key: im.run_uuid
                                                   Buffers: shared hit=158866
                                                   ->  Nested Loop  (cost=0.85..31451.59 rows=197855 width=150) (actual time=0.142..302.706 rows=197521 loops=1)
                                                         Buffers: shared hit=158866
                                                         ->  Index Only Scan using runs_input_mapping_pkey on runs_input_mapping im  (cost=0.42..7899.56 rows=197855 width=32) (actual time=0.077..26.635 rows=197521 loops=1)
                                                               Heap Fetches: 7393
                                                               Buffers: shared hit=2394
                                                         ->  Memoize  (cost=0.43..1.82 rows=1 width=134) (actual time=0.001..0.001 rows=1 loops=197521)
                                                               Cache Key: im.dataset_version_uuid
                                                               Cache Mode: logical
                                                               Hits: 158403  Misses: 39118  Evictions: 0  Overflows: 0  Memory Usage: 10004kB
                                                               Buffers: shared hit=156472
                                                               ->  Index Scan using dataset_versions_pkey on dataset_versions dv  (cost=0.42..1.81 rows=1 width=134) (actual time=0.004..0.004 rows=1 loops=39118)
                                                                     Index Cond: (uuid = im.dataset_version_uuid)
                                                                     Buffers: shared hit=156472
                           ->  Hash  (cost=109813.38..109813.38 rows=369958 width=48) (actual time=1862.014..1862.019 rows=368533 loops=1)
                                 Buckets: 524288 (originally 524288)  Batches: 2 (originally 1)  Memory Usage: 61441kB
                                 Buffers: shared hit=257206, temp written=6541
                                 ->  Subquery Scan on ro  (cost=0.42..109813.38 rows=369958 width=48) (actual time=0.061..1646.642 rows=368534 loops=1)
                                       Buffers: shared hit=257206
                                       ->  GroupAggregate  (cost=0.42..106113.80 rows=369958 width=48) (actual time=0.056..1612.401 rows=368534 loops=1)
                                             Group Key: dataset_versions.run_uuid
                                             Buffers: shared hit=257206
                                             ->  Index Scan using dataset_versions_run_uuid on dataset_versions  (cost=0.42..98701.81 rows=371668 width=150) (actual time=0.017..265.834 rows=369996 loops=1)
                                                   Buffers: shared hit=257206
                     ->  Hash  (cost=1376610.02..1376610.02 rows=17295 width=48) (actual time=8118.023..8118.145 rows=210444 loops=1)
                           Buckets: 131072 (originally 32768)  Batches: 2 (originally 1)  Memory Usage: 64513kB
                           Buffers: shared hit=601 read=1223805, temp written=6830
                           I/O Timings: read=3553.938
                           ->  Subquery Scan on df  (cost=1345694.14..1376610.02 rows=17295 width=48) (actual time=6158.772..7749.246 rows=210444 loops=1)
                                 Buffers: shared hit=601 read=1223805
                                 I/O Timings: read=3553.938
                                 ->  GroupAggregate  (cost=1345694.14..1376437.07 rows=17295 width=48) (actual time=6158.759..7724.066 rows=210444 loops=1)
                                       Group Key: dataset_facets.run_uuid
                                       Buffers: shared hit=601 read=1223805
                                       I/O Timings: read=3553.938
                                       ->  Gather Merge  (cost=1345694.14..1374374.01 rows=246250 width=692) (actual time=6158.600..6247.382 rows=256134 loops=1)
                                             Workers Planned: 2
                                             Workers Launched: 2
                                             Buffers: shared hit=601 read=1223805
                                             I/O Timings: read=3553.938
                                             ->  Sort  (cost=1344694.12..1344950.63 rows=102604 width=692) (actual time=6118.767..6135.501 rows=85378 loops=3)
                                                   Sort Key: dataset_facets.run_uuid
                                                   Sort Method: quicksort  Memory: 47063kB
                                                   Buffers: shared hit=601 read=1223805
                                                   I/O Timings: read=3553.938
                                                   Worker 0:  Sort Method: quicksort  Memory: 47876kB
                                                   Worker 1:  Sort Method: quicksort  Memory: 45679kB
                                                   ->  Parallel Seq Scan on dataset_facets  (cost=0.00..1304591.01 rows=102604 width=692) (actual time=134.365..6040.771 rows=85378 loops=3)
                                                         Filter: (((type)::text ~~* 'output'::text) OR ((type)::text ~~* 'input'::text))
                                                         Rows Removed by Filter: 4199211
                                                         Buffers: shared hit=487 read=1223805
                                                         I/O Timings: read=3553.938
               ->  Hash  (cost=1505.24..1505.24 rows=13 width=16) (actual time=10.053..10.070 rows=400 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 27kB
                     Buffers: shared hit=688 read=744
                     I/O Timings: read=3.152
                     ->  Nested Loop  (cost=471.70..1505.24 rows=13 width=16) (actual time=7.414..9.985 rows=400 loops=1)
                           Buffers: shared hit=688 read=744
                           I/O Timings: read=3.152
                           ->  Nested Loop  (cost=471.27..1388.65 rows=1 width=16) (actual time=7.367..7.380 rows=1 loops=1)
                                 Join Filter: ((COALESCE(j_3.name, j_2.name))::text = (j_4.name)::text)
                                 Rows Removed by Join Filter: 2950
                                 Buffers: shared hit=686 read=343
                                 I/O Timings: read=1.438
                                 ->  Hash Left Join  (cost=471.27..886.25 rows=29 width=250) (actual time=4.244..5.747 rows=2951 loops=1)
                                       Hash Cond: (j_2.symlink_target_uuid = j_3.uuid)
                                       Filter: ((COALESCE(j_3.namespace_name, j_2.namespace_name))::text = 'yarn://odin'::text)
                                       Rows Removed by Filter: 2152
                                       Buffers: shared hit=343 read=343
                                       I/O Timings: read=1.438
                                       ->  Seq Scan on jobs j_2  (cost=0.00..400.01 rows=5701 width=149) (actual time=0.033..0.927 rows=5103 loops=1)
                                             Buffers: shared hit=342 read=1
                                             I/O Timings: read=0.007
                                       ->  Hash  (cost=400.01..400.01 rows=5701 width=149) (actual time=4.169..4.171 rows=5103 loops=1)
                                             Buckets: 8192  Batches: 1  Memory Usage: 958kB
                                             Buffers: shared hit=1 read=342
                                             I/O Timings: read=1.431
                                             ->  Seq Scan on jobs j_3  (cost=0.00..400.01 rows=5701 width=149) (actual time=0.009..2.991 rows=5103 loops=1)
                                                   Filter: ((is_hidden IS FALSE) AND (symlink_target_uuid IS NULL))
                                                   Buffers: shared hit=1 read=342
                                                   I/O Timings: read=1.431
                                 ->  Materialize  (cost=0.00..499.81 rows=6 width=133) (actual time=0.000..0.000 rows=1 loops=2951)
                                       Buffers: shared hit=343
                                       ->  Seq Scan on jobs j_4  (cost=0.00..499.78 rows=6 width=133) (actual time=0.849..0.851 rows=1 loops=1)
                                             Filter: ((is_hidden IS FALSE) AND (symlink_target_uuid IS NULL) AND ((namespace_name)::text = 'yarn://odin'::text) AND (((name)::text = 'etl_coredl_dh_cdm_odin_agg_acc_mob_payment_d_agg_acc_mob_payment_d_frs_kbn.execute_insert_into_hadoop_fs_relation_command'::text) OR ((name)::text = ANY ((aliases)::text[]))))
                                             Rows Removed by Filter: 5102
                                             Buffers: shared hit=343
                           ->  Index Scan using runs_job_uuid on runs r_1  (cost=0.43..104.84 rows=1175 width=32) (actual time=0.037..2.527 rows=400 loops=1)
                                 Index Cond: (job_uuid = j_2.uuid)
                                 Buffers: shared hit=2 read=401
                                 I/O Timings: read=1.714
 Planning:
   Buffers: shared hit=79 read=33
   I/O Timings: read=0.241
 Planning Time: 12.341 ms
 JIT:
   Functions: 135
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 16.249 ms, Inlining 299.377 ms, Optimization 1222.107 ms, Emission 664.409 ms, Total 2202.141 ms
 Execution Time: 412316.058 ms
(189 rows)

See graphics representation.

dataset_facets table is ~22Gb, so sequential scan is very inefficient here.

Copy link

boring-cyborg bot commented Dec 12, 2024

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

@dolfinus
Copy link
Contributor Author

dolfinus commented Dec 12, 2024

Probably there is a missing index:

CREATE INDEX CONCURRENTLY "dataset_facets-io-text" ON dataset_facets (type) WHERE type ILIKE 'output' OR type ILIKE 'input';

Query runs 4x faster than before, but it still takes 3 minutes for each job:

                                                                                                                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2547862.84..2547862.95 rows=1 width=652) (actual time=168425.958..168539.540 rows=1 loops=1)
   Buffers: shared hit=15861436 read=1886842, temp read=7519003 written=7519009
   I/O Timings: read=9355.593
   ->  Gather Merge  (cost=2547862.84..2547864.00 rows=10 width=652) (actual time=166759.437..166873.016 rows=1 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=15861436 read=1886842, temp read=7519003 written=7519009
         I/O Timings: read=9355.593
         ->  Sort  (cost=2546862.81..2546862.83 rows=5 width=652) (actual time=164182.523..164182.769 rows=1 loops=3)
               Sort Key: r.transitioned_at DESC, r.started_at DESC
               Sort Method: top-N heapsort  Memory: 141kB
               Buffers: shared hit=15861436 read=1886842, temp read=7519003 written=7519009
               I/O Timings: read=9355.593
               Worker 0:  Sort Method: top-N heapsort  Memory: 102kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 89kB
               ->  Hash Semi Join  (cost=2406529.95..2546862.79 rows=5 width=652) (actual time=132293.766..164180.085 rows=133 loops=3)
                     Hash Cond: (r.uuid = r_1.uuid)
                     Buffers: shared hit=15861423 read=1886841, temp read=7519003 written=7519009
                     I/O Timings: read=9355.547
                     ->  Hash Left Join  (cost=2405024.55..2542759.43 rows=989678 width=822) (actual time=125749.229..163000.666 rows=790098 loops=3)
                           Hash Cond: (r.uuid = df.run_uuid)
                           Buffers: shared hit=15857861 read=1886105, temp read=7519003 written=7519009
                           I/O Timings: read=9351.933
                           ->  Hash Left Join  (cost=2209340.83..2344477.80 rows=989678 width=790) (actual time=121101.805..153080.397 rows=790098 loops=3)
                                 Hash Cond: (r.uuid = ro.run_uuid)
                                 Buffers: shared hit=15475374 read=1694850, temp read=6737526 written=6737526
                                 I/O Timings: read=8303.268
                                 ->  Hash Left Join  (cost=2094902.98..2227442.04 rows=989678 width=758) (actual time=118915.675..146088.397 rows=790098 loops=3)
                                       Hash Cond: (r.uuid = ri.run_uuid)
                                       Buffers: shared hit=14750780 read=1647824, temp read=5998373 written=5998373
                                       I/O Timings: read=8093.330
                                       ->  Hash Left Join  (cost=2060701.70..2190642.86 rows=989678 width=726) (actual time=117877.032..144530.590 rows=790098 loops=3)
                                             Hash Cond: (r.job_version_uuid = jv.uuid)
                                             Buffers: shared hit=14297574 read=1624422, temp read=5998373 written=5998373
                                             I/O Timings: read=7986.778
                                             ->  Parallel Hash Left Join  (cost=2060345.03..2187686.55 rows=989678 width=710) (actual time=117874.166..144271.171 rows=790098 loops=3)
                                                   Hash Cond: (r.run_args_uuid = ra.uuid)
                                                   Buffers: shared hit=14297142 read=1624206, temp read=5998373 written=5998373
                                                   I/O Timings: read=7985.899
                                                   ->  Hash Left Join  (cost=2057304.83..2182048.26 rows=989678 width=528) (actual time=117859.820..143863.963 rows=790098 loops=3)
                                                         Hash Cond: (r.uuid = f.run_uuid)
                                                         Buffers: shared hit=14296609 read=1622487, temp read=5998373 written=5998373
                                                         I/O Timings: read=7977.586
                                                         ->  Hash Left Join  (cost=942.55..123088.07 rows=989678 width=496) (actual time=4.894..733.314 rows=790098 loops=3)
                                                               Hash Cond: (j.symlink_target_uuid = j_1.uuid)
                                                               Buffers: shared hit=2058 read=107051
                                                               I/O Timings: read=444.371
                                                               ->  Hash Join  (cost=471.27..120018.89 rows=989678 width=395) (actual time=2.601..567.341 rows=790098 loops=3)
                                                                     Hash Cond: (r.job_uuid = j.uuid)
                                                                     Buffers: shared hit=1029 read=107051
                                                                     I/O Timings: read=444.371
                                                                     ->  Parallel Seq Scan on runs r  (cost=0.00..116947.77 rows=989678 width=262) (actual time=0.024..282.185 rows=790098 loops=3)
                                                                           Buffers: shared read=107051
                                                                           I/O Timings: read=444.371
                                                                     ->  Hash  (cost=400.01..400.01 rows=5701 width=149) (actual time=2.531..2.546 rows=5103 loops=3)
                                                                           Buckets: 8192  Batches: 1  Memory Usage: 889kB
                                                                           Buffers: shared hit=1029
                                                                           ->  Seq Scan on jobs j  (cost=0.00..400.01 rows=5701 width=149) (actual time=0.026..1.173 rows=5103 loops=3)
                                                                                 Buffers: shared hit=1029
                                                               ->  Hash  (cost=400.01..400.01 rows=5701 width=149) (actual time=2.235..2.245 rows=5103 loops=3)
                                                                     Buckets: 8192  Batches: 1  Memory Usage: 958kB
                                                                     Buffers: shared hit=1029
                                                                     ->  Seq Scan on jobs j_1  (cost=0.00..400.01 rows=5701 width=149) (actual time=0.010..0.889 rows=5103 loops=3)
                                                                           Filter: ((is_hidden IS FALSE) AND (symlink_target_uuid IS NULL))
                                                                           Buffers: shared hit=1029
                                                         ->  Hash  (cost=2055071.46..2055071.46 rows=103266 width=48) (actual time=117852.864..117852.871 rows=2344707 loops=3)
                                                               Buckets: 131072 (originally 131072)  Batches: 256 (originally 1)  Memory Usage: 64541kB
                                                               Buffers: shared hit=14294551 read=1515436, temp written=3847428
                                                               I/O Timings: read=7533.216
                                                               ->  Subquery Scan on f  (cost=0.44..2055071.46 rows=103266 width=48) (actual time=0.167..86922.405 rows=2344707 loops=3)
                                                                     Buffers: shared hit=14294551 read=1515436
                                                                     I/O Timings: read=7533.216
                                                                     ->  GroupAggregate  (cost=0.44..2054038.80 rows=103266 width=48) (actual time=0.164..86523.141 rows=2344707 loops=3)
                                                                           Group Key: run_facets.run_uuid
                                                                           Buffers: shared hit=14294551 read=1515436
                                                                           I/O Timings: read=7533.216
                                                                           ->  Index Scan using run_facets_run_uuid_idx on run_facets  (cost=0.44..1948491.64 rows=20851267 width=490) (actual time=0.045..11806.854 rows=20986865 loops=3)
                                                                                 Buffers: shared hit=14291605 read=1515256
                                                                                 I/O Timings: read=7531.938
                                                   ->  Parallel Hash  (cost=2602.31..2602.31 rows=35031 width=198) (actual time=14.083..14.086 rows=19839 loops=3)
                                                         Buckets: 65536  Batches: 1  Memory Usage: 14240kB
                                                         Buffers: shared hit=533 read=1719
                                                         I/O Timings: read=8.312
                                                         ->  Parallel Seq Scan on run_args ra  (cost=0.00..2602.31 rows=35031 width=198) (actual time=0.020..17.847 rows=59518 loops=1)
                                                               Buffers: shared hit=533 read=1719
                                                               I/O Timings: read=8.312
                                             ->  Hash  (cost=278.52..278.52 rows=6252 width=32) (actual time=2.775..2.783 rows=6194 loops=3)
                                                   Buckets: 8192  Batches: 1  Memory Usage: 452kB
                                                   Buffers: shared hit=432 read=216
                                                   I/O Timings: read=0.879
                                                   ->  Seq Scan on job_versions jv  (cost=0.00..278.52 rows=6252 width=32) (actual time=0.033..1.596 rows=6194 loops=3)
                                                         Buffers: shared hit=432 read=216
                                                         I/O Timings: read=0.879
                                       ->  Hash  (cost=33749.22..33749.22 rows=36165 width=48) (actual time=1038.190..1038.216 rows=58795 loops=3)
                                             Buckets: 65536  Batches: 1  Memory Usage: 52475kB
                                             Buffers: shared hit=453206 read=23402
                                             I/O Timings: read=106.552
                                             ->  Subquery Scan on ri  (cost=0.85..33749.22 rows=36165 width=48) (actual time=0.290..992.534 rows=58795 loops=3)
                                                   Buffers: shared hit=453206 read=23402
                                                   I/O Timings: read=106.552
                                                   ->  GroupAggregate  (cost=0.85..33387.57 rows=36165 width=48) (actual time=0.282..985.272 rows=58795 loops=3)
                                                         Group Key: im.run_uuid
                                                         Buffers: shared hit=453206 read=23402
                                                         I/O Timings: read=106.552
                                                         ->  Nested Loop  (cost=0.85..31451.59 rows=197855 width=150) (actual time=0.165..349.372 rows=197521 loops=3)
                                                               Buffers: shared hit=453203 read=23399
                                                               I/O Timings: read=106.521
                                                               ->  Index Only Scan using runs_input_mapping_pkey on runs_input_mapping im  (cost=0.42..7899.56 rows=197855 width=32) (actual time=0.059..29.604 rows=197521 loops=3)
                                                                     Heap Fetches: 22179
                                                                     Buffers: shared hit=5933 read=1251
                                                                     I/O Timings: read=5.352
                                                               ->  Memoize  (cost=0.43..1.82 rows=1 width=134) (actual time=0.001..0.001 rows=1 loops=592563)
                                                                     Cache Key: im.dataset_version_uuid
                                                                     Cache Mode: logical
                                                                     Hits: 158403  Misses: 39118  Evictions: 0  Overflows: 0  Memory Usage: 10004kB
                                                                     Buffers: shared hit=447270 read=22148
                                                                     I/O Timings: read=101.169
                                                                     Worker 0:  Hits: 158403  Misses: 39118  Evictions: 0  Overflows: 0  Memory Usage: 10004kB
                                                                     Worker 1:  Hits: 158403  Misses: 39118  Evictions: 0  Overflows: 0  Memory Usage: 10004kB
                                                                     ->  Index Scan using dataset_versions_pkey on dataset_versions dv  (cost=0.42..1.81 rows=1 width=134) (actual time=0.005..0.005 rows=1 loops=117354)
                                                                           Index Cond: (uuid = im.dataset_version_uuid)
                                                                           Buffers: shared hit=447270 read=22148
                                                                           I/O Timings: read=101.169
                                 ->  Hash  (cost=109813.38..109813.38 rows=369958 width=48) (actual time=2184.071..2184.081 rows=368533 loops=3)
                                       Buckets: 524288 (originally 524288)  Batches: 2 (originally 1)  Memory Usage: 61441kB
                                       Buffers: shared hit=724594 read=47026, temp written=19623
                                       I/O Timings: read=209.938
                                       ->  Subquery Scan on ro  (cost=0.42..109813.38 rows=369958 width=48) (actual time=0.118..1888.928 rows=368534 loops=3)
                                             Buffers: shared hit=724594 read=47026
                                             I/O Timings: read=209.938
                                             ->  GroupAggregate  (cost=0.42..106113.80 rows=369958 width=48) (actual time=0.104..1847.626 rows=368534 loops=3)
                                                   Group Key: dataset_versions.run_uuid
                                                   Buffers: shared hit=724594 read=47026
                                                   I/O Timings: read=209.938
                                                   ->  Index Scan using dataset_versions_run_uuid on dataset_versions  (cost=0.42..98701.81 rows=371668 width=150) (actual time=0.048..364.140 rows=369996 loops=3)
                                                         Buffers: shared hit=724594 read=47026
                                                         I/O Timings: read=209.938
                           ->  Hash  (cost=195467.53..195467.53 rows=17295 width=48) (actual time=4647.115..4647.131 rows=210444 loops=3)
                                 Buckets: 131072 (originally 32768)  Batches: 2 (originally 1)  Memory Usage: 64513kB
                                 Buffers: shared hit=382487 read=191255, temp read=36366 written=56862
                                 I/O Timings: read=1048.665
                                 ->  Subquery Scan on df  (cost=192614.76..195467.53 rows=17295 width=48) (actual time=2788.159..4389.632 rows=210444 loops=3)
                                       Buffers: shared hit=382487 read=191255, temp read=36366 written=36372
                                       I/O Timings: read=1048.665
                                       ->  GroupAggregate  (cost=192614.76..195294.58 rows=17295 width=48) (actual time=2788.139..4364.211 rows=210444 loops=3)
                                             Group Key: dataset_facets.run_uuid
                                             Buffers: shared hit=382487 read=191255, temp read=36366 written=36372
                                             I/O Timings: read=1048.665
                                             ->  Sort  (cost=192614.76..193230.67 rows=246364 width=692) (actual time=2787.684..2863.614 rows=256134 loops=3)
                                                   Sort Key: dataset_facets.run_uuid
                                                   Sort Method: external merge  Disk: 96976kB
                                                   Buffers: shared hit=382487 read=191255, temp read=36366 written=36372
                                                   I/O Timings: read=1048.665
                                                   Worker 0:  Sort Method: external merge  Disk: 96976kB
                                                   Worker 1:  Sort Method: external merge  Disk: 96976kB
                                                   ->  Index Scan using "dataset_facets-io-text" on dataset_facets  (cost=0.42..94763.33 rows=246364 width=692) (actual time=0.075..2265.111 rows=256134 loops=3)
                                                         Buffers: shared hit=382484 read=191244
                                                         I/O Timings: read=1048.536
                     ->  Hash  (cost=1505.24..1505.24 rows=13 width=16) (actual time=1062.148..1062.197 rows=400 loops=3)
                           Buckets: 1024  Batches: 1  Memory Usage: 27kB
                           Buffers: shared hit=3562 read=736
                           I/O Timings: read=3.614
                           ->  Nested Loop  (cost=471.70..1505.24 rows=13 width=16) (actual time=1060.618..1062.118 rows=400 loops=3)
                                 Buffers: shared hit=3562 read=736
                                 I/O Timings: read=3.614
                                 ->  Nested Loop  (cost=471.27..1388.65 rows=1 width=16) (actual time=1060.554..1060.593 rows=1 loops=3)
                                       Join Filter: ((COALESCE(j_3.name, j_2.name))::text = (j_4.name)::text)
                                       Rows Removed by Join Filter: 2950
                                       Buffers: shared hit=2744 read=343
                                       I/O Timings: read=1.612
                                       ->  Hash Left Join  (cost=471.27..886.25 rows=29 width=250) (actual time=1057.617..1059.026 rows=2951 loops=3)
                                             Hash Cond: (j_2.symlink_target_uuid = j_3.uuid)
                                             Filter: ((COALESCE(j_3.namespace_name, j_2.namespace_name))::text = 'yarn://odin'::text)
                                             Rows Removed by Filter: 2152
                                             Buffers: shared hit=1715 read=343
                                             I/O Timings: read=1.612
                                             ->  Seq Scan on jobs j_2  (cost=0.00..400.01 rows=5701 width=149) (actual time=1054.022..1054.847 rows=5103 loops=3)
                                                   Buffers: shared hit=1028 read=1
                                                   I/O Timings: read=0.008
                                             ->  Hash  (cost=400.01..400.01 rows=5701 width=149) (actual time=3.553..3.558 rows=5103 loops=3)
                                                   Buckets: 8192  Batches: 1  Memory Usage: 958kB
                                                   Buffers: shared hit=687 read=342
                                                   I/O Timings: read=1.603
                                                   ->  Seq Scan on jobs j_3  (cost=0.00..400.01 rows=5701 width=149) (actual time=0.019..2.476 rows=5103 loops=3)
                                                         Filter: ((is_hidden IS FALSE) AND (symlink_target_uuid IS NULL))
                                                         Buffers: shared hit=687 read=342
                                                         I/O Timings: read=1.603
                                       ->  Materialize  (cost=0.00..499.81 rows=6 width=133) (actual time=0.000..0.000 rows=1 loops=8853)
                                             Buffers: shared hit=1029
                                             ->  Seq Scan on jobs j_4  (cost=0.00..499.78 rows=6 width=133) (actual time=0.825..0.829 rows=1 loops=3)
                                                   Filter: ((is_hidden IS FALSE) AND (symlink_target_uuid IS NULL) AND ((namespace_name)::text = 'yarn://odin'::text) AND (((name)::text = 'etl_coredl_dh_cdm_odin_agg_acc_mob_payment_d_agg_acc_mob_payment_d_frs_kbn.execute_insert_into_hadoop_fs_relation_command'::text) OR ((name)::text = ANY ((aliases)::text[]))))
                                                   Rows Removed by Filter: 5102
                                                   Buffers: shared hit=1029
                                 ->  Index Scan using runs_job_uuid on runs r_1  (cost=0.43..104.84 rows=1175 width=32) (actual time=0.054..1.455 rows=400 loops=3)
                                       Index Cond: (job_uuid = j_2.uuid)
                                       Buffers: shared hit=818 read=393
                                       I/O Timings: read=2.002
 Planning:
   Buffers: shared hit=99 read=35
   I/O Timings: read=0.386
 Planning Time: 14.904 ms
 JIT:
   Functions: 379
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 56.493 ms, Inlining 368.261 ms, Optimization 2880.508 ms, Emission 1582.286 ms, Total 4887.547 ms
 Execution Time: 168626.484 ms
(206 rows)

Is there any need to join dataset_facets, input & output versions only to show the list of run names and the timestamp + state of the latest run within each job?

@dolfinus dolfinus changed the title Very slow /jobs endpoint after upgrading to 0.50.0 Very slow /api/v1/jobs endpoint after upgrading to 0.50.0 Dec 12, 2024
@mattwparas
Copy link
Contributor

We're experiencing the same slowness as well - our preliminary analysis seemed to second the above

@lmassaoy
Copy link

lmassaoy commented Jan 2, 2025

We're facing the same slowness. The reason appears to be that we migrated from version 0.47.0 to 0.50.0, keeping the same database. We tested a testing deployment using a brand-new RDS, and the performance was significantly better.

@dolfinus
Copy link
Contributor Author

dolfinus commented Jan 2, 2025

I've tried to start from the fresh database, but after consuming the same amount of lineage data, the slowness appeared again. I have to use 0.49.0 to fix that, on a fresh database (because I didn't get how to downgrade flyway migrations).

@jonathanpmoraes
Copy link

We tested deploying a new RDS Postgres base with the latest Marquez release and it worked initially but when it got up to 6k jobs, it became slow again. All other queries not related to jobs are pretty fast though. I wonder if is the query format that is causing issues with the performance.

@jonathanpmoraes
Copy link

When it comes to the JobResource.java class, I compared it with versions 0.47 and 0.50 and got some interesing findings. The new code always loads the jobs plus their runs (e.g., via findAllWithRun and the defaulting of lastRunStates). That extra join/query fetches a lot more data when you have thousands of jobs, like it is in my case. Maybe, if we could avoid loading runs inline or make the run fetching optional/paginated—e.g., limit the run query, create indexes for filtering, or remove the extra joins unless explicitly requested, it would work faster.

@jonathanpmoraes
Copy link

Looking at the code again, I don't believe there's a need to let the public Response list methods fetch an unlimited number of resources. I believe this is the error that's causing the slow down issue. Maybe by enforcing a set limit, say 100, it would definetely run faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants