Skip to content

Commit

Permalink
Fix vacuum_progress isolation tests
Browse files Browse the repository at this point in the history
Fix vacuum_progress isolation tests

Isolation tests vacuum_progress_column and vacuum_progress_row would
previously fail sometimes because vacuum_worker_changed fault injector was
never reached. This happened because vacuum execution and FTS version upgrade
are not synchronized: after FTS disables syncrep according to the test, it
should bump FTS version, which should be detected by the dispatcher on the
next StartTransaction call, and then the segment backends will be restarted
(making vacuum_worker_changed fault accessible). However, if FTS is too slow
(or dispatcher too fast), it might skip StartTransaction calls before FTS
version is bumped, and the segment restart will happen after vacuum is
complete, without triggering vacuum_worker_changed. This is an acceptable
behavior for the code since the restart still happens milliseconds after
disabling syncrep, but the test doesn't expect it.

This patch fixes the test by making the dispatcher wait on
vacuum_rel_finished_one_relation fault until waits until mirror down is
detected by FTS, ensuring FTS version bump happens before dispatcher proceeds.

Note that the patch also changes expected output: it seems like previously
restart happened on the last StartTransaction, making all the collected
stats to be zero. However, post-cleanup vacuum phase (which is executed after
vacuum_worker_changed) is also responsible for vacuuming the indexes, so the
correct output should include those too. Note that num_dead_tuples value is
still zero, indicating that vacrelstats was indeed reset correctly.
  • Loading branch information
silent-observer authored Nov 27, 2024
1 parent e8aa067 commit e60a012
Show file tree
Hide file tree
Showing 4 changed files with 60 additions and 8 deletions.
27 changes: 23 additions & 4 deletions src/test/isolation2/expected/vacuum_progress_column.out
Original file line number Diff line number Diff line change
Expand Up @@ -318,6 +318,13 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
--------------------------
Success:
(1 row)
-- Dispatcher shouldn't proceed until status_version in FtsVersion in FtsProbeInfo is updated,
-- ensuring the backends are restarted before they start the post-cleanup phase.
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'suspend', '', '', 'vacuum_progress_ao_column', 1, 1, 0, 1) FROM master();
gp_inject_fault
-----------------
Success:
(1 row)
2: SELECT gp_inject_fault('vacuum_ao_after_compact', 'reset', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
gp_inject_fault
-----------------
Expand Down Expand Up @@ -347,6 +354,18 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
--------------------------
Success:
(1 row)
-- wait for the mirror stop to be detected (timeout 2 minutes)
2: SELECT wait_for_mirror_down(1::smallint, 120);
wait_for_mirror_down
----------------------
t
(1 row)
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'reset', dbid) FROM master();
gp_inject_fault
-----------------
Success:
(1 row)

-- Ensure we enter into the target logic which stops cumulative data but
-- initializes a new vacrelstats at the beginning of post-cleanup phase.
-- Also all segments should reach to the same "vacuum_worker_changed" point
Expand Down Expand Up @@ -397,14 +416,14 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
select gp_segment_id, relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, index_vacuum_count, max_dead_tuples, num_dead_tuples from gp_stat_progress_vacuum where gp_segment_id > -1;
gp_segment_id | relname | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
---------------+---------------------------+-------------------------------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
2 | vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
0 | vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
1 | vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
2 | vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 9 | 2 | 0 | 0
1 | vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 9 | 2 | 0 | 0
0 | vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 9 | 2 | 0 | 0
(3 rows)
select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, index_vacuum_count, max_dead_tuples, num_dead_tuples from gp_stat_progress_vacuum_summary;
relname | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
---------------------------+-------------------------------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
vacuum_progress_ao_column | append-optimized post-cleanup | 0 | 0 | 27 | 6 | 0 | 0
(1 row)

2: SELECT gp_inject_fault('vacuum_ao_post_cleanup_end', 'reset', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
Expand Down
27 changes: 23 additions & 4 deletions src/test/isolation2/expected/vacuum_progress_row.out
Original file line number Diff line number Diff line change
Expand Up @@ -371,6 +371,13 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
--------------------------
Success:
(1 row)
-- Dispatcher shouldn't proceed until status_version in FtsVersion in FtsProbeInfo is updated,
-- ensuring the backends are restarted before they start the post-cleanup phase.
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'suspend', '', '', 'vacuum_progress_ao_row', 1, 1, 0, 1) FROM master();
gp_inject_fault
-----------------
Success:
(1 row)
2: SELECT gp_inject_fault('vacuum_ao_after_compact', 'reset', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
gp_inject_fault
-----------------
Expand Down Expand Up @@ -400,6 +407,18 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
--------------------------
Success:
(1 row)
-- wait for the mirror stop to be detected (timeout 2 minutes)
2: SELECT wait_for_mirror_down(1::smallint, 120);
wait_for_mirror_down
----------------------
t
(1 row)
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'reset', dbid) FROM master();
gp_inject_fault
-----------------
Success:
(1 row)

-- Ensure we enter into the target logic which stops cumulative data but
-- initializes a new vacrelstats at the beginning of post-cleanup phase.
-- Also all segments should reach to the same "vacuum_worker_changed" point
Expand Down Expand Up @@ -450,14 +469,14 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
select gp_segment_id, relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, index_vacuum_count, max_dead_tuples, num_dead_tuples from gp_stat_progress_vacuum where gp_segment_id > -1;
gp_segment_id | relname | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
---------------+------------------------+-------------------------------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
0 | vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
2 | vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
1 | vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
2 | vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 19 | 2 | 0 | 0
0 | vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 19 | 2 | 0 | 0
1 | vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 19 | 2 | 0 | 0
(3 rows)
select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, index_vacuum_count, max_dead_tuples, num_dead_tuples from gp_stat_progress_vacuum_summary;
relname | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
------------------------+-------------------------------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 0 | 0 | 0 | 0
vacuum_progress_ao_row | append-optimized post-cleanup | 0 | 0 | 57 | 6 | 0 | 0
(1 row)

2: SELECT gp_inject_fault('vacuum_ao_post_cleanup_end', 'reset', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
Expand Down
7 changes: 7 additions & 0 deletions src/test/isolation2/sql/vacuum_progress_column.sql
Original file line number Diff line number Diff line change
Expand Up @@ -132,6 +132,9 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he

-- Resume execution of compact phase and block at syncrep on one segment.
2: SELECT gp_inject_fault_infinite('wal_sender_loop', 'suspend', dbid) FROM gp_segment_configuration WHERE role = 'p' and content = 1;
-- Dispatcher shouldn't proceed until status_version in FtsVersion in FtsProbeInfo is updated,
-- ensuring the backends are restarted before they start the post-cleanup phase.
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'suspend', '', '', 'vacuum_progress_ao_column', 1, 1, 0, 1) FROM master();
2: SELECT gp_inject_fault('vacuum_ao_after_compact', 'reset', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
-- stop the mirror should turn off syncrep
2: SELECT pg_ctl(datadir, 'stop', 'immediate') FROM gp_segment_configuration WHERE content = 1 AND role = 'm';
Expand All @@ -141,6 +144,10 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
2: SELECT gp_inject_fault('vacuum_worker_changed', 'suspend', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
-- resume walsender and let it exit so that mirror stop can be detected
2: SELECT gp_inject_fault_infinite('wal_sender_loop', 'reset', dbid) FROM gp_segment_configuration WHERE role = 'p' and content = 1;
-- wait for the mirror stop to be detected (timeout 2 minutes)
2: SELECT wait_for_mirror_down(1::smallint, 120);
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'reset', dbid) FROM master();

-- Ensure we enter into the target logic which stops cumulative data but
-- initializes a new vacrelstats at the beginning of post-cleanup phase.
-- Also all segments should reach to the same "vacuum_worker_changed" point
Expand Down
7 changes: 7 additions & 0 deletions src/test/isolation2/sql/vacuum_progress_row.sql
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,9 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he

-- Resume execution of compact phase and block at syncrep.
2: SELECT gp_inject_fault_infinite('wal_sender_loop', 'suspend', dbid) FROM gp_segment_configuration WHERE role = 'p' and content = 1;
-- Dispatcher shouldn't proceed until status_version in FtsVersion in FtsProbeInfo is updated,
-- ensuring the backends are restarted before they start the post-cleanup phase.
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'suspend', '', '', 'vacuum_progress_ao_row', 1, 1, 0, 1) FROM master();
2: SELECT gp_inject_fault('vacuum_ao_after_compact', 'reset', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
-- stop the mirror should turn off syncrep
2: SELECT pg_ctl(datadir, 'stop', 'immediate') FROM gp_segment_configuration WHERE content=1 AND role = 'm';
Expand All @@ -145,6 +148,10 @@ select relid::regclass as relname, phase, heap_blks_total, heap_blks_scanned, he
2: SELECT gp_inject_fault('vacuum_worker_changed', 'suspend', dbid) FROM gp_segment_configuration WHERE content > -1 AND role = 'p';
-- resume walsender and let it exit so that mirror stop can be detected
2: SELECT gp_inject_fault_infinite('wal_sender_loop', 'reset', dbid) FROM gp_segment_configuration WHERE role = 'p' and content = 1;
-- wait for the mirror stop to be detected (timeout 2 minutes)
2: SELECT wait_for_mirror_down(1::smallint, 120);
2: SELECT gp_inject_fault('vacuum_rel_finished_one_relation', 'reset', dbid) FROM master();

-- Ensure we enter into the target logic which stops cumulative data but
-- initializes a new vacrelstats at the beginning of post-cleanup phase.
-- Also all segments should reach to the same "vacuum_worker_changed" point
Expand Down

0 comments on commit e60a012

Please sign in to comment.