
test_pg_regress failure in portals #10278

Open
jcsp opened this issue Jan 3, 2025 · 13 comments
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@jcsp
Collaborator

jcsp commented Jan 3, 2025

This test run is from the tip of main on 2025-01-02
https://neon-github-public-dev.s3.amazonaws.com/reports/main/12581797874/index.html#testresult/1efe3a56821553c7/

not ok 101   + portals                                  1388 ms

The .diff shows many unexpected results, snippet:

@@ -34,8 +34,8 @@
 FETCH 2 in foo2;
  unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
 ---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
-    8800 |       0 |   0 |    0 |   0 |      0 |       0 |      800 |         800 |      3800 |     8800 |   0 |    1 | MAAAAA   | AAAAAA   | AAAAxx
-    1891 |       1 |   1 |    3 |   1 |     11 |      91 |      891 |        1891 |      1891 |     1891 | 182 |  183 | TUAAAA   | BAAAAA   | HHHHxx
+    9437 |    9744 |   1 |    1 |   7 |     17 |      37 |      437 |        1437 |      4437 |     9437 |  74 |   75 | ZYAAAA   | UKOAAA   | AAAAxx
+    1821 |    9745 |   1 |    1 |   1 |      1 |      21 |      821 |        1821 |      1821 |     1821 |  42 |   43 | BSAAAA   | VKOAAA   | HHHHxx
 (2 rows)
@jcsp jcsp added c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug labels Jan 3, 2025
@jcsp
Collaborator Author

jcsp commented Jan 3, 2025

Looked back 30 days, there was one other case of this on 2024-12-07: https://neon-github-public-dev.s3.amazonaws.com/reports/main/12213126318/index.html#/testresult/4fed18972c23e9c0

@jcsp jcsp changed the title pg_regress failure in portals test_pg_regress failure in portals Jan 3, 2025
@alexanderlaw

An intermediate result: to reproduce/diagnose the failure, I adjusted the tests in the portals line of the schedule:
test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update delete namespace prepared_xacts

to make them repeatable (probably not all of the tests are needed for reproduction; I haven't tried to reduce the list yet), repeated this line in parallel_schedule 5 times (see the attached parallel_schedule), and added the following to portals.sql (and reflected this change in expected/portals.out):

@@ -213,6 +213,10 @@ BEGIN;
 
 DECLARE foo25 SCROLL CURSOR WITH HOLD FOR SELECT * FROM tenk2;
 
+EXPLAIN (VERBOSE ON, COSTS OFF) SELECT * FROM tenk2;
+
+SELECT ctid, * FROM tenk2;
+
 FETCH FROM foo25;

The complete patch for tests: portals-test-diagnostic.zip

With these changes applied, I ran test_runner/regress/test_pg_regress.py::test_pg_regress[release-pg16-None] in parallel (-j15) and could reproduce the failure after several iterations, e.g.:

ITERATION 19
1       ================== 1 failed, 2 warnings in 372.19s (0:06:12) ===================
4       ================== 1 passed, 2 warnings in 376.36s (0:06:16) ===================
14      ================== 1 passed, 2 warnings in 376.55s (0:06:16) ===================
...
test_pg_regress_1[release-pg16-None]/regress/regression.out

# parallel group (20 tests):  select_distinct_on random select_having select_implicit delete case prepared_xacts select_into select_distinct namespace transactions union portals arrays subselect hash_index btree_index aggregates update join
ok 149       + select_into                              8494 ms
ok 150       + select_distinct                          8494 ms
ok 151       + select_distinct_on                       3429 ms
ok 152       + select_implicit                          5573 ms
ok 153       + select_having                            5295 ms
ok 154       + subselect                               33071 ms
ok 155       + union                                   10672 ms
ok 156       + case                                     6887 ms
ok 157       + join                                    59410 ms
ok 158       + aggregates                              59269 ms
ok 159       + transactions                             9121 ms
ok 160       + random                                   5293 ms
not ok 161   + portals                                 10908 ms
ok 162       + arrays                                  29247 ms
ok 163       + btree_index                             59170 ms
ok 164       + hash_index                              38379 ms
ok 165       + update                                  59402 ms
ok 166       + delete                                   6532 ms
ok 167       + namespace                                8583 ms
ok 168       + prepared_xacts                           7254 ms
# parallel group (20 tests):  select_distinct_on random select_having select_implicit delete case prepared_xacts namespace select_distinct select_into transactions union portals arrays subselect hash_index btree_index aggregates update join
ok 169       + select_into                              6558 ms
ok 170       + select_distinct                          6556 ms
ok 171       + select_distinct_on                       1339 ms
ok 172       + select_implicit                          2822 ms
ok 173       + select_having                            2270 ms
ok 174       + subselect                               17409 ms
ok 175       + union                                   10248 ms
ok 176       + case                                     3857 ms
ok 177       + join                                    50480 ms
ok 178       + aggregates                              48611 ms
ok 179       + transactions                             8336 ms
ok 180       + random                                   1772 ms
not ok 181   + portals                                 10604 ms
ok 182       + arrays                                  10671 ms
ok 183       + btree_index                             43445 ms
ok 184       + hash_index                              40468 ms
ok 185       + update                                  49927 ms
ok 186       + delete                                   3300 ms
ok 187       + namespace                                5364 ms
ok 188       + prepared_xacts                           4499 ms
1..188
# 2 of 188 tests failed.

Please look at the regression.diffs.zip produced.

compute.log doesn't contain any interesting information (I see no messages from autovacuum or the like); I will try to increase the logging level and reproduce once again.

@knizhnik
Contributor

Can you try to reproduce it with vanilla Postgres?
The problem seems to be clear: the portals test expects a particular order of the records returned by a cursor.
But that order is not guaranteed by SQL, and with synchronized seqscans (when several backends concurrently traverse the same table) it may differ.

The fix seems obvious: exclude portals from the test group.
The question is what we should do: reproduce it and report the bug to the community?
Or fix it only in Neon?
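To make the mechanism concrete, here is an editorial toy model (not PostgreSQL source): the shared scan_locations machinery is reduced to a single "last reported block" hint per table, which is enough to show why two consecutive scans can return rows in different orders.

```python
# Toy model of synchronized sequential scans: a new scan starts at the
# block most recently reported for the table and wraps around, so the
# row order depends on where the previous scan left the shared hint.

class ScanLocations:
    """Shared hint table: table name -> last block a seqscan reported."""
    def __init__(self):
        self.hints = {}

    def get_start(self, table):
        # Start at the reported block, or at block 0 if nobody scanned yet.
        return self.hints.get(table, 0)

    def report(self, table, block):
        self.hints[table] = block

def seqscan(table_name, blocks, locations):
    """Scan all blocks, starting at the shared hint and wrapping around."""
    start = locations.get_start(table_name)
    n = len(blocks)
    out = []
    for i in range(n):
        block = (start + i) % n
        out.append(blocks[block])
        locations.report(table_name, block)
    return out

locations = ScanLocations()
tenk2 = list(range(10))  # stand-in for 10 heap blocks

first = seqscan("tenk2", tenk2, locations)   # starts at block 0
second = seqscan("tenk2", tenk2, locations)  # starts where the first ended
print(first)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(second)  # [9, 0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The second scan picks up the hint left by the first and starts mid-table, which is exactly why a cursor in portals can suddenly see rows in a rotated order.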

@alexanderlaw

But I wonder: why, after the first failure:
not ok 161 + portals 10908 ms
was there a second one:
not ok 181 + portals 10604 ms
(with additional differences, including the results of SELECT)?

If this is the case of concurrent backends reading the same table, how can the effect be accumulated?

Could you suggest some legitimate change to the postgres code that would make the issue, as you see it, easier to reproduce?

@alexanderlaw

One more run:

# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete prepared_xacts case namespace select_into transactions select_distinct arrays union portals hash_index subselect btree_index update join aggregates
...
ok 101       + portals                                 30240 ms
...
# parallel group (20 tests):  select_having select_distinct_on random select_implicit delete prepared_xacts case namespace select_into transactions select_distinct portals arrays union subselect hash_index update btree_index join aggregates
...
not ok 121   + portals                                 21661 ms
...
# parallel group (20 tests):  select_distinct_on select_having select_implicit random delete case namespace prepared_xacts select_into transactions select_distinct arrays union portals subselect hash_index update join btree_index aggregates
...
not ok 141   + portals                                 33281 ms
...
# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete case prepared_xacts namespace select_into select_distinct transactions union arrays portals subselect hash_index update join btree_index aggregates
...
not ok 161   + portals                                 31043 ms
...
# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete prepared_xacts case namespace select_distinct select_into transactions portals union arrays subselect hash_index btree_index update join aggregates
...
not ok 181   + portals                                 27208 ms
...
1..188
# 4 of 188 tests failed.

regression.diffs.zip

@knizhnik
Contributor

knizhnik commented Jan 10, 2025

If this is the case of concurrent backends reading the same table, how can the effect be accumulated?

The effect is not accumulated.
Please notice that different queries fail. The portals test contains statements that select different numbers of records from a cursor; for example, FETCH 4 in foo4; selects 4 records. And if some other backend is reading this table at that moment, all 4 records will be different.

@alexanderlaw

I've reproduced the issue once more, this time with autovacuum disabled, the test line repeated 7 times, and additional logging:

# parallel group (20 tests):  select_distinct_on select_having select_implicit random delete case prepared_xacts namespace select_into transactions select_distinct arrays union portals subselect hash_index update join btree_index aggregates
...
ok 101       + portals                                 38895 ms
...
# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete prepared_xacts case namespace select_into transactions select_distinct arrays union portals subselect hash_index join update btree_index aggregates
...
ok 121       + portals                                 14836 ms
...
# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete case prepared_xacts select_into namespace select_distinct transactions portals arrays union subselect update join hash_index aggregates btree_index
...
ok 141       + portals                                  7372 ms
...
# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete prepared_xacts case select_into namespace select_distinct transactions union portals arrays subselect hash_index aggregates join update btree_index
...
not ok 161   + portals                                 17792 ms
...
# parallel group (20 tests):  select_distinct_on random select_having select_implicit delete prepared_xacts case namespace select_into select_distinct transactions union arrays portals subselect hash_index join btree_index update aggregates
...
not ok 181   + portals                                 43659 ms
...
# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete prepared_xacts case namespace select_into select_distinct transactions union portals arrays subselect hash_index join update btree_index aggregates
...
not ok 201   + portals                                 31262 ms
...
# parallel group (20 tests):  select_distinct_on select_having random select_implicit delete prepared_xacts case select_distinct namespace select_into transactions union portals arrays subselect hash_index update join aggregates btree_index
...
not ok 221   + portals                                 25481 ms
...
# 4 of 228 tests failed.

As we can see, once the test fails, it keeps failing until the end of the run.

regression.diffs.zip

@knizhnik
Contributor

knizhnik commented Jan 12, 2025

Autovacuum is not related to the problem.
It is more or less obvious that if we have synchronized seqscans (two or more backends concurrently traversing the same table), they will affect the results of most queries.

@alexanderlaw

But in this case we have no two backends traversing the same table: the only other test that selects data from tenk2 is "join" (I dropped tenk2 before that group of tests to find this out), and I see the same test failure with a line like:
test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case aggregates transactions random portals arrays btree_index hash_index update delete namespace prepared_xacts
(i.e., without join).

So I still think the cause is different and that the order of tenk2 pages changes permanently. (As you can see from the regression.diffs attached above: #10278 (comment).)

@knizhnik
Contributor

I still fail to reproduce the problem locally with your patch and Neon/pg17:

./scripts/pytest test_runner/regress/test_pg_regress.py -k test_pg_regress

I also tried it on vanilla; the tests passed.

@alexanderlaw

I think I now finally understand what's happening here. Having reproduced it again, I've confirmed that the order/contents of pages on disk have not changed, and that synchronize_seqscans affects the order of SELECT results.

Autovacuum is not related to the problem. It is more or less obvious that if we have synchronized seqscans (two or more backends concurrently traversing the same table), they will affect the results of most queries.

That's true if "concurrently" refers not to timing but to the number of items in the shared scan_locations struct. As I wrote before, tenk2 is not used by parallel tests in v16 (and the first aforementioned failure, #10278 (comment), was produced on v16); it is used only in "join", and even there it is not actually scanned: all three queries with tenk2 error out before reaching the executor. So maybe it's the queries inside "portals" itself that affect one another.

The answer to the question "why we are not observing such failures in the buildfarm/vanilla postgres?" is written in clear text in 027_stream_regress.pl (see cbf4177):

# We'll stick with Cluster->new's small default shared_buffers, but since that
# makes synchronized seqscans more probable, it risks changing the results of
# some test queries.  Disable synchronized seqscans to prevent that.
$node_primary->append_conf('postgresql.conf', 'synchronize_seqscans = off');

Also, see https://www.postgresql.org/message-id/1258185.1648876239%40sss.pgh.pa.us,

So I suppose that moving portals to another test group wouldn't work.

@knizhnik
Contributor

Yes, concurrent execution of seqscans is not needed.
A seqscan periodically reports its position if synchronized scanning is allowed for the table (which depends on the size of the table and of shared buffers).
In Neon, with 1MB shared_buffers, you can easily reproduce it:

create table t(x integer unique, t text default repeat('x',100));
insert into t values (generate_series(1,1000000));
select * from t limit 1000;
  x   |                                                  t                                                   
------+------------------------------------------------------------------------------------------------------
    1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
...
 1000 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(1000 rows)

postgres=# select * from t limit 1000;
  x   |                                                  t                                                   
------+------------------------------------------------------------------------------------------------------
  929 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The answer to the question "why we are not observing such failures in the buildfarm/vanilla postgres?" is written in clear text in 027_stream_regress.pl (see cbf4177):

Sorry, I do not understand it. This code fragment disables synchronized seqscans for that particular test (027_stream_regress.pl).
But most regression tests are executed with synchronized seqscans enabled.
Maybe it is not reproduced in vanilla just because, with the default size of shared buffers (128MB), this condition will never be satisfied for the tables used in regression tests, particularly tenk2:

	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
		scan->rs_nblocks > NBuffers / 4)

The size of the tenk2 table is 345 pages, while in Neon regression tests the size of shared buffers is 1MB = 128 blocks.
So synchronized scanning is enabled for this table.
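The arithmetic behind that claim can be checked directly. This is an editorial sketch: the 8kB block size is the PostgreSQL default, and the page and shared_buffers numbers are the ones quoted in this thread.

```python
# Check the quoted condition: scan->rs_nblocks > NBuffers / 4
BLOCK_SIZE = 8 * 1024   # default PostgreSQL block size
TENK2_PAGES = 345       # size of tenk2, per this thread

def sync_scan_enabled(shared_buffers_bytes, table_pages):
    nbuffers = shared_buffers_bytes // BLOCK_SIZE
    return table_pages > nbuffers // 4

# Neon regression tests: shared_buffers = 1 MB -> NBuffers = 128, threshold 32
print(sync_scan_enabled(1 * 1024 * 1024, TENK2_PAGES))    # True

# Vanilla default: shared_buffers = 128 MB -> NBuffers = 16384, threshold 4096
print(sync_scan_enabled(128 * 1024 * 1024, TENK2_PAGES))  # False
```

So with Neon's 1MB shared_buffers the 345-page tenk2 clears the threshold easily, while under the vanilla 128MB default it falls far short of it.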

I do not think that we should do anything special about this issue.
As far as I understand, it is reproduced very rarely, so we can just ignore it.
Or explicitly disable synchronize_seqscans in the portals.sql test.
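If the last option were chosen, the change would presumably be a one-line guard at the top of the test file; a sketch, assuming the GUC can simply be pinned in portals.sql (and mirrored in expected/portals.out):

```sql
-- Hypothetical addition to portals.sql: pin the seqscan start point so
-- cursor results come back in the expected physical order.
SET synchronize_seqscans = off;
```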

@alexanderlaw

alexanderlaw commented Jan 13, 2025

Sorry, I do not understand it. This code fragment disables synchronized seqscans for that particular test (027_stream_regress.pl). But most regression tests are executed with synchronized seqscans enabled.

The regular (non-TAP) regression run is performed with large shared_buffers. shared_buffers = '128kb' is set only for TAP tests (by Cluster->new), as that comment says.

We can ignore it, but what exactly? If we see a test_pg_regress failure, should we ignore it and consider the test flaky, or should we look inside and ignore it only if portals fails? Moreover, what guarantees that no other test will fail in a similar way? (I can see tenk2 used in additional tests in v17, and I would expect more.)
