test_pg_regress failure in portals #10278
Comments
Looking back 30 days, there was one other case of this on 2024-12-07: https://neon-github-public-dev.s3.amazonaws.com/reports/main/12213126318/index.html#/testresult/4fed18972c23e9c0
An intermediate result: to reproduce/diagnose the failure, I adjusted the tests in the portals line to make them repeatable (probably not all of those tests are needed for reproduction; I haven't tried to reduce the list yet), repeated this line in parallel_schedule 5 times (see the attached parallel_schedule), and added a change to portals.sql (reflected in expected/portals.out); the exact addition is in the attached patch.
The complete patch for the tests: portals-test-diagnostic.zip. With these changes applied, I ran test_runner/regress/test_pg_regress.py::test_pg_regress[release-pg16-None] in parallel (-j15) and can reproduce the failure after several iterations.
Please look at the regression.diffs.zip produced. compute.log doesn't contain interesting information (I see no messages from autovacuum or the like); I will try to increase the logging level and reproduce once again.
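For readers without the attachments, a hypothetical sketch of the kind of order-sensitive cursor query the portals tests exercise (illustrative only, not the contents of the attached patch):

```sql
BEGIN;
-- No ORDER BY: rows come back in whatever order the underlying seqscan
-- produces them, so the first fetched values depend on where the scan starts.
DECLARE c1 CURSOR FOR SELECT unique1 FROM tenk2;
FETCH 3 FROM c1;
COMMIT;
```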
Can you try to reproduce it with Vanilla? The patch seems to be obvious: exclude portals from the test group.
But I wonder why it keeps failing after the first failure: if this is a case of concurrent backends reading the same table, how can the effect be accumulated? Couldn't you suggest some legitimate change to the postgres code to make it easier to reproduce the issue as you see it?
One more run:
The effect does not accumulate.
I've reproduced the issue once more, this time with autovacuum disabled, the line repeated 7 times, and additional logging:
As we can see, once the test fails, it keeps failing until the end.
So autovacuum is not related to the problem.
But in this case we don't have two backends traversing the same table: the only other test that selects data from tenk2 is "join" (I dropped tenk2 before that group of tests to find this out), and I see the same test failure with a schedule line like that. So I still think the cause is different and the order of tenk2 pages changes permanently (as you can see from the regression.diffs attached above: #10278 (comment)).
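As a hypothetical check (not necessarily the diagnostic actually used here), one can compare the physical row order of tenk2 with what an unordered scan returns, across a passing and a failing run:

```sql
-- ctid encodes (block, offset), so ordering by it reflects on-disk placement.
SELECT ctid, unique1 FROM tenk2 ORDER BY ctid LIMIT 5;

-- Without ORDER BY, the output simply follows the scan; if this differs from
-- the ctid order while the ctid order itself is stable, the pages have not
-- moved and only the scan's starting point has changed.
SELECT ctid, unique1 FROM tenk2 LIMIT 5;
```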
I still failed to reproduce the problem locally with your patch and Neon/pg17.
I also tried it with Vanilla and the tests pass.
I think now I finally understand what's happening here. Having reproduced it again, I've confirmed that the order/contents of pages on disk have not changed, and that synchronize_seqscans affects the order of SELECT results.

That's true if "concurrently" refers not to timing but to the number of entries in the shared scan_locations struct. As I wrote before, tenk2 is not used by parallel tests in v16 (and the first aforementioned failure, #10278 (comment), was produced on v16); it is used only in "join", but even there it is not actually scanned: all three queries involving tenk2 error out before reaching the executor. So maybe the queries inside "portals" itself affect one another.

The answer to the question "why are we not observing such failures in the buildfarm/vanilla postgres?" is written in plain text in 027_stream_regress.pl (see cbf4177):

```perl
# We'll stick with Cluster->new's small default shared_buffers, but since that
# makes synchronized seqscans more probable, it risks changing the results of
# some test queries.  Disable synchronized seqscans to prevent that.
$node_primary->append_conf('postgresql.conf', 'synchronize_seqscans = off');
```

Also see https://www.postgresql.org/message-id/1258185.1648876239%40sss.pgh.pa.us.

So I suppose moving portals to another test group wouldn't work.
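For readers unfamiliar with the GUC, a minimal sketch of the effect (assuming a regression database containing tenk2; the actual rows returned will vary):

```sql
-- With synchronize_seqscans = on and a small shared_buffers, an unordered
-- seqscan of a large table may start near the block another scan of the same
-- table last reported, so the first rows returned can differ from run to run.
SHOW synchronize_seqscans;          -- 'on' by default
SELECT unique1 FROM tenk2 LIMIT 3;  -- start block depends on the shared scan position

SET synchronize_seqscans = off;     -- what 027_stream_regress.pl sets cluster-wide
SELECT unique1 FROM tenk2 LIMIT 3;  -- now the scan always starts at block 0
```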
Yes, concurrent execution of seqscans is not needed.
Sorry, I do not understand it. This code fragment disables synchronized seqscans for that particular TAP test (027_stream_regress.pl).
The size of the tenk2 table is 345 pages. In the Neon regression tests the size of shared buffers is 1MB = 128 blocks. I do not think that we should do anything special about this issue.
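Putting rough numbers on that (the 345-page and 128-buffer figures come from this thread; the quarter-of-shared_buffers threshold is, to my understanding, PostgreSQL's heuristic in heapam.c's initscan()):

```sql
SELECT relpages FROM pg_class WHERE relname = 'tenk2';  -- ~345 pages
SHOW shared_buffers;                                    -- 1MB = 128 buffers of 8kB
-- Synchronized seqscans are considered when the table exceeds a quarter of
-- shared_buffers: 345 > 128 / 4 = 32, so every unordered scan of tenk2 in
-- this configuration may start at an arbitrary block.
```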
The regular (non-TAP) regression run is performed with large shared_buffers; shared_buffers = '128kb' is set only for TAP tests (by Cluster->new), as that comment says. We can ignore this, but what exactly? If we see a test_pg_regress failure, should we ignore it and consider the test flaky, or should we look inside and ignore it only if portals fails? Moreover, what guarantees that no other test will fail in a similar way? (I can see tenk2 used in additional tests in v17, and I would expect more.)
This test run is from the tip of main on 2025-01-02: https://neon-github-public-dev.s3.amazonaws.com/reports/main/12581797874/index.html#testresult/1efe3a56821553c7/
The .diff shows many unexpected results, snippet: