Instability in test_create_churn_during_restart
#9730
Comments
This test was seen to be flaky, e.g. at https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9457/11804246485/index.html#suites/ec4311502db344eee91f1354e9dc839b/982bd121ea698414/. If I _reduce_ the timeout from 10s to 8s on my laptop, it reliably hits that timeout and fails. That suggests the test is pretty close to the edge even when it passes. Let's bump the timeout up to 30s to make it more robust. See also #9730, although the error message is different there.
While looking at a timeout failure log for this test (#9736 (review)), one really strange thing I noticed is that we're creating a timeline at all in the main loop: it's meant to be sending redundant timeline creation requests for env.initial_timeline. I think that's a pageserver bug, perhaps a regression from the recent timeline lifecycle changes for offloading. If we get to the bottom of that, we should revert the timeout bump.
The test environment is initialized with:
That doesn't create the initial tenant/timeline; it merely generates random IDs for them. So one of the 'tenant_create' and 'timeline_create' calls in the main loop will succeed in creating the tenant and timeline.
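To make that point concrete, here is a toy sketch of the behavior described above. It is not the real pageserver or test fixture API: `FakePageserver` and its `timeline_create` method are hypothetical stand-ins, and the "already_exists" result is an assumption about how a redundant request would be reported. It only illustrates that generating IDs creates nothing, and that the first creation call does the real work while later ones are redundant.

```python
import uuid


class FakePageserver:
    """Hypothetical in-memory stand-in, NOT the real pageserver API."""

    def __init__(self):
        self.timelines = set()
        self.initdb_runs = 0

    def timeline_create(self, tenant_id, timeline_id):
        key = (tenant_id, timeline_id)
        if key in self.timelines:
            # Redundant request: the timeline is already there.
            return "already_exists"
        # First request for this ID pair: this is where initdb would run.
        self.initdb_runs += 1
        self.timelines.add(key)
        return "created"


# Generating IDs only picks random values; nothing exists yet.
tenant_id = uuid.uuid4().hex
timeline_id = uuid.uuid4().hex
ps = FakePageserver()
nothing_created_yet = len(ps.timelines) == 0

# The main loop keeps sending the same creation request.
results = [ps.timeline_create(tenant_id, timeline_id) for _ in range(3)]
```

In this model only one initdb run is expected per timeline ID, which is why a second initdb in the log for the same loop iteration looks suspicious.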
Hmm, I see two initdb calls in the pageserver log though, for the first iteration of the loop. One of them is apparently cancelled (I added a debug print here and it was printed), and the second one succeeds and creates the timeline. Weird...
What's the expected behavior of the test before the restart? Looking at the log, it seems that the initial tenant/timeline is not created successfully, because the pageserver receives a shutdown request immediately after it starts handling the creation request.
I think we should fix the test by:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9715/11793083611/index.html#/testresult/18965ea660d34c01
Failing since Wednesday 6th