Instability in test_create_churn_during_restart #9730

Open
jcsp opened this issue Nov 12, 2024 · 5 comments · May be fixed by #9767
@jcsp
Collaborator

jcsp commented Nov 12, 2024

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9715/11793083611/index.html#/testresult/18965ea660d34c01

fixtures.pageserver.http.PageserverApiException: wait for timeline initial uploads to complete: queue is in state Stopped

Failing since Wednesday 6th

hlinnaka added a commit that referenced this issue Nov 12, 2024
This test was seen to be flaky, e.g. at
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9457/11804246485/index.html#suites/ec4311502db344eee91f1354e9dc839b/982bd121ea698414/.
If I _reduce_ the timeout from 10s to 8s on my laptop, it reliably hits
that timeout and fails. That suggests that the test is pretty close to
the edge even when it passes. Let's bump up the timeout to 30 s to
make it more robust.

See also #9730, although
the error message is different there.
@jcsp
Collaborator Author

jcsp commented Nov 13, 2024

While looking at a timeout failure log in this test (#9736 (review)), one really strange thing is that we're creating a timeline at all in the main loop (it's meant to be sending redundant timeline creation requests for the env.initial_timeline) -- I think that's a pageserver bug, perhaps a regression from recent changes to timeline lifecycle stuff for offloading.

If we get to the bottom of that, we can then revert the bump to the timeout.

@hlinnaka
Contributor

The test environment is initialized with:

    env = neon_env_builder.init_configs()
    env.start()
    tenant_id: TenantId = env.initial_tenant
    timeline_id = env.initial_timeline

That doesn't create the initial tenant/timeline; it merely generates random IDs for them. So one of the 'tenant_create' and 'timeline_create' calls in the main loop will succeed in creating the tenant and timeline.
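
To illustrate the point, here is a toy model of IDs being generated on the client side while the actual creation only happens on the first successful request. This is a sketch only: `ToyPageserver` and `create_timeline` are hypothetical names invented for the example, not the real fixture or pageserver API.

```python
import secrets


class ToyPageserver:
    """Hypothetical in-memory stand-in for a pageserver; not the real API."""

    def __init__(self):
        self.timelines = set()

    def create_timeline(self, timeline_id: str) -> bool:
        """Idempotent create: returns True only if this call created the timeline."""
        if timeline_id in self.timelines:
            return False
        self.timelines.add(timeline_id)
        return True


# Generating an ID does not contact the server, so nothing exists yet.
timeline_id = secrets.token_hex(16)
ps = ToyPageserver()
assert timeline_id not in ps.timelines

# The first create request in the loop does the real work;
# the later ones are redundant and only confirm the timeline exists.
results = [ps.create_timeline(timeline_id) for _ in range(3)]
```

In this model, a timeline being created inside the main loop is expected rather than a bug, because nothing created it beforehand.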

@hlinnaka
Contributor

Hmm, I see two initdb calls in the pageserver log though, for the first iteration of the loop. One of them is apparently cancelled (I added a debug print there and it was printed), and the second one succeeds and creates the timeline. Weird...

hlinnaka added a commit that referenced this issue Nov 13, 2024
@skyzh
Member

skyzh commented Nov 14, 2024

What's the expected behavior of the test before the restart? Looking at the log, it seems that the initial tenant/timeline is not created successfully:

2024-11-12T08:52:57.245720Z  INFO attach{tenant_id=1cf8f19e855474304dcc9fbb62ce17e4 shard_id=0000 gen=00000001}: Attaching tenant attach_mode=Single
2024-11-12T08:52:57.245800Z  INFO attach{tenant_id=1cf8f19e855474304dcc9fbb62ce17e4 shard_id=0000 gen=00000001}:preload: listing remote timelines
2024-11-12T08:52:57.248838Z  INFO request{method=POST path=/v1/tenant/1cf8f19e855474304dcc9fbb62ce17e4/timeline request_id=2cdcc662-0089-4598-8c88-c68b5ce2097b}: Handling request
2024-11-12T08:52:57.250055Z  INFO attach{tenant_id=1cf8f19e855474304dcc9fbb62ce17e4 shard_id=0000 gen=00000001}:preload: found 0 timelines, and no manifest
2024-11-12T08:52:57.250201Z  INFO attach{tenant_id=1cf8f19e855474304dcc9fbb62ce17e4 shard_id=0000 gen=00000001}: Done
2024-11-12T08:52:58.741720Z  INFO attach{tenant_id=1cf8f19e855474304dcc9fbb62ce17e4 shard_id=0000 gen=00000002}: Attaching tenant attach_mode=Single
2024-11-12T08:52:58.741783Z  INFO attach{tenant_id=1cf8f19e855474304dcc9fbb62ce17e4 shard_id=0000 gen=00000002}:preload: listing remote timelines
2024-11-12T08:52:58.742722Z  INFO attach{tenant_id=1cf8f19e855474304dcc9fbb62ce17e4 shard_id=0000 gen=00000002}:preload: found 0 timelines, and no manifest

I think we should fix the test by:

  • either we want the timeline to be fully set up before restarting, in which case there should be no more issues
  • or we want to test the case "what happens if the tenant/timeline is not fully set up" during restarts
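
The first option could be sketched as a polling step that runs before the restart. This is a minimal sketch under assumptions: `poll_until` is a hypothetical helper written for this example, and the readiness check is a placeholder, not a real pageserver call.

```python
import time
from typing import Callable


def poll_until(predicate: Callable[[], bool], timeout: float = 30.0, interval: float = 0.1) -> None:
    """Call predicate repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


# Placeholder readiness check: reports "ready" on the third poll.
# In a real test this would query the pageserver for the timeline's state.
polls = {"count": 0}


def timeline_is_ready() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


# Block until the (simulated) timeline is fully set up, then it is safe to restart.
poll_until(timeline_is_ready, timeout=5.0, interval=0.01)
```

The second option would skip this wait on purpose and instead assert on whatever state the tenant is expected to be in after an interrupted creation.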

@skyzh skyzh linked a pull request Nov 14, 2024 that will close this issue
@skyzh
Member

skyzh commented Nov 14, 2024

it seems that the initial tenant/timeline is not created successfully,

It is not created successfully because the pageserver receives a shutdown request immediately after it starts handling the creation request.


3 participants