[v24.2.x] rm_stm: fix a race during partition shutdown #24938

Merged


vbotbuildovich
Collaborator

Backport of PR #24936

Currently the apply fiber can continue to run (and possibly add new
producers to the _producers map) while the state machine is shutting
down. This can manifest as crashes because the cleanup destroys
_producers without deregistering its entries properly.

First manifestation

Iterator invalidation in reset_producers(): it loops through _producers
with scheduling points while state machine apply adds new producers.

future<> rm_stm::stop() {
    // ...
    co_await _gate.close();
    co_await reset_producers(); // <-- interferes with state machine apply
    _metrics.clear();
    co_await raft::persisted_stm<>::stop();
    // ...
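
The interleaving above can be modelled without Seastar. The following is a minimal sketch, not the rm_stm code: `toy_stm`, `apply`, `stopping`, `honor_stop` and `leftover_after_stop` are hypothetical names, and the `yield` callback stands in for a scheduling point. To keep the demo well-defined it erases from a snapshot of keys; the real loop iterates the map directly, which is why a concurrent insert there can invalidate the iterator. The `stopping` guard is only one illustrative way to keep apply out of the cleanup window and is not claimed to be the fix in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model of a state machine with a producers map.
struct toy_stm {
    std::map<int, std::string> producers;
    bool stopping = false;

    // Stand-in for state machine apply. Returns false if the update is
    // rejected because shutdown has started (only when honor_stop is set).
    bool apply(int pid, bool honor_stop) {
        if (honor_stop && stopping) {
            return false;
        }
        producers.emplace(pid, "producer");
        return true;
    }

    // Stand-in for reset_producers(): erases every producer that existed
    // when the call started, invoking `yield` after each erase to model
    // a scheduling point where another fiber may run.
    void reset_producers(const std::function<void()>& yield) {
        std::vector<int> keys;
        for (const auto& kv : producers) {
            keys.push_back(kv.first);
        }
        for (int pid : keys) {
            producers.erase(pid);
            yield();
        }
    }
};

// Runs a shutdown in which "apply" fires at every scheduling point and
// returns how many producers survive the cleanup.
std::size_t leftover_after_stop(bool honor_stop) {
    toy_stm s;
    s.apply(1, honor_stop);
    s.apply(2, honor_stop);
    assert(s.producers.size() == 2);

    s.stopping = true;
    int next_pid = 100;
    s.reset_producers([&] { s.apply(next_pid++, honor_stop); });
    return s.producers.size();
}
```

Calling `leftover_after_stop(false)` leaves the two producers that were added during cleanup in the map, while `leftover_after_stop(true)` leaves it empty.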

Second manifestation

Crashes: every new producer registers itself with an intrusive list in
producer_state_manager using a safe link. If a new producer is
registered after reset_producers(), the map is destroyed in the state
machine destructor without unlinking that producer from
producer_state_manager, and the safe_link fires an assert.
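
A rough sketch of this second failure mode, under loud assumptions: `hook`, `producer`, `manager`, `mini_stm` and `violations_after_shutdown` are hypothetical stand-ins, not the Redpanda or Boost.Intrusive types. A real safe-link hook asserts in its destructor when it is still linked; here the destructor counts the violation and unlinks instead, so the demo stays well-defined and can be checked.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Hypothetical minimal intrusive hook for a circular doubly linked list.
struct hook {
    hook* prev = nullptr;
    hook* next = nullptr;
    bool is_linked() const { return prev != nullptr; }
    void unlink() {
        if (!is_linked()) {
            return;
        }
        prev->next = next;
        next->prev = prev;
        prev = next = nullptr;
    }
};

// Counts producers destroyed while still linked. A real safe-link hook
// would fire an assert at that point instead of counting.
int violations = 0;

struct producer {
    hook link;
    ~producer() {
        if (link.is_linked()) {
            ++violations;
            link.unlink(); // demo only: keeps the list valid
        }
    }
};

// Stand-in for producer_state_manager: owns only the list head.
struct manager {
    hook head;
    manager() { head.prev = head.next = &head; }
    void register_producer(producer& p) {
        assert(!p.link.is_linked());
        p.link.prev = head.prev;
        p.link.next = &head;
        head.prev->next = &p.link;
        head.prev = &p.link;
    }
    void deregister_producer(producer& p) { p.link.unlink(); }
};

// Stand-in for the state machine that owns the producers map.
struct mini_stm {
    manager& mgr;
    std::map<int, std::unique_ptr<producer>> producers;
    explicit mini_stm(manager& m) : mgr(m) {}
    void add_producer(int pid) {
        auto p = std::make_unique<producer>();
        mgr.register_producer(*p);
        producers.emplace(pid, std::move(p));
    }
    void reset_producers() {
        for (auto& kv : producers) {
            mgr.deregister_producer(*kv.second);
        }
        producers.clear();
    }
};

// Returns how many producers were destroyed while still registered.
int violations_after_shutdown(bool apply_after_reset) {
    violations = 0;
    manager mgr;
    {
        mini_stm s(mgr);
        s.add_producer(1);
        s.add_producer(2);
        s.reset_producers();
        if (apply_after_reset) {
            s.add_producer(3); // late apply slips in after cleanup
        }
    } // destructor destroys the map without deregistering
    return violations;
}
```

With `apply_after_reset` set, the late producer is destroyed by the map while still linked, which is the condition a safe-link hook asserts on; without it, every producer was deregistered first and nothing is counted.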

As far as I can tell, this bug has always been there; recent changes
that added more scheduling points in the surrounding code may have
made it worse.

(cherry picked from commit fb57ccd)
(cherry picked from commit 873b282)
@vbotbuildovich vbotbuildovich added this to the v24.2.x-next milestone Jan 27, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Jan 27, 2025
@vbotbuildovich
Collaborator Author

Retry command for Build#61213

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_fast_node_addition

@vbotbuildovich
Collaborator Author

CI test results

Test results on build#61213:

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.scaling_up_test.ScalingUpTest.test_fast_node_addition | ducktape | https://buildkite.com/redpanda/redpanda/builds/61213#0194a7bc-fc70-4b10-ae96-fc5aabf0e61c | FAIL | 0/1 |

@lf-rep lf-rep merged commit e99985d into redpanda-data:v24.2.x Jan 27, 2025
15 of 18 checks passed
@BenPope BenPope modified the milestones: v24.2.x-next, v24.2.17 Jan 31, 2025
Labels
area/redpanda kind/backport PRs targeting a stable branch
5 participants