Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] OCTRL-911 Transitions should not be performed concurrently #600

Merged
merged 1 commit into from
Jul 31, 2024

Conversation

knopers8
Copy link
Collaborator

Tested on staging with a few different cases when tasks go to ERROR, the protection works and there are no deadlocks.

@knopers8 knopers8 requested a review from teo July 30, 2024 13:30
teo
teo previously approved these changes Jul 30, 2024
@knopers8 knopers8 changed the title [core] OCTRL-911 Transitions should not be performed concurrently WIP [core] OCTRL-911 Transitions should not be performed concurrently Jul 31, 2024
@knopers8
Copy link
Collaborator Author

Do not merge yet, I managed to cause a segfault with two concurrent DESTROY attempts. While the 2nd one waits until the 1st one succeeds, it sees a nil environment and it crashes. I have to fix that first.

@knopers8
Copy link
Collaborator Author

knopers8 commented Jul 31, 2024

OK, dumb me, it's good now (I checked whether envs.environment() returned error too late).

@knopers8 knopers8 changed the title WIP [core] OCTRL-911 Transitions should not be performed concurrently [core] OCTRL-911 Transitions should not be performed concurrently Jul 31, 2024
@knopers8 knopers8 requested a review from teo July 31, 2024 08:07
@knopers8 knopers8 merged commit 48ea72a into AliceO2Group:master Jul 31, 2024
2 checks passed
@knopers8 knopers8 deleted the fix-concurrent-transitions branch July 31, 2024 12:07
knopers8 added a commit to knopers8/Control that referenced this pull request Jul 31, 2024
This is a fix to avoid running a second teardown attempt after a successful one.
This could have happened if the 2nd teardown request was received while the 1st teardown attempt would be still ongoing.
In principle AliceO2Group#600 was already enough to avoid concurrency issues and I could not make the core misbehave with it, but this commit attempts to make the behaviour somewhat more correct.

I hesitated to set the state as DONE if TeardownEnvironment returns earlier with an error, since this does not remove given environment from the list of enviroments, thus technically it is still available for more teardown attempts.
On another hand, this contradicts the already published kafka events, which advertize the env state as DONE even when it ends with error and the state is not set anyway.
knopers8 added a commit to knopers8/Control that referenced this pull request Jul 31, 2024
This is a fix to avoid running a second teardown attempt after a successful one.
This could have happened if the 2nd teardown request was received while the 1st teardown attempt would be still ongoing.
In principle AliceO2Group#600 was already enough to avoid concurrency issues and I could not make the core misbehave with it, but this commit attempts to make the behaviour somewhat more correct.

I hesitated to set the state as DONE if TeardownEnvironment returns earlier with an error, since this does not remove given environment from the list of enviroments, thus technically it is still available for more teardown attempts.
On another hand, this contradicts the already published kafka events, which advertize the env state as DONE even when it ends with error and the state is not set anyway.
teo pushed a commit that referenced this pull request Aug 1, 2024
This is a fix to avoid running a second teardown attempt after a successful one.
This could have happened if the 2nd teardown request was received while the 1st teardown attempt would be still ongoing.
In principle #600 was already enough to avoid concurrency issues and I could not make the core misbehave with it, but this commit attempts to make the behaviour somewhat more correct.

I hesitated to set the state as DONE if TeardownEnvironment returns earlier with an error, since this does not remove given environment from the list of enviroments, thus technically it is still available for more teardown attempts.
On another hand, this contradicts the already published kafka events, which advertize the env state as DONE even when it ends with error and the state is not set anyway.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants