Skip to content
This repository has been archived by the owner on Aug 21, 2024. It is now read-only.

fix(concurrency): stopping the program if one of the threads panics #2089

Conversation

meship-starkware
Copy link
Contributor

@meship-starkware meship-starkware commented Jul 16, 2024

This change is Reviewable

Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion


crates/blockifier/src/blockifier/transaction_executor.rs line 236 at r1 (raw file):

        // Change the defult way of handling panics to write to stderr.
        // this ensure that the panic massge and location will be printed to stderr.
        // without adding flags to the cargo command.

The upside of this method is that I can print both the location and the error message to stderr without adding a flag.
The downside is that every panic that occurs after this will have this behavior. I think I can solve this if I save the original hook and set it back after all the threads finish their execution.

Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 2 unresolved discussions


crates/blockifier/src/blockifier/transaction_executor.rs line 250 at r1 (raw file):

            for _ in 0..self.config.concurrency_config.n_workers {
                let worker_executor = Arc::clone(&worker_executor);
                s.spawn(move || {

If there are future memory leaks, it might be a place to look at because I am not sure we release the memory of the running threads
when aborting the process.

@meship-starkware meship-starkware requested a review from noaov1 July 16, 2024 08:09
@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from 1ad44db to 42e719e Compare July 16, 2024 08:10
@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from 42e719e to b555dda Compare July 16, 2024 10:06
Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @avi-starkware, @noaov1, and @Yoni-Starkware)


crates/blockifier/src/concurrency/worker_logic.rs line 120 at r3 (raw file):

    fn execute(&self, tx_index: TxIndex) {
        self.execute_tx(tx_index);

Please ignore. It is here to check the Python test. I will remove this when I am done.

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch 4 times, most recently from 7a3ffd6 to 234757c Compare July 16, 2024 11:40
Copy link
Collaborator

@Yoni-Starkware Yoni-Starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion (waiting on @avi-starkware, @meship-starkware, and @noaov1)


crates/blockifier/src/concurrency/worker_logic.rs line 89 at r5 (raw file):

        if self.scheduler.done() {
            return;
        };

What are we gaining here?

Code quote:

        if self.scheduler.done() {
            return;
        };

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from 234757c to b1b86eb Compare July 16, 2024 14:25
Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion (waiting on @avi-starkware, @noaov1, and @Yoni-Starkware)


crates/blockifier/src/concurrency/worker_logic.rs line 89 at r5 (raw file):

Previously, Yoni-Starkware (Yoni) wrote…

What are we gaining here?

No new threads will take tasks after the program is crushed.

@avi-starkware
Copy link
Collaborator

crates/blockifier/src/concurrency/worker_logic.rs line 89 at r5 (raw file):

Previously, meship-starkware (Meshi Peled) wrote…

No new threads will take tasks after the program is crushed.

This is not needed.
If the done marker is set to true, then all workers will stop when they next ask for a new task.
Entering the commit phase is also not a problem, because try_commit will return None if done_marker is true.

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from b1b86eb to 8206ffa Compare July 16, 2024 14:32
@avi-starkware
Copy link
Collaborator

crates/blockifier/src/concurrency/scheduler.rs line 239 at r5 (raw file):

    pub fn mark_done(&self) {
        self.done_marker.store(true, Ordering::Release);

done is an implementation detail of the scheduler. Not part of its interface. please revert to private method.
The method halt should be used to prematurely stop the scheduler.
Please rename the method halt, move it below get_n_committed_transactions (all pub methods should be at the top), and use it in the method halt_scheduler of the transaction committer.

Suggestion:

    fn done(&self) -> bool {
        self.done_marker.load(Ordering::Acquire)
    }

    pub fn halt(&self) {
        self.done_marker.store(true, Ordering::Release);

Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 4 files reviewed, 2 unresolved discussions (waiting on @avi-starkware, @noaov1, and @Yoni-Starkware)


crates/blockifier/src/concurrency/worker_logic.rs line 89 at r5 (raw file):

Previously, avi-starkware wrote…

This is not needed.
If the done marker is set to true, then all workers will stop when they next ask for a new task.
Entering the commit phase is also not a problem, because try_commit will return None if done_marker is true.

Done.

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from 8206ffa to 01f4ae5 Compare July 16, 2024 14:35
Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 4 files reviewed, 2 unresolved discussions (waiting on @avi-starkware, @noaov1, and @Yoni-Starkware)


crates/blockifier/src/concurrency/scheduler.rs line 239 at r5 (raw file):

Previously, avi-starkware wrote…

done is an implementation detail of the scheduler. Not part of its interface. please revert to private method.
The method halt should be used to prematurely stop the scheduler.
Please rename the method halt, move it below get_n_committed_transactions (all pub methods should be at the top), and use it in the method halt_scheduler of the transaction committer.

Done.

@avi-starkware
Copy link
Collaborator

crates/blockifier/src/blockifier/transaction_executor.rs line 247 at r7 (raw file):

                        println!("Thread {} caught a panic, propagating it.", i);
                        worker_executor.scheduler.halt();
                        panic::resume_unwind(err);

I think we need an abort guard.
In case the method halt panics, we won't reach the resume_unwind, so the process needs to be terminated.
To be sure, try to put a panic!() inside the method halt and check if the process ends.

Additionally, I think we don't need the variable result. Just do if let Err(err) = catch_unwind(.... What do you think?

Suggestion:

                    let abort_guard = AbortGuard;
                    let result = catch_unwind(AssertUnwindSafe(|| {
                        worker_executor.run();
                    }));
                    if let Err(err) = result {
                        println!("Worker executor thread panicked. {err:?}: {err}");
                        worker_executor.scheduler.halt();
                        panic::resume_unwind(err);
                        abort_guard.release();

Copy link
Collaborator

@avi-starkware avi-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r5, 1 of 2 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @meship-starkware and @noaov1)

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch 2 times, most recently from 7b74bac to 4d21f6d Compare July 17, 2024 11:52
Copy link
Collaborator

@avi-starkware avi-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @meship-starkware and @noaov1)


crates/blockifier/src/concurrency/utils.rs line 12 at r9 (raw file):

impl Drop for AbortGuard {
    fn drop(&mut self) {
        eprintln!("detected unexpected panic; aborting");

Suggestion:

Thread panicked unexpectedly; aborting");

crates/blockifier/src/concurrency/worker_logic.rs line 120 at r3 (raw file):

Previously, meship-starkware (Meshi Peled) wrote…

Please ignore. It is here to check the Python test. I will remove this when I am done.

Please temporarily add

// FIXME: Remove the semicolon before merging

So we won't forget this change. (The semicolon by itself is easy to miss.)

@avi-starkware
Copy link
Collaborator

crates/blockifier/src/concurrency/utils.rs line 12 at r9 (raw file):

impl Drop for AbortGuard {
    fn drop(&mut self) {
        eprintln!("detected unexpected panic; aborting");

Sorry. Please make it Thread panicked unexpectedly; aborting process.

Copy link
Collaborator

@avi-starkware avi-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @meship-starkware and @noaov1)


crates/blockifier/src/blockifier/transaction_executor.rs line 244 at r9 (raw file):

                s.spawn(move || {
                    // to make sure that if one of the threads panicked all threads would stop,
                    // and the main thread would panic.

Suggestion:

                    // If a panic is not handled or the handling logic itself panics,
                    // then we abort the program.

crates/blockifier/src/blockifier/transaction_executor.rs line 249 at r9 (raw file):

                        worker_executor.run();
                    }));
                    if let Err(err) = result {

Suggestion:

                    if let Err(err) =  catch_unwind(AssertUnwindSafe(|| {
                        worker_executor.run();
                    })) {

crates/blockifier/src/blockifier/transaction_executor.rs line 251 at r9 (raw file):

                    if let Err(err) = result {
                        // this make sure that the program will abort if a panic accured
                        // while halting the scheduler.

I think this comment is out of place.

Suggestion:

                    if let Err(err) = result {

crates/blockifier/src/concurrency/utils.rs line 7 at r9 (raw file):

// This stract is used to abort the program if a
// panic ocurred in a place where it cannot be handled.

Suggestion:

// This struct is used to abort the program if a
// panic occurred in a place where it could not be handled.

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch 2 times, most recently from edfafec to 81c1892 Compare July 17, 2024 16:00
Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 5 files reviewed, 5 unresolved discussions (waiting on @avi-starkware and @noaov1)


crates/blockifier/src/blockifier/transaction_executor.rs line 247 at r7 (raw file):

Previously, avi-starkware wrote…

I think we need an abort guard.
In case the method halt panics, we won't reach the resume_unwind, so the process needs to be terminated.
To be sure, try to put a panic!() inside the method halt and check if the process ends.

Additionally, I think we don't need the variable result. Just do if let Err(err) = catch_unwind(.... What do you think?

Done.


crates/blockifier/src/blockifier/transaction_executor.rs line 249 at r9 (raw file):

                        worker_executor.run();
                    }));
                    if let Err(err) = result {

Done.


crates/blockifier/src/blockifier/transaction_executor.rs line 251 at r9 (raw file):

Previously, avi-starkware wrote…

I think this comment is out of place.

Is this Ok?


crates/blockifier/src/concurrency/utils.rs line 7 at r9 (raw file):

// This stract is used to abort the program if a
// panic ocurred in a place where it cannot be handled.

Done.

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from 81c1892 to 7c36bf8 Compare July 17, 2024 16:18
Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 5 files reviewed, 5 unresolved discussions (waiting on @avi-starkware and @noaov1)


crates/blockifier/src/blockifier/transaction_executor.rs line 244 at r9 (raw file):

                s.spawn(move || {
                    // to make sure that if one of the threads panicked all threads would stop,
                    // and the main thread would panic.

Done.

@avi-starkware
Copy link
Collaborator

crates/blockifier/src/blockifier/transaction_executor.rs line 253 at r12 (raw file):

                        // If the program panics here, the abort guard will exit the program.
                        // In this case, no panic message will be logged. Adding the cargo flag
                        // --nocapture will log the panic messages.

Suggestion:

                        // In this case, no panic message will be logged. Add the cargo flag
                        // --nocapture to log the panic message.

Copy link
Collaborator

@avi-starkware avi-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r10, 1 of 2 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @meship-starkware and @noaov1)

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from 7c36bf8 to 41492f7 Compare July 18, 2024 11:25
Copy link
Contributor Author

@meship-starkware meship-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 5 files reviewed, 1 unresolved discussion (waiting on @avi-starkware and @noaov1)


crates/blockifier/src/blockifier/transaction_executor.rs line 253 at r12 (raw file):

                        // If the program panics here, the abort guard will exit the program.
                        // In this case, no panic message will be logged. Adding the cargo flag
                        // --nocapture will log the panic messages.

Done.

Copy link
Collaborator

@avi-starkware avi-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @noaov1)

avi-starkware
avi-starkware previously approved these changes Jul 18, 2024
Copy link
Collaborator

@avi-starkware avi-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @noaov1)

Yoni-Starkware
Yoni-Starkware previously approved these changes Jul 18, 2024
Copy link
Collaborator

@Yoni-Starkware Yoni-Starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r5, 1 of 2 files at r7, 1 of 3 files at r10, 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @meship-starkware and @noaov1)


crates/blockifier/src/concurrency/utils.rs line 13 at r13 (raw file):

    fn drop(&mut self) {
        eprintln!("detected unexpected panic; aborting");
        ::std::process::abort();

?

Suggestion:

std::process::abort();

@meship-starkware meship-starkware force-pushed the meship/make_the_main_thread_to_panic_when_one_panic branch from 41492f7 to 124aa69 Compare July 18, 2024 11:58
Copy link
Collaborator

@Yoni-Starkware Yoni-Starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r14, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @noaov1)

@meship-starkware meship-starkware merged commit 3afe8b2 into main-v0.13.2 Jul 18, 2024
11 checks passed
@meship-starkware meship-starkware deleted the meship/make_the_main_thread_to_panic_when_one_panic branch July 18, 2024 12:46
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants