Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASSANDRA-20248: Fix failing repair test #3828

Open
wants to merge 1 commit into
base: cep-15-accord
Choose a base branch
from

Conversation

ifesdjeen
Copy link
Contributor

No description provided.

Patch by Alex Petrov; reviewed by David Capwell for CASSANDRA-20248.
@ifesdjeen ifesdjeen requested a review from dcapwell January 24, 2025 10:59
@@ -150,6 +150,8 @@ public static void setupCluster() throws IOException
if (throwable.getClass().toString().contains("InstanceShutdown") || // can't check instanceof as it is thrown by a different classloader
throwable.getMessage() != null && throwable.getMessage().contains("Parent repair session with id"))
return true;
if (throwable.getMessage().contains("Corrupted: Corrupted"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i don't get, why is this passing in trunk and been stable there, but cep-15-accord is unstable? What is unique about cep-15-accord that makes this unstable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fact this error is making it to this code path means we are not handling the error, and we have been since 4.0... why is cep-15-accord not handling this error anymore?

Copy link
Contributor

@dcapwell dcapwell Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the root cause. CASSANDRA-20069 removed the error handling that this test depends on, so we need to revert that to get the test passing, not telling the test to ignore this exception.

jvm-dtest should do

# org.apache.cassandra.distributed.impl.AbstractCluster#uncaughtExceptions
InstanceClassLoader cl = (InstanceClassLoader) thread.getContextClassLoader();
get(cl.getInstanceId()).uncaughtException(thread, error);

this causes us to hit our failure detection logic, which leads to C* to halt the JVM due to stability issues (what this test is testing). Without that logic org.apache.cassandra.distributed.impl.InstanceKiller$InstanceShutdown doesn't get thrown and this test fails (because we stopped handling the exception that this test is trying to make sure we do)

Copy link
Contributor

@dcapwell dcapwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test looks to show that we stopped handling an exception we are supposed to handle, so telling the test to ignore it seems like the wrong behavior.

@ifesdjeen
Copy link
Contributor Author

@dcapwell could you take another look? We do of course trigger killCurrentJVM. This logic was changed and improved, so we simply report this exception in addition to catching it:

                    AbstractCluster.this.uncaughtException(t, e);
                    stabilityInspector.accept(e);

@belliottsmith belliottsmith force-pushed the cep-15-accord branch 2 times, most recently from d7764b3 to fee0e64 Compare February 3, 2025 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants