[Meta] Segment Replication flaky test failures #8279
I thought @Rishikesh1159 already fixed the testNodeDropWithOngoingReplication flaky test in #8441.

@anasalkouz my PR #8441 to fix the testNodeDropWithOngoingReplication flaky test was merged 5 days ago, on 5th July. Any PR opened before the fix, or any branch that was not rebased onto the latest changes, will still hit this flaky test.

@kotwanikunal did you pull the testNodeDropWithOngoingReplication fix before your run, or is there something else that needs to be fixed?

The fix was first added in gradle run 19120, and the checks here are from runs 18400 to 19300.

Recent runs from 19000 to 20900 (failures <= 2 removed for brevity).

Tests with failures:

More tests with failures:

Created issue for
Created a separate meta for remote store only tests: #9467. We can continue to use this issue for SR-related IT failures. cc @sachinpkale @Bukhtawar
Tried another run with start id 24000 (~12 days back); below are the tests related to segment replication that are still failing occasionally on gradle check. Created tracking issues for the top 5. CC @sachinpkale @mch2 @anasalkouz

Complete flaky report
Do we need to keep tracking them here? It seems they failed very few times; I don't think those should be considered high priority.
Thanks @anasalkouz for the suggestion. Yes, failures have reduced drastically compared to the previous run, and we can track these issues separately owing to the lower rate of failures. I did a quick scan and observed that a doc-count mismatch assertion trips in all cases except one, #10029, where a relocating primary shard continues to perform a round of segment replication while no longer in primary mode (this is the node-to-node communication case). That one is not problematic, as the replica should start replication with the new primary and cancel the ongoing replication with the older primary. The issues failing with a doc-count mismatch need a deeper dive. As suggested, I am closing this issue; the flaky tests can be tracked separately.
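
To illustrate the #10029 case, here is a minimal sketch of the expected replica-side behavior, with hypothetical names (ReplicaReplicationSketch, ReplicationTarget, and onNewPrimary are illustrative, not the actual OpenSearch classes): the replica keeps at most one in-flight replication round and, on observing a higher primary term, cancels the round against the old primary before starting one against the new primary.

```java
// Hypothetical sketch of the expected replica behavior; names are
// illustrative, not the actual OpenSearch implementation.
final class ReplicaReplicationSketch {

    // Stand-in for an in-flight segment replication round.
    static final class ReplicationTarget {
        final long primaryTerm;

        ReplicationTarget(long primaryTerm) {
            this.primaryTerm = primaryTerm;
        }

        void cancel(String reason) {
            System.out.println("cancelled round for term " + primaryTerm + ": " + reason);
        }
    }

    private ReplicationTarget ongoing;

    // Invoked when the replica observes a primary with a higher term,
    // e.g. after a relocation hands off primary mode to a new node.
    synchronized void onNewPrimary(long newPrimaryTerm) {
        if (ongoing != null && ongoing.primaryTerm < newPrimaryTerm) {
            ongoing.cancel("primary changed"); // drop the round with the old primary
        }
        ongoing = new ReplicationTarget(newPrimaryTerm); // start against the new primary
        System.out.println("started round for term " + newPrimaryTerm);
    }

    public static void main(String[] args) {
        ReplicaReplicationSketch replica = new ReplicaReplicationSketch();
        replica.onNewPrimary(1); // initial primary
        replica.onNewPrimary(2); // relocation: old round cancelled, new one started
    }
}
```
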
I hit one of the tests in the above list again and opened #10303.
Meta issue to track flaky test failures related to segment replication. This issue tracks the recent surge in flaky test failures after the remote store integration, where all existing segment replication integration tests are also run with remote store. The report below shows the top hitters.
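
For context, a minimal sketch of such an integration test is below, assuming the standard OpenSearchIntegTestCase harness. The index setting "index.replication.type" is what selects segment replication; the remote-store wiring (repository settings and node attributes) varies across 2.x versions and is omitted here. The retried doc-count check at the end is the kind of assertion that trips in the flaky runs.

```java
import org.opensearch.common.settings.Settings;
import org.opensearch.test.OpenSearchIntegTestCase;

// A sketch only: the real SR integration tests live in the OpenSearch repo
// and additionally parameterize the remote-store setup, omitted here.
public class SegmentReplicationSketchIT extends OpenSearchIntegTestCase {

    public void testReplicaDocCount() throws Exception {
        // "index.replication.type: SEGMENT" switches the index from document
        // replication to segment replication.
        createIndex("test-index", Settings.builder()
            .put("index.number_of_shards", 1)
            .put("index.number_of_replicas", 1)
            .put("index.replication.type", "SEGMENT")
            .build());
        ensureGreen("test-index");

        client().prepareIndex("test-index").setId("1").setSource("field", "value").get();
        refresh("test-index");

        // Replicas catch up asynchronously, so the doc-count assertion must be
        // retried; an unretried check is exactly the kind that goes flaky.
        assertBusy(() -> assertEquals(1L, client().prepareSearch("test-index")
            .setSize(0).get().getHits().getTotalHits().value));
    }
}
```
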
Linking some of the existing open issues below. We can start with the ones that are the top hitters based on the numbers above:
org.opensearch.remotestore.RemoteStoreIT.testStaleCommitDeletionWithInvokeFlush and org.opensearch.remotestore.RemoteStoreIT.testStaleCommitDeletionWithoutInvokeFlush are flaky #8658
Related: #5669