Fix flaky test SegmentReplicationRemoteStoreIT.testPressureServiceStats #8827
…ds when computing replication stats. Signed-off-by: Marc Handalian <[email protected]>
…. The API should still show a result for the replica even if it has not sync'd. Signed-off-by: Marc Handalian <[email protected]>
Codecov Report
@@             Coverage Diff              @@
##               main    #8827      +/-   ##
============================================
+ Coverage     70.76%   70.89%   +0.13%
- Complexity    57111    57155      +44
============================================
  Files          4771     4771
  Lines        270241   270251      +10
  Branches      39500    39504       +4
============================================
+ Hits         191237   191598     +361
+ Misses        62846    62551     -295
+ Partials      16158    16102      -56
This looks like a bug introduced with #8715. I'm looking at this and re-opening that issue.
What is the cluster state at this point? Is the other replica active?
@mch2: How flaky was this test before this fix? Does your fix completely fix the flakiness of this test?
Hmm, I didn't get a dump of the cluster state here, but the new primary is the only allocated shard; there is no additional replica.
This was one of the rarer ones according to #8279. I ran it about 2-3k times on repeat after this change and didn't see it again, but I've seen this situation before where it's hard to reproduce outside of CI. The other test linked here is testPrimaryRelocation; I have a follow-up PR to fix that one, but its failure is unrelated to this fix.
…ts (#8827)
* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.
* Fix relocation IT relying on stats to determine if segrep has occurred. The API should still show a result for the replica even if it has not sync'd.
Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit a3baa68)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…ts (#8827) (#8855)
* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.
* Fix relocation IT relying on stats to determine if segrep has occurred. The API should still show a result for the replica even if it has not sync'd.
(cherry picked from commit a3baa68)
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
This test was recently marked as non-flaky, but I caught a failure while testing a fix for #8059. The test occasionally fails on an assertion on segrep stats after a failover event. The assertion checks that no stats are returned right after the event, because the new primary has no allocated replicas.
The assertion trips occasionally because the former primary has not yet been removed as an in-sync allocation id within the replication group, so it is included in the result set. Furthermore, we are computing and returning checkpoint timers for a shard that will never catch up, as it is about to be removed.
This PR fixes this by excluding allocation ids that are marked in-sync but are unavailable in the replication group, meaning they do not have a routing entry.
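The filtering rule described above can be illustrated with a minimal, standalone sketch. It is not the actual ReplicationTracker change: the class `SegrepStatsFilterSketch`, the `CopyState` type, and the `copiesToReport` method are hypothetical names invented for illustration, and "unavailable" is modeled simply as "allocation id has no routing entry".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SegrepStatsFilterSketch {

    /** Hypothetical per-copy tracking state; not the real OpenSearch class. */
    static final class CopyState {
        final boolean inSync;

        CopyState(boolean inSync) {
            this.inSync = inSync;
        }
    }

    /**
     * Returns the allocation ids that stats should be reported for:
     * in-sync copies other than the primary that still have a routing entry.
     */
    static List<String> copiesToReport(
        String primaryAllocationId,
        Map<String, CopyState> tracked,
        Set<String> routedAllocationIds
    ) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, CopyState> entry : tracked.entrySet()) {
            String allocationId = entry.getKey();
            boolean isPrimary = allocationId.equals(primaryAllocationId);
            // Assumption: a copy without a routing entry counts as unavailable.
            boolean unavailable = !routedAllocationIds.contains(allocationId);
            if (!isPrimary && entry.getValue().inSync && !unavailable) {
                result.add(allocationId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, CopyState> tracked = new LinkedHashMap<>();
        tracked.put("new-primary", new CopyState(true));
        // The former primary is still marked in-sync right after failover.
        tracked.put("former-primary", new CopyState(true));

        // Only the new primary has a routing entry at this point.
        Set<String> routed = Set.of("new-primary");

        // Prints "[]": the stale former primary is excluded from the stats.
        System.out.println(copiesToReport("new-primary", tracked, routed));
    }
}
```

Without the routing-entry check, the same input would report `former-primary`, which corresponds to the unexpected stats entry that tripped the test assertion.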
Related Issues
#7592
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.