[BUG] SegmentReplicationRelocationIT.testPrimaryRelocation flaky test failure #10029

Closed
dreamer-89 opened this issue Sep 13, 2023 · 1 comment · Fixed by #10701
Labels
bug (Something isn't working)
flaky-test (Random test failure that succeeds on second run)
Indexing:Replication (Issues and PRs related to core replication framework eg segrep)
Indexing (Indexing, Bulk Indexing and anything related to indexing)
>test-failure (Test failure from CI, local build, etc.)

Comments

@dreamer-89
Member

dreamer-89 commented Sep 13, 2023

Coming from #8279 (comment), SegmentReplicationRelocationIT.testPrimaryRelocation is flaky.

Builds with test failures: 24158, 25031, 25325

https://build.ci.opensearch.org/job/gradle-check/25325/testReport/org.opensearch.indices.replication/SegmentReplicationRelocationIT/testPrimaryRelocation/

./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocation" -Dtests.seed=ED1E8CCE7D7E124E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nb-NO -Dtests.timezone=Universal -Druntime.java=20

An assertion in the code trips:

java.lang.AssertionError: shard [test-idx-1][0], node[YFaFseUBQ22NAic2tf48Jw], relocating [IjN5va7VRbO1qpzpYNmImQ], [P], s[RELOCATING], a[id=T4OogwaCT4C05CjkmnGNBQ, rId=uzJujhPHQqKdJk81axlU-g], expected_shard_size[230] is not a primary shard in primary mode
	at __randomizedtesting.SeedInfo.seed([ED1E8CCE7D7E124E]:0)
	at org.opensearch.index.shard.IndexShard.assertPrimaryMode(IndexShard.java:2532)
	at org.opensearch.index.shard.IndexShard.getReplicationGroup(IndexShard.java:3269)
	at org.opensearch.indices.replication.SegmentReplicationSourceHandler.sendFiles(SegmentReplicationSourceHandler.java:149)
	at org.opensearch.indices.replication.OngoingSegmentReplications.startSegmentCopy(OngoingSegmentReplications.java:123)
	at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:149)
	at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:146)
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:454)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)
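
A minimal sketch of one interleaving that would be consistent with the trace above, assuming the shard leaves primary mode during relocation handoff while a segment-files request is still being handled. This is an illustration only, not a confirmed root cause: `ShardSketch`, `completeRelocationHandoff`, and `handleCopyRequest` are hypothetical names and are not OpenSearch APIs. The real check in the trace is `IndexShard.assertPrimaryMode`, which surfaced as an `AssertionError`; the sketch uses an `IllegalStateException` so that it behaves the same with or without `-ea`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class PrimaryModeRaceSketch {

    /** Hypothetical stand-in for a relocating primary shard; not the OpenSearch IndexShard. */
    static final class ShardSketch {
        private final AtomicBoolean primaryMode = new AtomicBoolean(true);

        // Models the relocation handoff: the old primary leaves primary mode.
        void completeRelocationHandoff() {
            primaryMode.set(false);
        }

        // Models a guarded accessor that is only valid while in primary mode.
        String getReplicationGroup() {
            if (!primaryMode.get()) {
                throw new IllegalStateException("not a primary shard in primary mode");
            }
            return "replication-group";
        }
    }

    /**
     * Replays one ordering of the two events sequentially, so the outcome is deterministic.
     *
     * @param handoffBeforeCopy whether the handoff completes before the copy request is handled
     * @return true if the copy request passed the primary-mode check, false if the check tripped
     */
    static boolean handleCopyRequest(boolean handoffBeforeCopy) {
        ShardSketch shard = new ShardSketch();
        if (handoffBeforeCopy) {
            shard.completeRelocationHandoff();
        }
        try {
            shard.getReplicationGroup();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("copy handled before handoff passes check: " + handleCopyRequest(false));
        System.out.println("copy handled after handoff passes check: " + handleCopyRequest(true));
    }
}
```

The two orderings are replayed sequentially rather than with real threads so the result does not depend on scheduling. In the actual test the ordering would depend on timing, which would fit an intermittent failure.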

dreamer-89 added the bug, >test-failure, untriaged, flaky-test, and Indexing:Replication labels on Sep 13, 2023
kotwanikunal added the Indexing label on Sep 19, 2023

@gbbafna
Collaborator

gbbafna commented Sep 27, 2023

I now see a different stack trace for this failure:

[2023-09-26T10:27:38,324][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t1] [shardId [test-idx-1][0]] [replication id 253] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=28, FILE_DIFF=3, REPLICATING=0, GET_FILES=1}
org.opensearch.indices.replication.common.ReplicationFailedException: Store corruption during replication
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:525) [main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [main/:?]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [main/:?]
	at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$4.onResponse(ActionListener.java:182) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [main/:?]
	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleResponse(TraceableTransportResponseHandler.java:73) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1493) [main/:?]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:394) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:388) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: org.opensearch.OpenSearchCorruptionException: java.nio.file.NoSuchFileException: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationRelocationIT_251C3100A4B6F0D0-001/tempDir-002/node_t1/nodes/0/indices/UIuUJuTpQM2-n-YI2FrpMw/0/index/_w.si
	at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:289) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:178) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 23 more

https://build.ci.opensearch.org/job/gradle-check/26306/testReport/junit/org.opensearch.indices.replication/SegmentReplicationRelocationIT/testPrimaryRelocation/
