[BUG] Segment replication failure #16149

Open
disaster37 opened this issue Oct 1, 2024 · 2 comments
Labels
bug (Something isn't working), Storage:Remote

Comments

@disaster37

Describe the bug

After enabling segment replication, I get error logs showing that segment replication has failed.
The failures appear to be linked to the ISM policy: they occur when the force_merge action runs to reduce each shard to a single segment.

The logs:

{"type": "json_logger", "timestamp": "2024-09-30T18:58:29,990Z", "level": "ERROR", "component": "o.o.i.r.SegmentReplicationTargetService", "cluster.name": "logmanagement2-prd", "node.name": "opensearch-hot2-os-1", "message": "[shardId [.ds-logs-log-default-000016][0]] [replication id 220112] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=0, FILE_DIFF=31, REPLICATING=0, GET_FILES=0}", "cluster.uuid": "ZVJvcMA3TmK00qqKawBTjg", "node.id": "SwXDwI_ZSQWvhLuUisEhWA" , 
"stacktrace": ["org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed",
"at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:602) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) [?:?]",
"at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) [opensearch-core-2.16.0.jar:2.16.0]",
"at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:404) [opensearch-security-2.16.0.0.jar:2.16.0.0]",
"at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1505) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:427) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$3(NativeMessageHandler.java:421) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:882) [opensearch-2.16.0.jar:2.16.0]",
"at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]",
"at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]",
"at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]",
"Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: java.nio.file.NoSuchFileException: /usr/share/opensearch/data/nodes/0/indices/bpdiNAx7SE65L0gmSvzrHg/0/index/_1a93.si",
"at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:326) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:194) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.16.0.jar:2.16.0]",
"... 20 more",
"Caused by: java.nio.file.NoSuchFileException: /usr/share/opensearch/data/nodes/0/indices/bpdiNAx7SE65L0gmSvzrHg/0/index/_1a93.si",
"at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
"at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
"at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
"at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224) ~[?:?]",
"at java.base/java.nio.channels.FileChannel.open(FileChannel.java:309) ~[?:?]",
"at java.base/java.nio.channels.FileChannel.open(FileChannel.java:369) ~[?:?]",
"at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:183) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.codecs.lucene99.Lucene99SegmentInfoFormat.read(Lucene99SegmentInfoFormat.java:94) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:411) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:368) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:315) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.opensearch.index.store.Store.buildSegmentInfos(Store.java:820) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:295) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:194) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.16.0.jar:2.16.0]",
"... 20 more",
"Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (d0caa28b). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(Snapshot of SegmentInfos))",
"\tat org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:501) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"\tat org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:375) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"\tat org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:315) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"\tat org.opensearch.index.store.Store.buildSegmentInfos(Store.java:820) ~[opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:295) ~[opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:194) ~[opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"\tat java.base/java.util.ArrayList.forEach(ArrayList.java:1596) [?:?]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) [opensearch-core-2.16.0.jar:2.16.0]",
"\tat org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:404) [opensearch-security-2.16.0.0.jar:2.16.0.0]",
"\tat org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1505) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:427) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$3(NativeMessageHandler.java:421) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:882) [opensearch-2.16.0.jar:2.16.0]",
"\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]",
"\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]",
"\tat java.base/java.lang.Thread.run(Thread.java:1583) [?:?]"] }

Related component

Indexing:Replication

To Reproduce

  1. Deploy an OpenSearch 2.16.0 cluster with 3 nodes.
  2. Enable segment replication on the cluster (cluster.indices.replication.strategy: 'SEGMENT'). I have also enabled replication back pressure (segrep.pressure.enabled).
  3. Create an ISM policy that rolls the index over once a primary shard reaches 50 GB, then runs a force_merge action down to 1 segment after rollover.
  4. Create a data stream index that the ISM policy applies to.

When the force_merge action runs, you get the logs above. Moreover, the force_merge action ends up timing out because the replica shard never reaches a single segment (I checked that with GET _cat/segments/.ds-logs-log-default-000016?v).
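
For reference, a minimal sketch of the cluster settings and an ISM policy matching the description above. The policy name, index pattern, and exact thresholds are illustrative assumptions, not copied from the actual cluster, and this assumes the settings are applied dynamically via the cluster settings API (they could equally be set in opensearch.yml) and that the ISM rollover condition min_primary_shard_size is available in this version:

PUT _cluster/settings
{
  "persistent": {
    "cluster.indices.replication.strategy": "SEGMENT",
    "segrep.pressure.enabled": true
  }
}

PUT _plugins/_ism/policies/rollover-force-merge
{
  "policy": {
    "description": "Rollover at ~50 GB per primary shard, then force merge to a single segment",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_primary_shard_size": "50gb" } }
        ],
        "transitions": [
          { "state_name": "merged" }
        ]
      },
      {
        "name": "merged",
        "actions": [
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["logs-log-*"], "priority": 100 }
    ]
  }
}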

Expected behavior

Segment replication to the replica shards keeps working when force_merge runs.

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

@disaster37 disaster37 added bug Something isn't working untriaged labels Oct 1, 2024
@github-actions github-actions bot added the Indexing:Replication Issues and PRs related to core replication framework eg segrep label Oct 1, 2024
@bharath-techie bharath-techie removed the Indexing:Replication Issues and PRs related to core replication framework eg segrep label Oct 14, 2024
@bharath-techie
Contributor

bharath-techie commented Oct 14, 2024

Moving to Storage:Remote

[Triage attendees - 1 2 3 4]

@gbbafna
Collaborator

gbbafna commented Oct 17, 2024

[Storage Triage] - Attendees 1 2 3 4

Thanks @disaster37 for filing this issue

  1. Are you observing this issue repeatedly?
  2. Can you please check your logs to verify whether segment replication was already failing before the force merge as well? You can use the CAT segment replication API to find the replication lag (see the example request below).
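
For example, the replication lag for the affected backing index can be checked with the CAT segment replication API (the index name here is just the one taken from the logs above):

GET _cat/segment_replication/.ds-logs-log-default-000016?v

This should show, per replica shard, how far it is behind the primary (checkpoints and bytes behind), which helps tell whether replication was already lagging before the force merge.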

@gbbafna gbbafna removed the untriaged label Oct 17, 2024