[BUG] FailEngine does not trigger ShardFailure #12006

Open
jrj0823 opened this issue Jan 24, 2024 · 1 comment · May be fixed by #12007
Labels
bug Something isn't working Cluster Manager

Comments

jrj0823 commented Jan 24, 2024

Describe the bug

In our production environment, we found that our cluster cannot recover from a disk failure.
Although the cluster status is green, bulk requests to some shards fail with AlreadyClosedException.
[Screenshot: alreadyclosed]

It seems that the InternalEngine is closed, but the ShardFailure is never sent to the master.
The root cause is that when failEngine() is called because of a FlushFailure, markStoreCorrupted() inside failEngine() throws a java.nio.file.DirectoryIteratorException. This exception is not caught, because DirectoryIteratorException is not an IOException; it is an unchecked exception that wraps one.
[Screenshot: failshard2]

The DirectoryIteratorException propagates out of failEngine(), so eventListener.onFailedEngine(reason, failure); is never executed.

As a result, the index cannot process bulk requests, and the master does not reroute the shard because the ShardFailure is never reported.
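The control flow described above can be sketched in isolation. The following is a minimal, self-contained illustration and not OpenSearch code: the class FailEngineDemo and its methods are hypothetical stand-ins for failEngine(), markStoreCorrupted() and the onFailedEngine listener call, and the catch (IOException) block is assumed from the description. It only shows that a DirectoryIteratorException bypasses an IOException handler and skips whatever follows it.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the failEngine() control flow; not OpenSearch code.
class FailEngineDemo {
    // Records which steps ran, so a skipped listener call is visible.
    static final List<String> events = new ArrayList<>();

    // Hypothetical stand-in for markStoreCorrupted() on a failing disk.
    static void markStoreCorrupted() throws IOException {
        throw new DirectoryIteratorException(new IOException("simulated disk failure"));
    }

    // Hypothetical stand-in for eventListener.onFailedEngine(reason, failure).
    static void onFailedEngine() {
        events.add("onFailedEngine");
    }

    static void failEngine() {
        try {
            markStoreCorrupted();
        } catch (IOException e) {
            // Not reached: DirectoryIteratorException is a RuntimeException,
            // not a subclass of IOException.
            events.add("caught IOException");
        }
        // Skipped when the exception above escapes the try block.
        onFailedEngine();
    }

    // Runs the sketch and returns the recorded events.
    static List<String> run() {
        events.clear();
        try {
            failEngine();
        } catch (RuntimeException e) {
            events.add("escaped: " + e.getClass().getSimpleName());
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [escaped: DirectoryIteratorException]
    }
}
```

The recorded events contain neither "caught IOException" nor "onFailedEngine", which mirrors the symptom reported here: the engine is closed but the failure is never reported.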

Related component

Cluster Manager

To Reproduce

  1. Mock markStoreCorrupted() so that it throws a DirectoryIteratorException when it is called from failEngine().

Expected behavior

The ShardFailure should still be sent to the master when the filesystem throws a DirectoryIteratorException during failEngine().
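One way to get the expected behavior is to make sure the listener call cannot be skipped by an unchecked filesystem exception. The sketch below is only an illustration under that assumption; it is not taken from the linked pull request, and FailEngineGuardDemo and its methods are hypothetical stand-ins. It handles DirectoryIteratorException alongside IOException so that the failure is still reported.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a more defensive failEngine(); not OpenSearch code.
class FailEngineGuardDemo {
    // Records which steps ran.
    static final List<String> events = new ArrayList<>();

    // Hypothetical stand-in for markStoreCorrupted() on a failing disk.
    static void markStoreCorrupted() throws IOException {
        throw new DirectoryIteratorException(new IOException("simulated disk failure"));
    }

    static void failEngine() {
        try {
            markStoreCorrupted();
        } catch (IOException | DirectoryIteratorException e) {
            // Marking the store is best effort; record the problem and continue.
            events.add("store marking failed: " + e.getClass().getSimpleName());
        }
        // Hypothetical stand-in for eventListener.onFailedEngine(reason, failure).
        events.add("onFailedEngine");
    }

    // Runs the sketch and returns the recorded events.
    static List<String> run() {
        events.clear();
        failEngine();
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        // prints [store marking failed: DirectoryIteratorException, onFailedEngine]
        System.out.println(run());
    }
}
```

Catching the specific exception keeps the handler narrow. A broader alternative would be to move the listener call into a finally block, at the cost of also running it after errors that were never anticipated.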


@jrj0823 jrj0823 added bug Something isn't working untriaged labels Jan 24, 2024
@jrj0823 jrj0823 linked a pull request Jan 24, 2024 that will close this issue
@shwetathareja (Member)

Thanks @jrj0823 for raising the issue and the pull request. We will get back to you on it.
