[BUG] FailEngine does not trigger ShardFailure #12006

jrj0823 · 2024-01-24T07:02:31Z

Describe the bug

In our production environment, we found that the cluster cannot recover from a disk failure.
Although the cluster is green, bulk requests on some shards fail with AlreadyClosedException.

It seems that the InternalEngine is closed, but no ShardFailure is sent to the master.
The root cause is that when failEngine() is called after a flush failure, markStoreCorrupted() inside failEngine() throws a java.nio.file.DirectoryIteratorException. The existing handler does not catch it, because DirectoryIteratorException is an unchecked exception, not an IOException.

The DirectoryIteratorException therefore propagates out of failEngine(), and eventListener.onFailedEngine(reason, failure); is never executed.

As a result, the index cannot process bulk requests, and the master does not reroute this shard because the shard failure is never reported.
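
The control flow described above can be illustrated with a small standalone sketch. It is not the OpenSearch source: the class FailEngineSketch, the FailureListener interface, and the simplified markStoreCorrupted() and failEngine() bodies are hypothetical stand-ins that only mirror the shape reported in this issue (a catch block for IOException followed by the listener call).

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;

public class FailEngineSketch {
    /** Hypothetical stand-in for the engine's failure listener. */
    public interface FailureListener {
        void onFailedEngine(String reason, Exception failure);
    }

    /** Hypothetical stand-in: simulates the filesystem error from the report. */
    static void markStoreCorrupted() throws IOException {
        throw new DirectoryIteratorException(new IOException("simulated disk failure"));
    }

    /** Mirrors the control flow described in the issue, not the real implementation. */
    public static void failEngine(String reason, Exception failure, FailureListener listener) {
        try {
            markStoreCorrupted();
        } catch (IOException e) {
            // Not reached here: DirectoryIteratorException is unchecked, not an IOException.
            System.err.println("could not mark store corrupted: " + e);
        }
        // Skipped whenever the exception above escapes the try block.
        listener.onFailedEngine(reason, failure);
    }

    /** Returns true if the listener was notified during failEngine(). */
    public static boolean listenerNotified() {
        boolean[] notified = {false};
        try {
            failEngine("flush failed", new IOException("flush failure"),
                    (reason, failure) -> notified[0] = true);
        } catch (DirectoryIteratorException e) {
            // The exception propagates to the caller of failEngine().
        }
        return notified[0];
    }

    public static void main(String[] args) {
        System.out.println("listener notified: " + listenerNotified()); // prints "listener notified: false"
    }
}
```

In this sketch the listener is never invoked, which corresponds to the shard failure never being reported.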

Related component

Cluster Manager

To Reproduce

Manually mock a DirectoryIteratorException being thrown when markStoreCorrupted() is called.

Expected behavior

A ShardFailure should be sent to the master even when the filesystem throws a DirectoryIteratorException during failEngine().
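
One possible way to obtain this behavior is sketched below, under stated assumptions. It is not necessarily how the linked pull request or OpenSearch itself handles it. The class FailEngineFixSketch, the FailureListener interface, and the simplified method bodies are hypothetical. The idea is to broaden the catch so unchecked filesystem exceptions are handled too, and to notify the listener in a finally block.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.util.ArrayList;
import java.util.List;

public class FailEngineFixSketch {
    /** Hypothetical stand-in for the engine's failure listener. */
    public interface FailureListener {
        void onFailedEngine(String reason, Exception failure);
    }

    /** Hypothetical stand-in: simulates the filesystem error from the report. */
    static void markStoreCorrupted() throws IOException {
        throw new DirectoryIteratorException(new IOException("simulated disk failure"));
    }

    /** Assumed fix shape: the listener is notified no matter what markStoreCorrupted() throws. */
    public static void failEngine(String reason, Exception failure, FailureListener listener) {
        try {
            markStoreCorrupted();
        } catch (Exception e) {
            // Broadened from IOException so unchecked exceptions are handled as well.
            // The secondary error is attached to the original failure for diagnosis.
            failure.addSuppressed(e);
        } finally {
            // Runs on every path, so the failure is always reported.
            listener.onFailedEngine(reason, failure);
        }
    }

    /** Collects the reasons passed to the listener during one failEngine() call. */
    public static List<String> notifiedReasons() {
        List<String> reasons = new ArrayList<>();
        failEngine("flush failed", new IOException("flush failure"),
                (reason, failure) -> reasons.add(reason));
        return reasons;
    }

    public static void main(String[] args) {
        System.out.println("notified reasons: " + notifiedReasons()); // prints "notified reasons: [flush failed]"
    }
}
```

Attaching the secondary exception as suppressed keeps the original flush failure as the primary cause while preserving the filesystem error.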

shwetathareja · 2024-01-24T08:44:29Z

Thanks @jrj0823 for raising the issue and the pull request. We will get back to you on it.

jrj0823 added the bug and untriaged labels Jan 24, 2024

github-actions bot added the Cluster Manager label Jan 24, 2024

jrj0823 linked a pull request Jan 24, 2024 that will close this issue: Fix missing shardFailure when filesystem throw exception #12007 (Open, 8 tasks)

shwetathareja removed the untriaged label Jan 24, 2024

rwali-aws added this to Cluster Manager Project Board Apr 20, 2024

github-project-automation bot moved this to 🆕 New in Cluster Manager Project Board Apr 20, 2024
