Feature proposal: Poll the tasks api for migrations which call async reindexing endpoints #342

a-del-devision · 2024-08-22T15:55:27Z

Reasoning

The endpoints in the reindexing module (_reindex, _update_by_query, _delete_by_query) can take minutes to return a response, depending on the request. This can cause the RestClient to hit its timeout if the configured socket timeout is too restrictive. However, in some environments configuring the client's socket timeout is not enough. For example, the AWS OpenSearch Service sits behind a load balancer whose timeout is not configurable: https://repost.aws/knowledge-center/opensearch-http-504-gateway-timeout

This can lead to the following case: a migration sends an expensive _update_by_query request, the load balancer hits its timeout before a response is produced, the client receives a 504 response, and the migration is marked as failed. On subsequent runs, ElasticSearch Evolution sends the request again. That request can now also fail due to conflicts, because the initial request is still being processed, which in turn fails the migration again. As a result, the migration can never be executed if the request can't be processed fast enough.

This can be somewhat mitigated by adding the wait_for_completion=false parameter to the HTTP request, which causes an immediate response containing a task id while the request is processed asynchronously. However, this does not remove the possibility of conflicts when subsequent migrations update documents in the same index before the asynchronous request has finished processing.

A solution here would be to poll the _tasks endpoint for the given task until it completes successfully, or fail the migration if it completes with errors. The proposal is to add the possibility for ElasticSearch Evolution to do this automatically.

Proposal

If enabled, ElasticSearch Evolution will check each migration it executes: if the defined request targets any of the reindexing module endpoints (_update_by_query, _delete_by_query, _reindex) and the wait_for_completion=false parameter is present, it will take the task id from the response and poll the _tasks/<task_id> endpoint at a configurable poll interval. Polling continues until the endpoint returns a response indicating the task has completed, or until a configurable timeout is reached. This could be configured with the following configuration properties, for example:

spring.elasticsearch.evolution.await-task-completion which is false by default and can be used to enable this feature.
spring.elasticsearch.evolution.task-poll-interval which defines the poll interval in ms when polling the _tasks endpoint.
spring.elasticsearch.evolution.task-timeout which defines the timeout period in ms after which polling of the _tasks will stop and the migration will be considered failed.
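
To illustrate how the poll interval and the timeout could interact, here is a minimal sketch of the polling loop. The TaskPoller class and its awaitCompletion method are hypothetical names, not existing ElasticSearch Evolution code, and the actual call to _tasks/<task_id> is abstracted behind a BooleanSupplier so the example stays self-contained.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of ElasticSearch Evolution. */
final class TaskPoller {

    private TaskPoller() {
    }

    /**
     * Calls isCompleted repeatedly until it returns true or the timeout elapses.
     *
     * @param isCompleted  stands in for "GET _tasks/<task_id> and read the completed field"
     * @param pollInterval would map to the proposed task-poll-interval property
     * @param timeout      would map to the proposed task-timeout property
     * @return true if the task completed in time, false if the timeout was reached
     */
    static boolean awaitCompletion(BooleanSupplier isCompleted,
                                   Duration pollInterval,
                                   Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (isCompleted.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMillis <= 0) {
                return false; // timeout reached: the migration would be marked as failed
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for task", e);
            }
        }
    }
}
```

The first check happens immediately, so a task that has already finished does not cost a full poll interval.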

The _tasks endpoint does not provide explicit information whether a task has completed successfully or not, so the following logic can be used:

the completed field must be true; if it is not, wait until the next poll interval
the error field must not exist (it is added to the response if, for example, a painless script was provided which does not compile); if it exists, mark the migration as failed
the response.failures field must be empty (it could contain, for example, conflict exceptions); if it is not empty, mark the migration as failed
if the response satisfies all three conditions above, mark the migration as successful
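
The checks above could be sketched roughly as follows. This assumes the JSON body of the _tasks/<task_id> response has already been parsed into nested Map and List objects. The TaskOutcome class and its Status enum are hypothetical names used only for this illustration.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical evaluation of a parsed _tasks/<task_id> response. */
final class TaskOutcome {

    enum Status { RUNNING, FAILED, SUCCEEDED }

    private TaskOutcome() {
    }

    static Status evaluate(Map<String, ?> taskResponse) {
        // 1. Not completed yet: keep polling.
        if (!Boolean.TRUE.equals(taskResponse.get("completed"))) {
            return Status.RUNNING;
        }
        // 2. A top-level error field means the task itself failed.
        if (taskResponse.containsKey("error")) {
            return Status.FAILED;
        }
        // 3. A non-empty response.failures list means some documents failed.
        Object response = taskResponse.get("response");
        if (response instanceof Map) {
            Object failures = ((Map<?, ?>) response).get("failures");
            if (failures instanceof List && !((List<?>) failures).isEmpty()) {
                return Status.FAILED;
            }
        }
        return Status.SUCCEEDED;
    }
}
```

A missing response.failures field is treated as "no failures" here, which is an assumption of this sketch rather than something the proposal specifies.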

This is also the reason why this feature is limited to the reindexing endpoints - the structure of the response field depends on what the task action is, so not all endpoints which support the wait_for_completion parameter will produce tasks with the same structure. Some endpoints don't even return a task id when the wait_for_completion parameter is given. For example, the _tasks API itself waits for the matching tasks to complete before returning a response if this parameter is set to true.

Some reindexing migrations already exist and have been applied to some environments, but will be applied to other environments in the future. To support this functionality for them, an additional configuration property can be added (spring.elasticsearch.evolution.use-tasks-by-default, false by default). When enabled, ElasticSearch Evolution automatically adds the wait_for_completion=false parameter to reindexing migrations when executing them, but only if the parameter is not already set explicitly; an explicitly set parameter is left untouched, whether its value is true or false. This removes the need to manually update the valid checksums in the history index if someone wants to use this feature for already existing migrations.
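
As a rough sketch of this defaulting behavior, the request path could be rewritten along these lines. ReindexRequestRewriter and withAsyncDefault are hypothetical names, and matching on the last path segment is a simplification chosen for this example.

```java
import java.util.Set;

/** Hypothetical rewriter; adds wait_for_completion=false only when it is absent. */
final class ReindexRequestRewriter {

    private static final Set<String> REINDEX_ENDPOINTS =
            Set.of("_reindex", "_update_by_query", "_delete_by_query");

    private ReindexRequestRewriter() {
    }

    static String withAsyncDefault(String pathAndQuery) {
        int q = pathAndQuery.indexOf('?');
        String path = q < 0 ? pathAndQuery : pathAndQuery.substring(0, q);
        String query = q < 0 ? "" : pathAndQuery.substring(q + 1);

        // Only the reindexing module endpoints are affected.
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        if (!REINDEX_ENDPOINTS.contains(lastSegment)) {
            return pathAndQuery;
        }
        // An explicitly set parameter wins, whatever its value.
        for (String param : query.split("&")) {
            int eq = param.indexOf('=');
            String name = eq < 0 ? param : param.substring(0, eq);
            if (name.equals("wait_for_completion")) {
                return pathAndQuery;
            }
        }
        String prefix = query.isEmpty() ? "" : query + "&";
        return path + "?" + prefix + "wait_for_completion=false";
    }
}
```

Because only the request sent at execution time changes, the migration file itself stays untouched and its checksum stays the same.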

If you guys accept the general idea and the implementation proposal, I would be happy to provide a PR for this :)
