Feature proposal: Poll the tasks api for migrations which call async reindexing endpoints #342

Open
a-del-devision opened this issue Aug 22, 2024 · 0 comments

Reasoning

The endpoints in the reindexing module (_reindex, _update_by_query, _delete_by_query) can take a long time (minutes) to return a response, depending on the request. This can lead to timeouts being hit by the RestClient if the configured socket timeout is too restrictive. However, on some environments configuring the socket timeout for the client is not enough. For example, the AWS OpenSearch Service sits behind a load balancer with a timeout which is not configurable: https://repost.aws/knowledge-center/opensearch-http-504-gateway-timeout

This can lead to the case where a migration sends an expensive _update_by_query request which does not yield a response before the load balancer hits its timeout, so the client receives a 504 response and marks the migration as failed. On subsequent runs, ElasticSearch Evolution tries to send the request again, which can now also fail due to conflicts because processing of the initial request has not yet completed, which in turn fails the migration again. This way the migration ends up not being executable if the request can't be processed fast enough.

This can be somewhat mitigated by adding the wait_for_completion=false parameter to the HTTP request, which causes an immediate response with a task id while the request is processed asynchronously. However, this does not remove the possibility of running into conflicts when subsequent migrations also update documents in the same index before the asynchronous request has finished processing.

A solution here would be to poll the _tasks endpoint for the given task until it completes successfully, or fail the migration if it completes with errors. The proposal is to add the possibility for ElasticSearch Evolution to do this automatically.

Proposal

If enabled, when executing migrations, if the defined request targets any of the reindexing module endpoints (_update_by_query, _delete_by_query, _reindex) and the wait_for_completion=false parameter is present, ElasticSearch Evolution will start polling the _tasks/<task_id> endpoint for the returned task id at a configurable poll interval, until the endpoint returns a response indicating the task has completed or until a configurable timeout is reached. This could be configured with, for example, the following configuration properties:

  • spring.elasticsearch.evolution.await-task-completion which is false by default and can be used to enable this feature.
  • spring.elasticsearch.evolution.task-poll-interval which defines the poll interval in ms when polling the _tasks endpoint.
  • spring.elasticsearch.evolution.task-timeout which defines the timeout period in ms after which polling of the _tasks will stop and the migration will be considered failed.
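
To make the intended behaviour concrete, here is a minimal sketch of such a polling loop. It is not taken from the ElasticSearch Evolution code base: the class name TaskPoller, the TaskState enum and the pollOnce supplier (which would wrap a GET on _tasks/<task_id>) are hypothetical names used only for illustration. The pollInterval and timeout arguments correspond to the task-poll-interval and task-timeout properties proposed above.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Illustrative sketch only; all names here are hypothetical. */
public class TaskPoller {

    /** Outcome of a single poll, or of the whole wait. */
    public enum TaskState { RUNNING, SUCCEEDED, FAILED, TIMED_OUT }

    /**
     * Calls pollOnce repeatedly until it reports something other than RUNNING,
     * or until the timeout has elapsed. A TIMED_OUT result is meant to be
     * treated by the caller as a failed migration.
     */
    public static TaskState awaitTask(Supplier<TaskState> pollOnce,
                                      Duration pollInterval,
                                      Duration timeout) {
        long start = System.nanoTime();
        while (true) {
            TaskState state = pollOnce.get();
            if (state != TaskState.RUNNING) {
                return state; // task finished, successfully or not
            }
            if (System.nanoTime() - start >= timeout.toNanos()) {
                return TaskState.TIMED_OUT; // stop polling, give up
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and abort the wait.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling task", e);
            }
        }
    }
}
```

The supplier keeps the loop independent of the HTTP client, so the same loop could be exercised in tests without a running cluster.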

The _tasks endpoint does not provide explicit information whether a task has completed successfully or not, so the following logic can be used:

  • if the completed field is not true, the task is still running, so wait until the next poll interval
  • if the error field exists (it is added to the response if, for example, a painless script was provided which does not compile), mark the migration as failed
  • if the response.failures field is not empty (it could contain, for example, conflict exceptions), mark the migration as failed
  • otherwise (completed is true, there is no error field, and response.failures is empty), mark the migration as successful
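
The checks above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed final implementation: it assumes the _tasks/<task_id> response body has already been parsed into nested java.util.Map and java.util.List values by whatever JSON library is in use, and the names TaskResultEvaluator, Outcome and evaluate are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical. */
public class TaskResultEvaluator {

    public enum Outcome { RUNNING, SUCCEEDED, FAILED }

    /**
     * Applies the checks from the list above to a parsed _tasks/<task_id>
     * response. The map is assumed to mirror the JSON body one-to-one.
     */
    public static Outcome evaluate(Map<String, Object> taskResponse) {
        // 1. Not completed yet: keep polling.
        if (!Boolean.TRUE.equals(taskResponse.get("completed"))) {
            return Outcome.RUNNING;
        }
        // 2. A top-level "error" field means the task itself failed.
        if (taskResponse.containsKey("error")) {
            return Outcome.FAILED;
        }
        // 3. A non-empty "response.failures" list (e.g. conflicts) is a failure.
        Object response = taskResponse.get("response");
        if (response instanceof Map) {
            Object failures = ((Map<?, ?>) response).get("failures");
            if (failures instanceof List && !((List<?>) failures).isEmpty()) {
                return Outcome.FAILED;
            }
        }
        // 4. Completed, no error, no failures.
        return Outcome.SUCCEEDED;
    }
}
```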

This is also the reason why this feature is limited to the reindexing endpoints - the structure of the response field depends on what the task action is, so not all endpoints which support the wait_for_completion parameter will produce tasks with the same structure. Some endpoints don't even return a task id if the wait_for_completion parameter is given, for example the _tasks api waits for the matching tasks to complete before returning a response if this parameter is set to true.

In order to support this functionality for reindexing migrations which already exist and have been applied to some environments, but will be applied to other environments in the future, an additional configuration property can be added (spring.elasticsearch.evolution.use-tasks-by-default, false by default). It would make ElasticSearch Evolution automatically add the wait_for_completion=false parameter to reindexing migrations when executing them, but only if the parameter is not explicitly set in the migration; an explicitly set parameter is left untouched, regardless of whether its value is true or false. This removes the need to manually update the valid checksums in the history index if someone wants to use this feature for already existing migrations.
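
As a rough sketch of the defaulting rule described above: the helper below appends wait_for_completion=false only when the request targets one of the three reindexing endpoints and the parameter is not already present. The class and method names are hypothetical, and the sketch assumes the migration's request target is available as a plain path-plus-query string.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical. */
public class WaitForCompletionDefaults {

    private static final List<String> REINDEX_ENDPOINTS =
            Arrays.asList("_reindex", "_update_by_query", "_delete_by_query");

    /** True if the last path segment is one of the reindexing endpoints. */
    static boolean targetsReindexingEndpoint(String path) {
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        return REINDEX_ENDPOINTS.contains(lastSegment);
    }

    /**
     * Returns the request target with wait_for_completion=false appended,
     * unless the endpoint is not a reindexing endpoint or the parameter is
     * already set explicitly (with any value).
     */
    public static String withAsyncDefault(String pathAndQuery) {
        int q = pathAndQuery.indexOf('?');
        String path = q < 0 ? pathAndQuery : pathAndQuery.substring(0, q);
        if (!targetsReindexingEndpoint(path)) {
            return pathAndQuery;
        }
        if (q < 0) {
            return pathAndQuery + "?wait_for_completion=false";
        }
        String query = pathAndQuery.substring(q + 1);
        for (String param : query.split("&")) {
            String name = param.split("=", 2)[0];
            if (name.equals("wait_for_completion")) {
                return pathAndQuery; // explicitly set: leave untouched
            }
        }
        return pathAndQuery + (query.isEmpty() ? "" : "&") + "wait_for_completion=false";
    }
}
```

Because the parameter is added at execution time rather than in the migration file, the file's content (and therefore its checksum) stays the same.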

If you guys accept the general idea and the implementation proposal, I would be happy to provide a PR for this :)
