-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation Plan: Staging Elasticsearch Reindex DAGs #2358
Merged
Merged
Changes from all commits
Commits
Show all changes
19 commits
Select commit
Hold shift + click to select a range
ec33ee7
Fix broken link
krysal 6152d79
Create implementation plan draft from template
krysal 159732b
Update link
krysal 187ad53
Describe first DAG: `recreate_full_<media>_index`
krysal bd0bd7a
Describe second DAG: `create_proportional_by_provider_<media>_index`
krysal ab35e89
Fill the Alternatives section
krysal 52aafa9
Use absolute paths and fix links
krysal 6cb14e3
Generate DAG docs
krysal 7e986e0
Add `staging` to the name of DAGs and fix typo
krysal 557c983
Add steps for `<media_type>-full` alias creation
krysal e9ea24c
Change the approach to be DAG factories
krysal a6412e4
Rewrite 2nd DAG factory
krysal 6cbd28c
Fix typos
krysal 245ce95
Add reference to `indices.update_aliases` documentation
krysal 82c1b62
Add link to Airflow's Dynamic Task Mapping docs
krysal 3d3aa78
Add note of optionality of new aliases
krysal 98d9fe4
Add note on combining DAGs
krysal 5df190c
Change default source_index
krysal 75fb276
Add note on updating the filtered index and reviewers approval
krysal File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
252 changes: 252 additions & 0 deletions
252
...ancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,252 @@ | ||
# 2023-06-08 Implementation Plan: Staging Elasticsearch Reindex DAGs | ||
|
||
**Author**: @krysal | ||
|
||
## Reviewers | ||
|
||
- [x] @AetherUnbound | ||
- [x] @sarayourfriend | ||
|
||
## Project links | ||
|
||
- [Project Thread](https://github.com/WordPress/openverse/issues/392) | ||
- [Project Proposal](/projects/proposals/search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.md) | ||
|
||
## Overview | ||
|
||
This document describes the addition of two DAG factories per media type for | ||
Elasticsearch (ES) index creation ––full and proportional-by-provider–– which | ||
will allow us to decouple the process from the long Ingestion server's data | ||
refresh process and experiment with smaller indices. The DAG factories will | ||
create a DAG per media type supported, which currently consists `image` and | ||
`audio`, limited to staging environment at first. Also includes the adoption of | ||
two new index aliases for ease of handling and querying the new index types from | ||
the API with the [`internal__index`][api_ii_param] param. The use of these | ||
aliases is optional as the resulting indices can continue to be used directly. | ||
|
||
[api_ii_param]: https://github.com/WordPress/openverse/pull/2073 | ||
|
||
## Expected Outcomes | ||
|
||
- DAGs for full recreation of the index that the API uses which includes all the | ||
database contents, by media type. | ||
- DAGs to create indexes of reduced size, proportional-to-production-by-provider | ||
index, by media type. | ||
|
||
## Dependencies | ||
|
||
Same as for | ||
[Implementation Plan: Update Staging Database](/projects/proposals/search_relevancy_sandbox/20230406-implementation_plan_update_staging_database.md). | ||
|
||
This work is related to the [Staging database recreation | ||
DAG][staging_db_recreation] plan for having production volumes in the staging | ||
DB, which is expected to finish soon. | ||
|
||
[staging_db_recreation]: https://github.com/WordPress/openverse/issues/1989 | ||
|
||
## DAGs | ||
|
||
### `recreate_full_<media_type>_staging_index` DAG | ||
|
||
#### Parameters | ||
|
||
1. `point_alias`: (Optional) A boolean value to indicate if the resulting index | ||
will replace the current one pointed by the `<media_type>` alias. If `True`, | ||
then the new index will be the one used by the staging API. Defaults to | ||
`False`. Note: This is different from the "POINT_ALIAS" task of the ingestion | ||
server, as the action performed in this DAG does not always imply the | ||
deletion of the previous index and promotion. | ||
2. `delete_old_if_aliased`: (Optional) A boolean value to indicate if the old | ||
index pointed by the `<media_type>` alias should be deleted after | ||
replacement. Defaults to `False`. | ||
|
||
#### Outlined Steps | ||
|
||
This DAG will leverage the **Ingestion server's API** and use the existing | ||
[`REINDEX` task][reindex], which is the same used by the data refresh process to | ||
create the index. | ||
|
||
1. Get the current timestamp in order to create the index suffix with the form | ||
of `full-<timestamp>`. | ||
2. Use the `ingestion_server.trigger_and_wait_for_task()` utility to send the | ||
`REINDEX` call to ingestion server, passing the previously generated | ||
`index_suffix` in the data payload. | ||
3. Once the index is created, make the alias `<media_type>-full` point to it. | ||
|
||
1. Check if the alias exists. Use the | ||
[`ElasticsearchPythonHook`][es_python_hook] with the | ||
[indices.exists_alias][es_py_exists_alias] function. | ||
|
||
2. If the alias doesn't exist, then it can be created and assigned in one | ||
step using the [`indices.put_alias`][es_py_put_alias] function. | ||
|
||
3. If the alias exists, send the request to add the new index and remove the | ||
old one(s). Get the current index pointed by `<media_type>-full` alias. | ||
The [indices.resolves_index][es_py_resolves_index] function can provide | ||
this information. An alias can be related to multiple indexes but it will | ||
most likely be only one in this case. Then use | ||
[`indices.update_aliases`][es_py_update_aliases] with a body including | ||
both actions, analogous to the [ingest server's task][ing_point_alias]. | ||
|
||
4. If `point_alias=True` is passed, then immediately make the `<media_type>` | ||
alias point to the new index, detaching any other following the same | ||
procedure as indicated above. If `False` then the DAG ends at the previous | ||
step. | ||
5. If the index is aliased then the DAG checks if `delete_old_if_aliased=True` | ||
and proceeds to run [`indices.delete`][es_py_delete] with the old index name. | ||
Otherwise, the DAG ends at the previous step. | ||
|
||
[reindex]: | ||
https://github.com/WordPress/openverse/blob/7427bbd4a8178d05a27e6fef07d70905ec7ef16b/ingestion_server/ingestion_server/indexer.py#L282 | ||
[resolve]: | ||
https://www.elastic.co/guide/en/elasticsearch/reference/7.12/indices-resolve-index-api.html | ||
[es_python_hook]: | ||
https://airflow.apache.org/docs/apache-airflow-providers-elasticsearch/stable/_api/airflow/providers/elasticsearch/hooks/elasticsearch/index.html#airflow.providers.elasticsearch.hooks.elasticsearch.ElasticsearchPythonHook | ||
Comment on lines
+99
to
+104
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This notation is great! I'll definitely be using it in the future 😄 |
||
[es_py_exists_alias]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html#elasticsearch.client.IndicesClient.exists_alias | ||
[es_py_put_alias]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html#elasticsearch.client.IndicesClient.put_alias | ||
[es_py_resolves_index]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html?#elasticsearch.client.IndicesClient.resolve_index | ||
[es_py_update_aliases]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html#elasticsearch.client.IndicesClient.update_aliases | ||
[ing_point_alias]: | ||
https://github.com/WordPress/openverse/blob/08bb0317e1110694ca4d51058bebbc1dafb4fc13/ingestion_server/ingestion_server/indexer.py#L340 | ||
[es_py_delete]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html?#elasticsearch.client.IndicesClient.delete | ||
|
||
<!---------------------------------------------------------------------------> | ||
|
||
### `create_proportional_by_provider_<media_type>_staging_index` DAG | ||
|
||
This DAG is intended to be used most likely with the index resulting from the | ||
previous DAG or from the data refresh process, that is, an index with the | ||
database fully indexed, as the `source_index` for the [ES | ||
Reindex][es_reindex_api] API, although the default is set to the filtered | ||
version for maintainers safety. | ||
|
||
[es_reindex_api]: | ||
https://www.elastic.co/guide/en/elasticsearch/reference/7.12/docs-reindex.html | ||
|
||
#### Parameters | ||
|
||
1. `source_index`: (Optional) The existing index on Elasticsearch to use as the | ||
basis for the new index. If not provided, the index aliased to | ||
`<media_type>-filtered` will be used. | ||
krysal marked this conversation as resolved.
Show resolved
Hide resolved
|
||
2. `percentage_of_prod`: A float indicating the proportion of items to take from | ||
each provider from the total amount existing in production. E.g. `0.25` for a | ||
quarter of the production documents. | ||
|
||
> **Note**: Until defining an automated mechanism for creating the filtered | ||
> index on staging, it will need to be recreated often manually or default the | ||
> `source_index` to the `<media_type>-full` alias. | ||
|
||
#### Outlined Steps | ||
|
||
1. Get the list of media count by sources from the production Openverse API | ||
`https://api.openverse.engineering/v1/<media_type>/stats/` | ||
2. Calculate the `total_media` adding up all the counts by provider | ||
3. Build the `dest_index` name with the following format: | ||
|
||
```python | ||
f"{media_type}-{percentage_of_prod}-percent-of-providers-{current_datetime}" | ||
``` | ||
|
||
4. Make a a list of dictionaries mapping all the providers with the required | ||
amount of items for the new index based on the provided `percentage_of_prod` | ||
param. | ||
5. Using the [Dynamic Task Mapping][airflow_dtm] feature of Airflow, expand the | ||
krysal marked this conversation as resolved.
Show resolved
Hide resolved
|
||
providers list to dispatch [`reindex`][es_py_reindex] tasks with the | ||
following params setting to index the subset of each provider in parallel. | ||
|
||
- `wait_for_completion=False` | ||
- `max_docs=docs_num` | ||
- `dest=dest_index` | ||
- ```python | ||
source={ | ||
"index": source_index, | ||
"query": { | ||
"term": { | ||
"source.keyword": provider | ||
} | ||
} | ||
} | ||
``` | ||
|
||
6. Avoiding to `wait_for_completion` will make the previous step return records | ||
of these tasks as documents in an aggregated form. Then here use Sensors and | ||
the previously emitted task IDs to wait for reindex tasks to complete. | ||
7. Once all tasks are finished, trigger an [`indices.refresh`][es_py_refresh] to | ||
make the index queyrable. | ||
8. Make the alias `<media_type>-subset-by-provider` point to the new index. | ||
Follow the same procedure to that of `<media_type>-full` alias of the | ||
previous DAG. | ||
9. Optionally. Query the [stats][es_py_stats] of the resulting index and print | ||
the results. | ||
|
||
[airflow_dtm]: | ||
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html | ||
[es_py_reindex]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html#elasticsearch.Elasticsearch.reindex | ||
[es_py_refresh]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html#elasticsearch.client.IndicesClient.refresh | ||
[es_py_stats]: | ||
https://elasticsearch-py.readthedocs.io/en/v8.8.0/api.html#elasticsearch.client.IndicesClient.stats | ||
|
||
## Alternatives | ||
|
||
### Combining both DAGs into one | ||
|
||
One alternative to creating two different indices by separate is to create the | ||
proportional by provider index using the Ingestion server. This would require | ||
modifying the REINDEX task of the ingestion server or creating a new one that | ||
takes only a subset of the providers by the indicated proportion. | ||
|
||
However, I discarded this option in favor of the one explained above because | ||
having both DAGs is much simpler and provides more possibilities for the | ||
creation of different indexes, which is the end goal of the project. | ||
zackkrida marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
### Combining DAGs for custom index creation into one | ||
|
||
During the discussion of this plan the idea of potencially combining the | ||
`create_proportional_by_provider_<media_type>_staging_index` DAG or both DAGs | ||
defined here with the one described in the plan to | ||
[create indexes with custom configurations](/projects/proposals/search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md). | ||
However, in principle, it is proposed the DAGs defined here only apply to the | ||
staging environment. In addition, following a divide and conquer strategy, we | ||
opted for the ease of developing them separately and combining them later to | ||
simplify once they are all implemented and tested. | ||
|
||
## Parallelizable streams | ||
|
||
Both DAGs can be developed in parallel. | ||
|
||
## Blockers | ||
|
||
There is nothing currently blocking the implementation of this proposal. | ||
|
||
<!-- | ||
## Accessibility | ||
|
||
Are there specific accessibility concerns relevant to this plan? Do you expect new UI elements that would need particular care to ensure they're implemented in an accessible way? Consider also low-spec device and slow internet accessibility, if relevant. --> | ||
|
||
## Rollback | ||
|
||
<!-- How do we roll back this solution in the event of failure? Are there any steps that can not easily be rolled back? --> | ||
|
||
We can discard the DAGs if the results are not as expected. | ||
|
||
## Risks | ||
|
||
<!-- What risks are we taking with this solution? Are there risks that once taken can’t be undone?--> | ||
|
||
Elasticsearch does not impose any limit on the amount of indexes one can create | ||
but naturally they come with a cost. We don't have policies for creating or | ||
deleting indexes for the time being so we should monitor if we reach a point | ||
where having many indexes impact the cluster performance. | ||
|
||
## Prior art | ||
|
||
<!-- Include links to documents and resources that you used when coming up with your solution. Credit people who have contributed to the solution that you wish to acknowledge. --> | ||
|
||
- [Script which added 100 records per provider into the testing ECS database](https://github.com/WordPress/openverse-infrastructure/pull/314) |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case we have a previous index pointing to
<media_type>-full
that isn't aliased to<media_type>
as well, should we also delete the old index once the new index is pointed to<media_type>-full
? (Since the old index will essentially be dangling)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of keeping all the indexes (unless specified otherwise) for benchmarking and testing purposes and deleting them manually. They could be switched manually in the API as well, but maybe I'm wrong in this assumption and we want to avoid dangling indexes as much as possible. What do you think? @AetherUnbound @sarayourfriend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, is there basically only every a single
-full
or-subset-by-provider
index? If so, would it make sense to establish at least a baseline expectation that you clear with the team that pre-existing indexes with those aliases are not still being used? Considering the time it takes to create a full index, I would be pretty frustrated if I was using a full index to test something over the course of multiple days but missed any brief window to continue that testing due to AFK or (more likely) being part-time and not responding fast enough to messages.A better approach than automatically deleting the indexes might be to just add an MSR task to check once a week/fortnight/month (whatever cadence we want) whether indexes that do not have the aliases are still being used. That way we don't accidentally disrupt anyone's work and just need to give a message somewhere about a heads up that the alias is going to change to a new index.
The downside of using the alias at all is that it would be easy to miss a message saying the alias changed and spend a long time confused why some test you were running (or something that worked before on previous index settings) suddenly does not work because the new index the alias points to is different in a significant way.
Generally I see the utility in the alias, but process wise, for a distributed and asynchronous team that sometimes moves quickly and other times very slowly on things, I worry they'll introduce more process complicates that wouldn't necessarily compensate for the small quality of life improvement that comes with not having to copy/paste the timestamp of the index you want into your test queries.
To be honest, using the explicit index name is going to be more reliable anyway, especially if indexes aren't going to be automatically deleted out from under you while you're testing something 😅.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In its current form, the proposal allows for multiple -full or -subset-by-provider indexes (btw, good name for the last one! I'll change it to that), unless we keep telling the DAGs to delete the old ones. But because of the arguments you mention, these options are set to False by default.
I totally agree. My idea of including the aliases is to give the option to avoid typing the time stamp (and probably keep targeting the most updated index) when you don't care what specific index is under the hook, besides the characteristics that the alias itself represents. Their use is totally optional. Should I highlight this more clearly in the proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Just a heads up (I'm not entirely sure if this is the situation you're talking about), but if there are multiple indexes under the same alias, then ES will query all those indexes with the alias, not just one. If
logs-2023
andlogs-2022
both have alogs
alias applied to them, when querying withlogs
as the index name, both indexes will be queried and results from both will be included (not sure how scores are compared though). Not sure if that's the situation you were talking about by multiple indexes with the suffix aliases, or if you were talking about the aliases at all 😅There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarayourfriend No, I was referring there can exist many indexes of each type, created as a 'full' or 'subset-by-provider' index. Not that the alias will point to multiple indexes, which I know ES allows but it's not the goal here. The approach is that the new aliases would act similarly to the media aliases,
image
oraudio
. I hope this clarifies it!