Commit

Fill the Alternatives section
krysal committed Jun 20, 2023
1 parent a65144a commit 54e14d4
Showing 1 changed file with 21 additions and 8 deletions.
@@ -18,7 +18,10 @@ This document describes the addition of two DAGs for Elasticsearch (ES) index
creation (full and proportional-by-provider), which will allow us to decouple
the process from the Ingestion server's long data refresh process and experiment
with smaller indices. It also includes the adoption of two new index aliases for
ease of handling and querying the new index types from the API with the
[`internal__index`][api_ii_param] parameter.

[api_ii_param]: https://github.com/WordPress/openverse/pull/2073

## Expected Outcomes

@@ -113,8 +116,11 @@ database fully indexed, as the `source_index` for the ES
5. Iterate over the items of the resulting dictionary to index the subset of
each provider.

```
POST _reindex?wait_for_completion=false
```

```json
{
"max_docs": num_items,
"source": {
@@ -132,17 +138,24 @@ POST _reindex?wait_for_completion=false
```

6. Make the alias `<media>-subset-by-provider` point to the new index.
7. Optionally, query the stats of the resulting index and print the results.

```
GET /image-reindexed-by-provider/_stats
```
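
Steps 5 to 7 could be sketched in Python roughly as follows. This is only an illustration, not the DAG's actual code: the helper names, the `provider` field used in the query, the source index name `image`, and the per-provider counts are all assumptions. The functions only build the request bodies for `POST _reindex` and `POST _aliases`; sending them is left to whatever Elasticsearch client or HTTP hook the DAG uses.

```python
from typing import Dict, List, Optional


def build_reindex_body(
    source_index: str, dest_index: str, provider: str, num_items: int
) -> dict:
    """Build a `POST _reindex` body copying at most `num_items` docs of one provider."""
    return {
        "max_docs": num_items,
        "source": {
            "index": source_index,
            # Assumption: documents have a keyword field named "provider".
            "query": {"term": {"provider": provider}},
        },
        "dest": {"index": dest_index},
    }


def build_alias_actions(
    alias: str, new_index: str, old_index: Optional[str] = None
) -> dict:
    """Build a `POST _aliases` body that points `alias` at `new_index`."""
    actions: List[dict] = []
    if old_index is not None:
        # Detach the alias from the index it previously pointed to.
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Hypothetical result of the earlier steps: provider -> number of items to index.
subset_sizes: Dict[str, int] = {"flickr": 500, "wikimedia": 200}

reindex_bodies = [
    build_reindex_body("image", "image-reindexed-by-provider", provider, num_items)
    for provider, num_items in subset_sizes.items()
]
alias_body = build_alias_actions(
    "image-subset-by-provider", "image-reindexed-by-provider"
)
```

Issuing one reindex request per provider keeps each `max_docs` cap independent, and the alias is switched only after all of them have been submitted.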

## Alternatives

<!-- Describe any alternatives considered and why they were not chosen or recommended. -->
### Combining both DAGs into one

One alternative to creating the two indices with separate DAGs is to create the
proportional-by-provider index using the Ingestion server. This would require
modifying the REINDEX task of the ingestion server, or creating a new one, so
that it takes only a subset of each provider's data in the indicated proportion.

However, I discarded this option in favor of the one explained above because
having both DAGs is much simpler and provides more possibilities for the
creation of different indexes, which is the end goal of the project.

## Parallelizable streams

@@ -161,16 +174,16 @@ There is nothing currently blocking the implementation of this proposal.

<!-- How do we roll back this solution in the event of failure? Are there any steps that can not easily be rolled back? -->

We can discard the DAGs if the results are not as expected.

## Risks

<!-- What risks are we taking with this solution? Are there risks that once taken can’t be undone?-->

Elasticsearch does not impose any limit on the number of indices one can create,
but naturally they come with a cost. We don't have policies for creating or
deleting indices for the time being, so we should monitor whether we reach a
point where having many indices impacts cluster performance.
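
One way to keep an eye on this is to summarize the output of `GET _cat/indices?format=json` from time to time. The sketch below is only an illustration: the `max_indices` threshold, the helper name, and the sample rows are hypothetical, and it assumes the response has already been parsed into a list of dictionaries (the `_cat` API returns numeric columns such as `pri` as strings).

```python
from typing import Dict, List


def summarize_indices(cat_indices: List[Dict[str, str]], max_indices: int) -> dict:
    """Summarize a parsed `GET _cat/indices?format=json` response."""
    # `pri` is the number of primary shards of each index, returned as a string.
    primary_shards = sum(int(row["pri"]) for row in cat_indices)
    return {
        "index_count": len(cat_indices),
        "primary_shards": primary_shards,
        # Hypothetical team-chosen threshold, not an Elasticsearch setting.
        "over_limit": len(cat_indices) > max_indices,
    }


# Hypothetical sample rows, trimmed to the columns used above.
sample = [
    {"index": "image-a", "pri": "18"},
    {"index": "image-b", "pri": "18"},
    {"index": "audio-a", "pri": "1"},
]
summary = summarize_indices(sample, max_indices=2)
```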

## Prior art

