Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ADR for incremental algolia indexing #773

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

johnnagro
Copy link
Contributor

Description

A proposed ADR for incremental indexing of content metadata and catalogs into Algolia.

First, the existing indexing process begins with executing catalog queries against `search/all` to determine which
courses exist and belong to which catalogs. In order for incremental updates to work we first need to provide the
opposite semantic and instead be able to determine catalog membership from a given course (rather than courses from a
given catalog). We can make use of the new `apps.catalog.filters` python implementation which can take a catalog query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] How will this process be aware of courses that exist upstream in course-discovery, but don't yet exist in enterprise-catalog (e.g., a newly added course). Based on this, it would seem enterprise-catalog would need to be aware of all course metadata from course-discovery, even if it's not tied to any enterprise catalogs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realizing that this is a comment on an older iteration of this ADR however, to answer your question the catalog service would be aware of new content by both hooking up to post save signals and querying all content. For each piece of content that it has been notified about, the service will do it's own query association filtering logic post content ingestion.

Copy link
Contributor

@alex-sheehan-edx alex-sheehan-edx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 15, 2024
@alex-sheehan-edx alex-sheehan-edx force-pushed the johnnagro/ENT-8139/0 branch 3 times, most recently from d1146b2 to 6535cbd Compare March 18, 2024 16:38
@alex-sheehan-edx alex-sheehan-edx self-assigned this Mar 18, 2024
@alex-sheehan-edx alex-sheehan-edx removed the open-source-contribution PR author is not from Axim or 2U label Mar 18, 2024
@alex-sheehan-edx
Copy link
Contributor

For clarity sake- @johnnagro started this work before he left 2U and I'm taking up the task of finishing and publishing

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 18, 2024
@openedx openedx deleted a comment from openedx-webhooks Mar 18, 2024
@alex-sheehan-edx alex-sheehan-edx removed the open-source-contribution PR author is not from Axim or 2U label Mar 18, 2024
@openedx openedx deleted a comment from openedx-webhooks Mar 18, 2024
@openedx-webhooks
Copy link

Thanks for the pull request, @johnnagro! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

⚠️ We can't start reviewing your pull request until you've submitted a signed contributor agreement or indicated your institutional affiliation. Please see the CONTRIBUTING file for more information. If you've signed an agreement in the past, you may need to re-sign. See The New Home of the Open edX Codebase for details.

Once you've signed the CLA, please allow 1 business day for it to be processed. After this time, you can re-run the CLA check by adding a comment here that you have signed it. If the problem persists, you can tag the @openedx/cla-problems team in a comment on your PR for further assistance.

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 18, 2024
@alex-sheehan-edx alex-sheehan-edx requested review from adamstankiewicz and a team March 18, 2024 16:42
Copy link
Contributor

@pwnage101 pwnage101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

halfway done

- Support all current metadata types but doesn’t need to support them all on day 1
- Support multiple methods of triggering: event bus, on-demand from django admin, on a schedule, from the existing
update_content_metadata job, etc.
- Invocation of the new indexing process should not be reliant on separate processes run synchronously before hand.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps s/synchronously/serially ?

Comment on lines +53 to +54
the contents metadata (modified_at) must be bumped from what's previously stored. Secondly, the content must have
associations with queries within the service.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the content must have associations with queries in order to kick of an indexing task, what happens when the content had associations before, but then those associations were removed? The end result is that there are no associations, but we still need to kick off an indexing task to de-index the content, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - we would want to kick off the process for a piece of content should it lose/gain any number of associated queries. We need to run the individual indexing task of a course IFF
1- The content metadata in any way changes
2- Any associations between a customer catalog is removed
3- Any associations between a customer catalog is created

We will need to make sure that it is done and evaluated once we go to index

Copy link
Contributor

@pwnage101 pwnage101 Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, somehow this should be represented in the ADR, since right now it calls for a pseudocode which only kicks off indexing for a course IFF the content has associations.

Comment on lines +63 to +67
Incremental updates, through the act of saving individual records, will need to be triggered by something - such as
polling of updated content from Course Discovery, consumption of event-bus events, and/or triggering based on a nightly
Course Discovery crawl or Django Admin button. However it is not the responsibility of the indexer, nor this ADR
to determine when those events should occur, and in fact the indexing process should be able to handle any source of
content metadata record updating processes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a previous paragraph, a solution utilizing a ContentMetadata post_save() hook to trigger a celery task was proposed. Is that a valid solution for triggering incremental index updates? If so, why is it not listed in this paragraph as a solution? Likewise, why aren't solutions in this paragraph listed in the above paragraph?

If the two paragraphs are duplicative, I recommend consolidating them into one.

Copy link
Contributor

@pwnage101 pwnage101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all done!

Comment on lines +54 to +56
contributing factors as to the long run time of the ``update_content_metadata`` task. Additionally, housing
our own filtering logic will allow us to maintain and tweak/improve upon the functionality should we want additional
features.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any concerns about local filtering logic in enterprise-catalog (apps.catalog.filters) diverging from how course-discovery does it? How do we keep two black boxes in sync? Do we even need to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that filters out the gate should match our discovery counter part, and that we would need rigorous tests to ensure our in house filters result in the same subsets of content. However from there, one of the benefits to this is that we get control of how the filters are administered and can change the behaviors to fit our needs and odd situations, no more need to go ask the phoenix team why there is one off odd behaviors, instead we can just adjust it ourselves ¯\_(ツ)_/¯

Comment on lines +72 to +75
Ideally this incremental process will allow us to provide a closer to real-time index using fewer resources. It will
also provide us with more flexibility about including non-course-discovery content in catalogs because we will
no-longer rely on a query to course-discovery's `search/all` endpoint and instead rely on the metadata records in the
catalog service, regardless of it's source.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also have to do something when catalog queries are created or edited, so that the search index is updated to reflect any catalog <-> metadata relationships that are created/updated due to those queries being changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an excellent point- in a similar vein to us having to tie a content record updating to catalog query contentmetadata_set's being updated, we'd probably need to tie a query being updated to a process of calculating it's contentmetadata_set and then kick off incremental indexing processes for all effected content records.

Comment on lines +34 to +36
``update_content_metadata`` tasks and can eventually replace old infrastructure. The first method will be a bulk
job similar to the current ``update_content_metadata`` task to query external sources of content and update any records
should they mismatch using `apps.catalog.filters` to determine the query-content association sets. And second, an event
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bulk job like this, though, means that you're going to run your filter functions in proportion to (|# of queries| x |# of content metadata records|). Which means you're going to run it 10s of millions of times if we have thousands of queries and 10,000s of content.

Comment on lines +37 to +39
signal receiver which will process any individual content update events that are received. The intention is for the
majority of updates in the catalog service to happen at the moment they are updated in their external source and the
signal is fired, only to be cleaned up and verified by the bulk job later on should something go wrong.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea, but keep in mind, to keep content/catalog relationships up-to-date, an update of a single metadata records means we'll have to run the filter logic against every catalog query, because a change to content metadata could mean that a query should now include the content, or that the query should now not include the content.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants