Skip to content

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented Sep 30, 2025

Fixes #2867

Set to feature branch as base.

The backend implementation involves:

  • A new CollIndex CRD type
  • Operator that manages the new CRD type, creating a new Redis instance when the index should exist
  • Operator starts the crawler in 'indexer' mode (will be available from Deduplication (Initial Support). browsertrix-crawler#884)
  • Collection has a new hasDedupIndex field
  • Workflows have a new 'dedupCollId' field for dedup while crawling. The dedupCollId must also be a collection that the crawl is auto-added to.
  • There is a new waiting state: waiting_for_dedup_index that is entered if a crawl is starting, but index is not yet ready.

For indexing, dependent on version of crawler (1.9.0 beta 0 or higher) that supports indexing mode.

Testing

This is ready for initial frontend work and testing:

  • the dedupCollId can be set on the workflow to enable dedup for future crawls.
  • the collection has a hasDedupIndex to indicate if an index is enabled for it.
  • kubectl get pods -n crawlers collindex should work

@ikreymer ikreymer requested a review from tw4l September 30, 2025 03:18
Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A promising start!

I left comments/suggestions throughout where I noticed things. Should also add tests for setting dedupCollId for collection and crawlconfig add and update, I don't think it's quite right as-is (at least for collections).

# accessing directly to handle both crawls and uploads
crawl = await self.crawls.find_one({"_id": crawl_id})
crawl_coll_ids = crawl.get("collectionIds")
crawl_coll_ids = crawl.get("collectionIds") or []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
crawl_coll_ids = crawl.get("collectionIds") or []
crawl_coll_ids = crawl.get("collectionIds", [])

Just a bit more idiomatic

Comment on lines +595 to +637
if update.dedupCollId:
if (
not update.autoAddCollections
or update.dedupCollId not in update.autoAddCollections
):
raise HTTPException(
status_code=400, detail="dedup_coll_id_not_in_autoadd"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic needs to account for the possibility that the collection referenced by dedupCollId may already be an auto-add collection in the original crawl config, prior to the update, which would also be fine so long as it's not being unset in the update. Can look to seed file validation logic just above for an example, and since it's a bit tricky to get right we should have good tests around dedupCollId (setting in initial collection add, updating when auto-add collection is already set, updating both together).

}

def get_import_ts(self, spec: CollIndexSpec, status: CollIndexStatus):
"""returnt rue if a reimport is needed based on last import date"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""returnt rue if a reimport is needed based on last import date"""
"""return true if a reimport is needed based on last import date"""


state: TYPE_INDEX_STATES = "initing"

lastCollUpdated: str = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might suggest renaming to collLastUpdated, for a sec I thought we were checking other collections

return True

for index in data.related[COLLINDEX].values():
if index.get("status", {}).get("state") == "ready":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the opposite of what we want? I think if the coll index is ready, we want to return False (i.e. don't need to wait, it's ready), right?

ikreymer and others added 5 commits September 30, 2025 22:20
Co-authored-by: Tessa Walsh <[email protected]>
Co-authored-by: Tessa Walsh <[email protected]>
- add enable_dedup_index() to enable it if it doesn't exist from crawl workflows
- in collections, call create_coll_index() / delete_coll_index() when being explicitly add/removed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants