Dedup Backend Initial Implementation #2868

ikreymer · 2025-09-30T03:18:42Z

Fixes #2867

The base is set to the feature branch.

The backend implementation involves:

A new CollIndex CRD type
Operator that manages the new CRD type, creating a new Redis instance when the index should exist
Operator starts the crawler in 'indexer' mode (will be available from browsertrix-crawler#884, "Deduplication (Initial Support)")
Collection has a new hasDedupIndex field
Workflows have a new 'dedupCollId' field for dedup while crawling. The dedupCollId must also be a collection that the crawl is auto-added to.
There is a new waiting state, waiting_for_dedup_index, that is entered if a crawl is starting but the index is not yet ready.

Indexing depends on a crawler version that supports indexing mode (1.9.0 beta 0 or higher).

Testing

This is ready for initial frontend work and testing:

the dedupCollId can be set on the workflow to enable dedup for future crawls.
the collection has a hasDedupIndex to indicate if an index is enabled for it.
kubectl get pods -n crawlers collindex should work

tw4l

A promising start!

I left comments/suggestions throughout where I noticed things. We should also add tests for setting dedupCollId for collection and crawlconfig add and update; I don't think it's quite right as-is (at least for collections).

backend/btrixcloud/colls.py

tw4l · 2025-09-30T16:26:52Z

backend/btrixcloud/colls.py

+        # accessing directly to handle both crawls and uploads
        crawl = await self.crawls.find_one({"_id": crawl_id})
-        crawl_coll_ids = crawl.get("collectionIds")
+        crawl_coll_ids = crawl.get("collectionIds") or []


Suggested change

-        crawl_coll_ids = crawl.get("collectionIds") or []
+        crawl_coll_ids = crawl.get("collectionIds", [])

Just a bit more idiomatic
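
One thing worth checking before applying this suggestion: the two forms are not fully equivalent. `dict.get(key, default)` only falls back to the default when the key is missing, while `or []` also covers a key that is present with a `None` value. The sketch below shows the difference. The sample documents and helper names are made up for illustration, and whether stored crawl documents can actually hold an explicit null for `collectionIds` is an assumption, not something confirmed in this PR.

```python
from typing import Any, Dict, List, Optional

# Hypothetical crawl documents, shaped like what find_one() might return.
missing_key: Dict[str, Any] = {"_id": "crawl-1"}
explicit_none: Dict[str, Any] = {"_id": "crawl-2", "collectionIds": None}


def coll_ids_with_or(crawl: Dict[str, Any]) -> List[str]:
    """Fall back to [] when the key is missing or its value is falsy (None, [])."""
    return crawl.get("collectionIds") or []


def coll_ids_with_default(crawl: Dict[str, Any]) -> Optional[List[str]]:
    """Fall back to [] only when the key is missing entirely."""
    return crawl.get("collectionIds", [])


# Both forms agree when the key is absent.
print(coll_ids_with_or(missing_key), coll_ids_with_default(missing_key))

# They differ when the key exists but holds None: the second form returns None.
print(coll_ids_with_or(explicit_none), coll_ids_with_default(explicit_none))
```

If `collectionIds` can be null in existing documents, the `or []` form is the safer one to keep.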

tw4l · 2025-09-30T16:33:51Z

backend/btrixcloud/crawlconfigs.py

+        if update.dedupCollId:
+            if (
+                not update.autoAddCollections
+                or update.dedupCollId not in update.autoAddCollections
+            ):
+                raise HTTPException(
+                    status_code=400, detail="dedup_coll_id_not_in_autoadd"
+                )


This logic needs to account for the possibility that the collection referenced by dedupCollId is already an auto-add collection in the original crawl config, prior to the update. That is also fine, so long as it's not being unset in the update. The seed file validation logic just above is a good example to follow. Since this is a bit tricky to get right, we should have good tests around dedupCollId (setting it in the initial collection add, updating when the auto-add collection is already set, and updating both together).
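
A minimal sketch of the check described in this comment, validating against the effective values after the update rather than only the update payload. The function name, the `DedupCollError` stand-in for `HTTPException`, and the rule that a `None` field in the update means "unchanged" are all assumptions for illustration; the real handler may treat unset fields differently.

```python
from typing import List, Optional
from uuid import UUID, uuid4


class DedupCollError(ValueError):
    """Stand-in for the HTTPException(status_code=400) raised in the handler."""


def validate_dedup_coll(
    orig_auto_add: Optional[List[UUID]],
    update_auto_add: Optional[List[UUID]],
    orig_dedup_coll_id: Optional[UUID],
    update_dedup_coll_id: Optional[UUID],
) -> None:
    """Raise if the dedup collection would not be an auto-add collection
    once the update is applied."""
    # Assumption: a field left as None in the update keeps its original value.
    if update_auto_add is not None:
        effective_auto_add = update_auto_add
    else:
        effective_auto_add = orig_auto_add or []

    if update_dedup_coll_id is not None:
        effective_dedup = update_dedup_coll_id
    else:
        effective_dedup = orig_dedup_coll_id

    if effective_dedup and effective_dedup not in effective_auto_add:
        raise DedupCollError("dedup_coll_id_not_in_autoadd")


coll = uuid4()

# Collection was already auto-add before the update: accepted.
validate_dedup_coll([coll], None, None, coll)

# Auto-add and dedup set together in the same update: accepted.
validate_dedup_coll(None, [coll], None, coll)
```

The same function also rejects an update that drops the collection from auto-add while dedupCollId still points at it, which is one of the cases the tests should cover.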

backend/btrixcloud/crawlmanager.py

backend/btrixcloud/k8sapi.py

backend/btrixcloud/models.py

tw4l · 2025-09-30T16:54:32Z

backend/btrixcloud/operator/collindexes.py

+        }
+
+    def get_import_ts(self, spec: CollIndexSpec, status: CollIndexStatus):
+        """returnt rue if a reimport is needed based on last import date"""


Suggested change

-        """returnt rue if a reimport is needed based on last import date"""
+        """return true if a reimport is needed based on last import date"""

tw4l · 2025-09-30T16:56:02Z

backend/btrixcloud/operator/collindexes.py

+
+    state: TYPE_INDEX_STATES = "initing"
+
+    lastCollUpdated: str = ""


Might suggest renaming to collLastUpdated; for a sec I thought we were checking other collections.

tw4l · 2025-09-30T17:03:44Z

backend/btrixcloud/operator/crawls.py

+            return True
+
+        for index in data.related[COLLINDEX].values():
+            if index.get("status", {}).get("state") == "ready":


Is this the opposite of what we want? I think if the coll index is ready, we want to return False (i.e. don't need to wait, it's ready), right?
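
For reference, here is a sketch of the behavior this comment expects, where a ready index means the crawl no longer has to wait. The function name and the plain-dict input are hypothetical, and treating "no related index yet" as still waiting is an assumption based on the `return True` just above the loop in the diff.

```python
from typing import Any, Dict


def is_waiting_for_index(related_indexes: Dict[str, Dict[str, Any]]) -> bool:
    """Return True while the crawl should keep waiting for the dedup index."""
    for index in related_indexes.values():
        # A ready index means there is nothing left to wait for.
        if index.get("status", {}).get("state") == "ready":
            return False

    # No related index, or none of them ready yet: keep waiting (assumption).
    return True


# An index that is still initializing keeps the crawl waiting.
print(is_waiting_for_index({"idx": {"status": {"state": "initing"}}}))

# Once the index reports "ready", the crawl can proceed.
print(is_waiting_for_index({"idx": {"status": {"state": "ready"}}}))
```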

backend/btrixcloud/uploads.py


ikreymer requested a review from tw4l September 30, 2025 03:18

tw4l reviewed Sep 30, 2025


ikreymer added 15 commits September 30, 2025 22:15

operator crds:

312989d

- add CollIndex crd - add new operator

work

c961e1a

add crud for CollIndex object in collections

04ca4fe

add import job, minimally working

eff04ea

update all models add dedupCollIndex on Crawl and CrawlConfig update crds

add btrix-crds 0.2.0

077bfac

add dedupCollId to crawler, support running crawler with dedup!

ce8fba0

ensure collindex deleted on collection delete

ae36f93

add 'waiting_for_dedup_index' state to indicate crawl is awaiting ded…

bb1c5c0

…up index to be ready before starting

make storage and memory configureable: lower settings for tests

50aee14

configmap: add missing settings

7a80c8f

make dedupCollId independent, but require dedup coll to also be in au…

a2d9693

…toAdd collections

fix typo/formatting

5baf10d

index import channel: support setting custom crawler channel to use f…

e47ed52

…or index import with 'dedup_importer_channel'

configmap: fix quotes

eb777bc

fix autoadd uploads to collections

d8c3a05

ikreymer force-pushed the dedup-initial branch from fdbcaa1 to d8c3a05 Compare October 1, 2025 05:15

ikreymer and others added 5 commits September 30, 2025 22:20

Update backend/btrixcloud/crawlmanager.py

fc7a0f5

Co-authored-by: Tessa Walsh <[email protected]>

Apply suggestion from @tw4l

e24bf3f

Co-authored-by: Tessa Walsh <[email protected]>

Apply suggestion from @tw4l

98c49e3

Co-authored-by: Tessa Walsh <[email protected]>

refactor toggle_dedup_index():

ece2aa3

- add enable_dedup_index() to enable it if it doesn't exist from crawl workflows - in collections, call create_coll_index() / delete_coll_index() when being explicitly add/removed

chart: change index importer nodeAffinity to preferred not required

7251e57
