Dedup Backend Initial Implementation #2868
base: feature-dedup
Conversation
A promising start!
I left comments/suggestions throughout where I noticed things. We should also add tests for setting dedupCollId on collection and crawlconfig add and update; I don't think it's quite right as-is (at least for collections).
```diff
 # accessing directly to handle both crawls and uploads
 crawl = await self.crawls.find_one({"_id": crawl_id})
-crawl_coll_ids = crawl.get("collectionIds")
+crawl_coll_ids = crawl.get("collectionIds") or []
```
Suggested change:
```diff
-crawl_coll_ids = crawl.get("collectionIds") or []
+crawl_coll_ids = crawl.get("collectionIds", [])
```
Just a bit more idiomatic
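A minimal illustration of the difference between the two forms, assuming the stored value could be either missing or explicitly None (which may or may not be possible for this field):

```python
doc = {"collectionIds": None}  # field present but unset

doc.get("collectionIds", [])    # -> None: the default only applies when the key is missing
doc.get("collectionIds") or []  # -> []:   also normalizes None (and other falsy values)
```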
```python
if update.dedupCollId:
    if (
        not update.autoAddCollections
        or update.dedupCollId not in update.autoAddCollections
    ):
        raise HTTPException(
            status_code=400, detail="dedup_coll_id_not_in_autoadd"
        )
```
This logic needs to account for the possibility that the collection referenced by dedupCollId may already be an auto-add collection in the original crawl config, prior to the update, which would also be fine so long as it's not being unset in the update. Can look to the seed file validation logic just above for an example. Since it's a bit tricky to get right, we should have good tests around dedupCollId (setting it in the initial collection add, updating it when the auto-add collection is already set, and updating both together).
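As a rough sketch of the intended check (the helper below and its arguments are hypothetical; it assumes the original crawlconfig is available at update time):

```python
from fastapi import HTTPException


def validate_dedup_coll_id(update, orig_crawlconfig):
    """Hypothetical sketch: the dedup collection must be among the auto-add
    collections that are in effect *after* the update is applied."""
    # dedup collection being requested, falling back to the one already set
    dedup_coll_id = update.dedupCollId or orig_crawlconfig.dedupCollId
    if not dedup_coll_id:
        return

    # if the update doesn't touch autoAddCollections, the existing list still applies
    if update.autoAddCollections is not None:
        effective_auto_add = update.autoAddCollections
    else:
        effective_auto_add = orig_crawlconfig.autoAddCollections or []

    if dedup_coll_id not in effective_auto_add:
        raise HTTPException(
            status_code=400, detail="dedup_coll_id_not_in_autoadd"
        )
```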
```python
def get_import_ts(self, spec: CollIndexSpec, status: CollIndexStatus):
    """returnt rue if a reimport is needed based on last import date"""
```
"""returnt rue if a reimport is needed based on last import date""" | |
"""return true if a reimport is needed based on last import date""" |
```python
state: TYPE_INDEX_STATES = "initing"

lastCollUpdated: str = ""
```
Might suggest renaming to collLastUpdated; for a sec I thought we were checking other collections.
```python
return True

for index in data.related[COLLINDEX].values():
    if index.get("status", {}).get("state") == "ready":
```
Is this the opposite of what we want? I think if the coll index is ready, we want to return False (i.e. don't need to wait, it's ready), right?
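If so, a corrected check might look like this sketch (the function name is illustrative; COLLINDEX and data.related are taken from the snippet above):

```python
def must_wait_for_dedup_index(data) -> bool:
    """Sketch: keep waiting only while no related CollIndex is ready."""
    for index in data.related[COLLINDEX].values():
        if index.get("status", {}).get("state") == "ready":
            # the index is ready, no need to wait
            return False
    # no ready index yet, keep waiting
    return True
```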
Commits:
- add CollIndex crd; add new operator
- update all models; add dedupCollIndex on Crawl and CrawlConfig; update crds
- …up index to be ready before starting
- …toAdd collections
- …or index import with 'dedup_importer_channel'
Co-authored-by: Tessa Walsh <[email protected]>
- add enable_dedup_index() to enable the index from crawl workflows if it doesn't already exist
- in collections, call create_coll_index() / delete_coll_index() when being explicitly added/removed
Fixes #2867
The base is set to the feature branch.
The backend implementation involves:
- a hasDedupIndex field
- dedupCollId, which must also be a collection that the crawl is auto-added to
- a waiting_for_dedup_index state that is entered if a crawl is starting but the index is not yet ready (sketched below)
- for indexing, a dependency on a version of the crawler (1.9.0 beta 0 or higher) that supports indexing mode
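A rough sketch of how that waiting state might fit in at crawl start (illustrative only; the function and the state handling here are hypothetical, not the operator's actual code):

```python
def next_state_at_crawl_start(crawlconfig, collindex_status: dict) -> str:
    """Decide whether a starting crawl can proceed or must wait for its dedup index."""
    if not crawlconfig.dedupCollId:
        # dedup not enabled for this workflow, start normally
        return "starting"

    if collindex_status.get("state") == "ready":
        # the collection's dedup index is ready, the crawl can proceed
        return "starting"

    # dedup enabled but the index isn't ready yet: hold the crawl
    return "waiting_for_dedup_index"
```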
Testing
This is ready for initial frontend work and testing:
- dedupCollId can be set on the workflow to enable dedup for future crawls.
- Collections have hasDedupIndex to indicate if an index is enabled for it.
- kubectl get pods -n crawlers collindex should work.