
Bulk Label Deletion Task #285

Open
nathanielrindlaub opened this issue Dec 10, 2024 · 5 comments

@nathanielrindlaub
Member

Sub-task of #148.

Label deletion is particularly expensive because we need to read each affected Image and Object into memory and inspect each Object's entire Labels array to determine how it needs to be updated once the Label we're trying to delete is removed. This makes it hard to optimize with a bulkWrite() operation, but maybe not impossible?

The first step would be to try to see what efficiency gains we can make by restructuring the deleteAnyLabels() and deleteAnyLabel() methods. The second is likely to move the operation to the task Lambda so that we aren't limited by the 30 second timeout.

Some open questions/to-dos:

  • Benchmark current label deletion times
  • What opportunities are there for the operations to be coerced into a single bulkWrite()? Would there be memory implications?
  • Could some clever schema modifications make this more efficient?

Note: in addition to deleteAnyLabels(), we have a deleteLabels() method that is only used when reverting labelsAdded from the frontend. This gets called when a user adds a bunch of Labels to a bunch of Objects using the multi-image selection menu, then uses ctrl-z to revert them. deleteLabels() is less of a worry right now because it uses bulkWrite(), so it is probably pretty fast and can probably remain a synchronous operation, but I would like to know how many Images/Objects we can accommodate within the 30 second timeout. Also, because DeleteLabelsInput is an array of objects that contain an imageId, objectId, and labelId, I imagine we also face the POST request payload-size bottleneck (the payload must be shorter than 262144 bytes). So it would be good to understand those limits and enforce them / handle them gracefully.
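One way to handle the payload limit gracefully would be to split the input into several requests on the client. The following is a minimal sketch, not the project's implementation: the `LabelRef` shape follows the imageId/objectId/labelId description above, while `chunkLabelRefs`, `utf8ByteLength`, and the `overheadBytes` allowance for the rest of the request body are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical shape, following the description of DeleteLabelsInput entries above.
interface LabelRef {
  imageId: string;
  objectId: string;
  labelId: string;
}

// Limit quoted in this issue: the payload must be shorter than 262144 bytes.
const MAX_PAYLOAD_BYTES = 262144;

// UTF-8 byte length of a string, computed without platform-specific APIs.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: the surrogate pair encodes to 4 bytes in total.
      bytes += 4;
      i++;
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Splits refs into chunks whose JSON array stays strictly below the budget.
// overheadBytes is an assumed allowance for the query text and other fields
// that travel in the same request body.
function chunkLabelRefs(
  refs: LabelRef[],
  maxBytes: number = MAX_PAYLOAD_BYTES,
  overheadBytes: number = 1024,
): LabelRef[][] {
  const budget = maxBytes - overheadBytes;
  const chunks: LabelRef[][] = [];
  let current: LabelRef[] = [];
  let currentBytes = 2; // the surrounding "[" and "]"

  for (const ref of refs) {
    // +1 for a separating comma (slightly conservative for the first entry).
    const refBytes = utf8ByteLength(JSON.stringify(ref)) + 1;
    if (refBytes + 2 >= budget) {
      throw new Error('A single entry does not fit in the payload budget');
    }
    if (current.length > 0 && currentBytes + refBytes >= budget) {
      chunks.push(current);
      current = [];
      currentBytes = 2;
    }
    current.push(ref);
    currentBytes += refBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk could then be sent as its own request. Note that this only addresses payload size; it says nothing about how many entries fit within the 30 second timeout, which would still need to be measured.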

@lessej
Collaborator

lessej commented Dec 15, 2024

@nathanielrindlaub This is turning out to be more complicated than I imagined -- I'm wondering, if an image has an object with multiple validated labels, and we remove the 'first validated,' should the image remain locked (because there's still a validated label) or should it be unlocked?

I think if it remains locked, this query is simplified. Assuming that's the case we could:

  1. Bulk $pull the labelId from image.objects.labels
  2. Remove any objects where image.objects.labels has length 0
  3. Set any objects where image.objects.labels doesn't contain a validated label to unlocked
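As a rough sketch, the three steps above could be expressed as ordered `updateMany` entries for a single bulkWrite() call. The operations are built as plain objects so the shape can be inspected without a database connection. Everything about the schema here is an assumption: the `labelId`, `validated`, and `locked` field names, the `buildLabelDeletionOps` helper, and the `imageIds` argument (assumed to come from an earlier query for affected images, so that steps 2 and 3 do not touch unrelated documents). None of this has been run against a real server.

```typescript
// Minimal local type for one updateMany entry in a bulkWrite() operations array.
interface UpdateManyOp {
  updateMany: {
    filter: Record<string, unknown>;
    update: Record<string, unknown>;
    arrayFilters?: Record<string, unknown>[];
  };
}

// Builds the three steps in order. Field names are hypothetical.
function buildLabelDeletionOps(labelId: string, imageIds: string[]): UpdateManyOp[] {
  const scope = { _id: { $in: imageIds } };
  return [
    {
      // Step 1: pull the label out of every object's labels array.
      updateMany: {
        filter: { ...scope, 'objects.labels.labelId': labelId },
        update: { $pull: { 'objects.$[].labels': { labelId } } },
      },
    },
    {
      // Step 2: remove objects whose labels array is now empty.
      updateMany: {
        filter: { ...scope, objects: { $elemMatch: { labels: { $size: 0 } } } },
        update: { $pull: { objects: { labels: { $size: 0 } } } },
      },
    },
    {
      // Step 3: unlock locked objects that no longer hold a validated label.
      updateMany: {
        filter: scope,
        update: { $set: { 'objects.$[obj].locked': false } },
        arrayFilters: [
          {
            'obj.locked': true,
            'obj.labels': { $not: { $elemMatch: { validated: true } } },
          },
        ],
      },
    },
  ];
}
```

The steps depend on each other, so the array would need to be executed in order (bulkWrite() is ordered by default). Step 3 implements the "stay locked while another validated label remains" variant; unlocking on every deletion would need a different filter.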

@nathanielrindlaub
Member Author

nathanielrindlaub commented Dec 16, 2024

I think my instinct is to unlock the object and not assume that just because another label is validated it should automatically become the new most-accurate source-of-truth. I also think it's probably good to provide some indication to users that something has changed on this image and may need to be re-reviewed, which unlocking the object would do. I could be swayed otherwise though.

I hadn't thought about the approach you're describing and I want to keep entertaining it, but I had been envisioning looking into iterating over an array of Image documents and building up a list of image-specific operations to then execute with a single bulkWrite() call (e.g., similarly to how we perform updateLabels()). I think there might be a path there?? But it would almost certainly involve some pretty complex/difficult to read MongoDB query language updates and may require holding a lot of data in memory.
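To make the in-memory approach concrete, here is a minimal sketch that walks already-loaded Image documents and builds one `updateOne` entry per changed image, ready to pass to a single bulkWrite() call. The document shapes (`LabelDoc`, `ObjectDoc`, `ImageDoc`), the `buildPerImageOps` helper, and the policy of unlocking every object that loses a label are assumptions for illustration, not the existing updateLabels() code.

```typescript
// Hypothetical, simplified document shapes.
interface LabelDoc {
  labelId: string;
  validated: boolean;
}
interface ObjectDoc {
  _id: string;
  locked: boolean;
  labels: LabelDoc[];
}
interface ImageDoc {
  _id: string;
  objects: ObjectDoc[];
}

// One image-specific entry for a bulkWrite() operations array.
interface UpdateOneOp {
  updateOne: {
    filter: { _id: string };
    update: { $set: { objects: ObjectDoc[] } };
  };
}

function buildPerImageOps(images: ImageDoc[], labelId: string): UpdateOneOp[] {
  const ops: UpdateOneOp[] = [];
  for (const image of images) {
    let changed = false;
    const nextObjects: ObjectDoc[] = [];
    for (const obj of image.objects) {
      const remaining = obj.labels.filter((label) => label.labelId !== labelId);
      if (remaining.length === obj.labels.length) {
        nextObjects.push(obj); // object does not carry the label, keep as-is
        continue;
      }
      changed = true;
      if (remaining.length === 0) continue; // no labels left, drop the object
      // Assumed policy: unlock so the object gets flagged for re-review.
      nextObjects.push({ ...obj, labels: remaining, locked: false });
    }
    if (changed) {
      ops.push({
        updateOne: {
          filter: { _id: image._id },
          update: { $set: { objects: nextObjects } },
        },
      });
    }
  }
  return ops;
}
```

This keeps the query language simple at the cost of memory: every affected Image is held in memory and its whole objects array is rewritten, which could also overwrite a concurrent edit to the same image made between the read and the write.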

@lessej
Collaborator

lessej commented Dec 17, 2024

@nathanielrindlaub Gotcha. Thanks for explaining the use case for having that unlocked behavior.

> I hadn't thought about the approach you're describing and I want to keep entertaining it, but I had been envisioning looking into iterating over an array of Image documents and building up a list of image-specific operations to then execute with a single bulkWrite()

Similarly, I didn't think of this 😄. I think we're both hitting at the same thing from different angles though (reduce the number of db calls by 'categorizing' operations in some way and then doing a bulk call).

Like you said, I was finding that the query gets complex really fast. I'm going to explore the approach you mentioned as I think that will be easier to read (from our perspective) and more performant than what we have now. My general approach will be:

  1. Pull list of images that have objects which have the label
  2. Split into images with objects that need to be removed, images with objects that need to be unlocked, images that can just have the label pulled
  3. Perform a bulk operation for each of these lists
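Step 2 of this plan could be sketched as a pure function over the documents returned by step 1. Since one image can contain objects that fall into different categories, this sketch buckets (imageId, objectId) pairs rather than whole images. The document shapes, the `categorizeObjects` helper, and the exact rule for the "unlock" bucket (a locked object losing a validated label while other labels remain) are assumptions, not settled behavior.

```typescript
// Hypothetical, simplified document shapes.
interface LabelDoc {
  labelId: string;
  validated: boolean;
}
interface ObjectDoc {
  _id: string;
  locked: boolean;
  labels: LabelDoc[];
}
interface ImageDoc {
  _id: string;
  objects: ObjectDoc[];
}

interface ObjectRef {
  imageId: string;
  objectId: string;
}
interface DeletionBuckets {
  remove: ObjectRef[]; // the label is the object's only label
  unlock: ObjectRef[]; // locked object loses a validated label, others remain
  pullOnly: ObjectRef[]; // label can simply be pulled
}

function categorizeObjects(images: ImageDoc[], labelId: string): DeletionBuckets {
  const buckets: DeletionBuckets = { remove: [], unlock: [], pullOnly: [] };
  for (const image of images) {
    for (const obj of image.objects) {
      const matching = obj.labels.filter((label) => label.labelId === labelId);
      if (matching.length === 0) continue; // object is unaffected

      const ref: ObjectRef = { imageId: image._id, objectId: obj._id };
      if (matching.length === obj.labels.length) {
        buckets.remove.push(ref);
      } else if (obj.locked && matching.some((label) => label.validated)) {
        buckets.unlock.push(ref);
      } else {
        buckets.pullOnly.push(ref);
      }
    }
  }
  return buckets;
}
```

Each bucket would then feed one bulk operation in step 3. Objects in the `unlock` bucket still need the label pulled as well as the lock cleared.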

@nathanielrindlaub
Member Author

That sounds great @lessej. If you don't mind doing some benchmarking on the current execution times, that would be good to record too.

@lessej
Collaborator

lessej commented Dec 27, 2024

@nathanielrindlaub Happy holidays! I did a little bit of work on this feature. The unlocking operation is the one I think we're going to have the most trouble with because we aren't able to interact with the 'images.objects' collection directly (we can only use it in the context of an image). I would like to pick your brain about filtering operations that are available in MongoDB.
