Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up transaction management for file_complete handler #930

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

BenGalewsky
Copy link
Contributor

@BenGalewsky BenGalewsky commented Nov 25, 2024

Problem

The TransformerFileComplete resource handler is the most critical code in the entire stack. It responds to each file in the dataset being transformed, and is responsible for updating the total number of files processed (either successfully, or failure) - these two counters are how we determine if the transform is complete. The endpoint will be hit repeatedly by all of the running transformers. Consequently, database transaction handing is very important to avoid missing files.

The current implementation uses implicit transactions and doesn't manage locks and flushing to the DB. It's possible that this allows for files to be lost during big transform requests.

Approach

  1. Us the DB session to explicitly manage transactions
  2. Update the record_file_complete to read the request with with_for_update flag set which will lock the record in the db
  3. The increments to files are handled in the same transaction
  4. To make the file more readable, the retry call arguments are captured in a single, new decorator.file_complete_ops_retry -

With this decorator, I ran into a problem with unit tests. Importing the module caused the current_app.logger expression to be evaluated. This would throw RuntimeError: Working outside of application context. in the unit tests. Worked around this in the decorator to only access that logger if we are inside the flask app

@BenGalewsky BenGalewsky marked this pull request as draft November 25, 2024 21:22
Base automatically changed from delete_fixes to develop November 26, 2024 04:33
@BenGalewsky BenGalewsky force-pushed the file_complete_transaction branch from 70a6702 to c2fb610 Compare December 4, 2024 13:50
@BenGalewsky BenGalewsky requested a review from ponyisi December 4, 2024 19:09
@BenGalewsky BenGalewsky marked this pull request as ready for review December 4, 2024 19:09
else:
transform_req.files_failed += 1

session.flush() # Flush the changes to the database
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the state of the DB if the transaction attempt fails three times? Do we again go out of sync?

What isolation level are we working at? (Are we certainly taking a row lock, or could the transaction fail because it's been updated somewhere else?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three operations. One of them is on table of transform results. Each file complete has it's own row, so there is no contention there.

The part that causes us the most concern is the number of files processed column in the TransformTable. That's the one I put the heaviest level of locking on.

The other interesting one is declaring that the last file has been processed and so we can call the transform complete and shut down the transformers. I would argue that's not catastrophic if it happens twice (It is a big deal when it never happens!)

My biggest concern is deadlock, which is why I make the update of the TransformResult table a different transaction from the update of files remaining. So I think there is no chance for deadlock.

So, but good point about how to handle failures of each transaction/function call in this resource. Problem is, I don't think we have any good rollback options.

  1. The update to the files_complete fails: We are in big trouble. There is no way to recover this and the transform will never quit.
  2. The update to the transform_result fails... I don't think this will have much impact. As Illija points out I'm not sure who uses this table since we just scan the bucket for files to return in the client.
  3. The status update fails. This is also a big deal, but not sure what we could do about it.

I suppose I should put some try/except blocks in this code to manage this, but as I say I don't know what to do if we catch an exception (and a function fails after three tries)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'd say we should try to have an explicit logging of a situation like this. (Some day it might be nice to make an alert on a dashboard or an email?) Is it also possible to mark the transformation failed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done here - it's just an Error log with a distinctive message.

@BenGalewsky BenGalewsky force-pushed the file_complete_transaction branch from c2fb610 to 4a4ca9a Compare December 10, 2024 22:18
@BenGalewsky BenGalewsky requested a review from ponyisi December 16, 2024 16:38
The celery transaction model settings to guarantee we never lose a file
could result in duplicate file reports. Handle this by adding a
unique key to the transformation_result table. Attempt to insert the
record and just report a warning if that fails, but don't increment the
file counter.
@BenGalewsky BenGalewsky force-pushed the file_complete_transaction branch from 8ca7037 to 9ad4732 Compare January 15, 2025 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants