-
-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up transaction management for file_complete handler #930
base: develop
Are you sure you want to change the base?
Conversation
70a6702
to
c2fb610
Compare
servicex_app/servicex_app/resources/internal/transformer_file_complete.py
Show resolved
Hide resolved
else: | ||
transform_req.files_failed += 1 | ||
|
||
session.flush() # Flush the changes to the database |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the state of the DB if the transaction attempt fails three times? Do we again go out of sync?
What isolation level are we working at? (Are we certainly taking a row lock, or could the transaction fail because it's been updated somewhere else?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three operations. One of them is on table of transform results. Each file complete has it's own row, so there is no contention there.
The part that causes us the most concern is the number of files processed column in the TransformTable. That's the one I put the heaviest level of locking on.
The other interesting one is declaring that the last file has been processed and so we can call the transform complete and shut down the transformers. I would argue that's not catastrophic if it happens twice (It is a big deal when it never happens!)
My biggest concern is deadlock, which is why I make the update of the TransformResult table a different transaction from the update of files remaining. So I think there is no chance for deadlock.
So, but good point about how to handle failures of each transaction/function call in this resource. Problem is, I don't think we have any good rollback options.
- The update to the files_complete fails: We are in big trouble. There is no way to recover this and the transform will never quit.
- The update to the transform_result fails... I don't think this will have much impact. As Illija points out I'm not sure who uses this table since we just scan the bucket for files to return in the client.
- The status update fails. This is also a big deal, but not sure what we could do about it.
I suppose I should put some try
/except
blocks in this code to manage this, but as I say I don't know what to do if we catch an exception (and a function fails after three tries)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'd say we should try to have an explicit logging of a situation like this. (Some day it might be nice to make an alert on a dashboard or an email?) Is it also possible to mark the transformation failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done here - it's just an Error log with a distinctive message.
c2fb610
to
4a4ca9a
Compare
The celery transaction model settings to guarantee we never lose a file could result in duplicate file reports. Handle this by adding a unique key to the transformation_result table. Attempt to insert the record and just report a warning if that fails, but don't increment the file counter.
8ca7037
to
9ad4732
Compare
Problem
The TransformerFileComplete resource handler is the most critical code in the entire stack. It responds to each file in the dataset being transformed, and is responsible for updating the total number of files processed (either successfully, or failure) - these two counters are how we determine if the transform is complete. The endpoint will be hit repeatedly by all of the running transformers. Consequently, database transaction handing is very important to avoid missing files.
The current implementation uses implicit transactions and doesn't manage locks and flushing to the DB. It's possible that this allows for files to be lost during big transform requests.
Approach
record_file_complete
to read the request withwith_for_update
flag set which will lock the record in the dbretry
call arguments are captured in a single, new decorator.file_complete_ops_retry
-With this decorator, I ran into a problem with unit tests. Importing the module caused the
current_app.logger
expression to be evaluated. This would throwRuntimeError: Working outside of application context.
in the unit tests. Worked around this in the decorator to only access that logger if we are inside the flask app