Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize to_history_dataset_association in create_datasets_from_library_folder #18970

Merged
merged 4 commits into from
Oct 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 9 additions & 4 deletions lib/galaxy/model/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -6048,7 +6048,9 @@ def __init__(
self.library_dataset = library_dataset
self.user = user

def to_history_dataset_association(self, target_history, parent_id=None, add_to_history=False, visible=None):
def to_history_dataset_association(
self, target_history, parent_id=None, add_to_history=False, visible=None, commit=True
):
sa_session = object_session(self)
hda = HistoryDatasetAssociation(
name=self.name,
Expand All @@ -6072,9 +6074,12 @@ def to_history_dataset_association(self, target_history, parent_id=None, add_to_
sa_session.add(hda)
hda.metadata = self.metadata
if add_to_history and target_history:
target_history.add_dataset(hda)
with transaction(sa_session):
sa_session.commit()
target_history.stage_addition(hda)
if commit:
target_history.add_pending_items()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
target_history.add_pending_items()
target_history.add_pending_items()
flush = False

you know your committing in there, so no need to do again on the next line.

Can you check where this method is used and see if not adding the dataset to a history and the flush=True ever make sense ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flush argument for the add_datasets function is always set to False in the current context, which means it won't trigger a commit. Right?
Also, I noticed that in other places, the datasets are committed after using this function, for example, in this part of the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, I keep forgetting that. It's fine then. The second part still applies, can you check in which cases we do in fact want to commit ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get what should look for exactly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing is different from committing: session.commit() will write to the database AND commit the current transaction, whereas session.flush() will write to the database transaction buffer without committing. What that means is that if we flush after adding or modifying a dataset, and on a subsequent commit there's any kind of db error that causes a rollback, whatever we flushed will be rolled back. If we commit, then the rollback on a subsequent commit won't affect the changes we have committed previously.

We have flush parameters in our code for historical reasons: pre-2.0, SQLAlchemy would commit on flush (with our autocommit=True setting), which is why the terms were used interchangeably. Now it's no longer the case; so my suggestion would be to use argument names consistent with what we do in the function (i.e., flush vs commit).

And with that said, I am not at all sure we should commit in that method :) The code @arash77 referenced replaced the pre-2.0 session.flush: we did that to ensure there was no behavioral change (a commit stayed a commit). Otherwise with hundreds of session.flush lines no longer committing, we'd never know what broke when moving to 2.0. HOWEVER, now that we have upgraded, we can take a hard look at all those commits and consider if we really need them, and, where possible, replace commits with flushes (unless we're optimizing like here, in which case we still can consider whether it's a commit that we need and not a flush).

Also, the code you referenced uses the transaction context manager to ensure that if there is no active transaction, we begin one before trying to commit. If there is a preceding session.add statement, there's guaranteed to be an active transaction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Just to clarify, for this case, should I change flush to commit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Just to clarify, for this case, should I change flush to commit?

Assuming you mean the name of the argument, if you intend to commit then yes, please. If you think flushing might be sufficient (so no session.commit(), then keep it as flush. In other words, as long as the argument is consistent with the action.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I intend to commit here.
Thank you, then I will just rename the argument.

if commit:
with transaction(sa_session):
sa_session.commit()
return hda

def copy(self, parent_id=None, target_folder=None, flush=True):
Expand Down
20 changes: 13 additions & 7 deletions lib/galaxy/webapps/galaxy/services/history_contents.py
Original file line number Diff line number Diff line change
Expand Up @@ -1172,22 +1172,28 @@ def traverse(folder):
rval.append(ld)
return rval

for ld in traverse(folder):
hda = ld.library_dataset_dataset_association.to_history_dataset_association(
history, add_to_history=True
hdas = [
ld.library_dataset_dataset_association.to_history_dataset_association(
history, add_to_history=True, commit=False
)
for ld in traverse(folder)
]
history.add_pending_items()

with transaction(trans.sa_session):
trans.sa_session.commit()

for hda in hdas:
hda_dict = self.hda_serializer.serialize_to_view(
hda, user=trans.user, trans=trans, encode_id=False, **serialization_params.model_dump()
)
rval.append(hda_dict)
return rval

else:
message = f"Invalid 'source' parameter in request: {source}"
raise exceptions.RequestParameterInvalidException(message)

with transaction(trans.sa_session):
trans.sa_session.commit()
return rval

def __create_dataset(
self,
trans,
Expand Down
Loading