
Optimize to_history_dataset_association in create_datasets_from_library_folder #18970

Merged: 4 commits merged into galaxyproject:dev on Oct 11, 2024

Conversation

@arash77 (Collaborator) commented Oct 10, 2024

This pull request optimizes the to_history_dataset_association method in the __create_datasets_from_library_folder function. The method now includes an additional parameter flush which allows for more efficient handling of the transaction. This optimization improves the performance of creating datasets from a library folder.
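For context, the shape of the optimization looks roughly like the following. This is a hypothetical sketch, not the actual Galaxy diff: the loop, the variable names (`library_folder_datasets`, `ldda`, `history`, `session`), and the exact call site are assumptions based on the description above; only `to_history_dataset_association`, `target_history`, `add_to_history`, and `flush` come from the PR itself.

```python
# Hypothetical sketch of the idea behind this PR, not the real Galaxy code.
# to_history_dataset_association(..., flush=False) is assumed to skip the
# per-dataset session commit so the caller can commit once for the whole batch.
hdas = []
for ldda in library_folder_datasets:  # assumed iterable of library datasets
    hda = ldda.to_history_dataset_association(
        target_history=history,
        add_to_history=True,
        flush=False,  # defer the expensive commit
    )
    hdas.append(hda)

session.commit()  # one commit for all datasets instead of one per dataset
```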

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added the area/database Galaxy's database or data access layer label Oct 10, 2024
@github-actions github-actions bot added this to the 24.2 milestone Oct 10, 2024
@@ -6073,8 +6073,9 @@ def to_history_dataset_association(self, target_history, parent_id=None, add_to_
        hda.metadata = self.metadata
        if add_to_history and target_history:
            target_history.add_dataset(hda)
Member:

add_dataset also causes a flush; the modern way of handling this is history.stage_addition and then, at the end, history.add_pending_items.

With optimizations like this it's usually a good idea to profile the changes so you can see a before/after; https://docs.galaxyproject.org/en/master/dev/finding_and_improving_slow_code.html has some pointers on that.
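A rough sketch of the staging pattern being suggested (the method names `stage_addition` and `add_pending_items` are taken from the comment above; the surrounding loop and the `new_hdas` collection are assumptions for illustration):

```python
# Sketch only: record each dataset on the history without flushing per item,
# then persist everything that was staged in one go at the end.
for hda in new_hdas:                    # assumed collection of new HDAs
    target_history.stage_addition(hda)  # stage the addition, no flush yet

target_history.add_pending_items()      # write all staged items at once
```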

@arash77 (Collaborator, Author), Oct 10, 2024:

Importing a library folder with 26 datasets into the history, here are the profiling results before the optimization:

*** PROFILER RESULTS ***
_create (/home/arash/Galaxy/galaxy-arash/lib/galaxy/webapps/galaxy/api/history_contents.py:726)
function called 1 times

         633147 function calls (621986 primitive calls) in 1.851 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 1643 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.859    1.859 history_contents.py:726(_create)
        1    0.000    0.000    1.859    1.859 history_contents.py:508(create)
        1    0.001    0.001    1.857    1.857 history_contents.py:1136(__create_datasets_from_library_folder)
       26    0.001    0.000    1.051    0.040 __init__.py:6051(to_history_dataset_association)
       26    0.000    0.000    0.753    0.029 base.py:760(serialize_to_view)
       26    0.000    0.000    0.753    0.029 hdas.py:637(serialize)
     5321    0.006    0.000    0.705    0.000 attributes.py:552(__get__)
2455/1851    0.004    0.000    0.701    0.000 attributes.py:1063(get)
  888/837    0.002    0.000    0.689    0.001 attributes.py:1108(_fire_loader_callables)
       26    0.000    0.000    0.604    0.023 datasets.py:777(serialize)
       26    0.004    0.000    0.601    0.023 base.py:686(serialize)
      733    0.006    0.000    0.600    0.001 strategies.py:867(_load_for_state)
 1731/631    0.006    0.000    0.561    0.001 state_changes.py:95(_go)
       27    0.000    0.000    0.550    0.020 session.py:1992(commit)
    53/27    0.000    0.000    0.550    0.020 <string>:1(commit)
    53/27    0.001    0.000    0.550    0.020 session.py:1306(commit)
      603    0.009    0.000    0.512    0.001 session.py:2134(_execute_internal)
      577    0.001    0.000    0.493    0.001 session.py:2301(execute)
      681    0.001    0.000    0.489    0.001 elements.py:506(_execute_on_connection)
      681    0.006    0.000    0.487    0.001 base.py:1590(_execute_clauseelement)
      655    0.001    0.000    0.474    0.001 base.py:1374(execute)
      577    0.002    0.000    0.455    0.001 context.py:295(orm_execute_statement)
       53    0.000    0.000    0.449    0.008 base.py:2614(commit)
       53    0.001    0.000    0.449    0.008 base.py:2731(_do_commit)
       53    0.000    0.000    0.448    0.008 base.py:2706(_connection_commit_impl)
       53    0.000    0.000    0.448    0.008 base.py:1131(_commit_impl)
       53    0.000    0.000    0.448    0.008 default.py:701(do_commit)
       53    0.447    0.008    0.447    0.008 {method 'commit' of 'sqlite3.Connection' objects}
      337    0.007    0.000    0.433    0.001 loading.py:526(load_on_pk_identity)
      394    0.011    0.000    0.404    0.001 strategies.py:994(_emit_lazyload)
       26    0.001    0.000    0.393    0.015 __init__.py:3329(add_dataset)
       26    0.002    0.000    0.342    0.013 __init__.py:3294(_next_hid)
      681    0.006    0.000    0.328    0.000 base.py:1791(_execute_context)
      681    0.007    0.000    0.294    0.000 base.py:1850(_exec_single_context)
     1573    0.005    0.000    0.281    0.000 {built-in method builtins.next}
    53/27    0.000    0.000    0.277    0.010 <string>:1(_prepare_impl)
    53/27    0.000    0.000    0.277    0.010 session.py:1271(_prepare_impl)
       26    0.001    0.000    0.276    0.011 session.py:4322(flush)
       26    0.001    0.000    0.275    0.011 session.py:4371(_flush)
      681    0.001    0.000    0.267    0.000 default.py:940(do_execute)

And here are the profiling results after the optimization:

*** PROFILER RESULTS ***
_create (/home/arash/Galaxy/galaxy-arash/lib/galaxy/webapps/galaxy/api/history_contents.py:726)
function called 1 times

         485515 function calls (477225 primitive calls) in 0.814 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 1734 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.819    0.819 history_contents.py:726(_create)
        1    0.000    0.000    0.819    0.819 history_contents.py:508(create)
        1    0.000    0.000    0.814    0.814 history_contents.py:1136(__create_datasets_from_library_folder)
       26    0.000    0.000    0.537    0.021 base.py:760(serialize_to_view)
       26    0.000    0.000    0.537    0.021 hdas.py:637(serialize)
       26    0.000    0.000    0.486    0.019 datasets.py:777(serialize)
     5139    0.005    0.000    0.484    0.000 attributes.py:552(__get__)
       26    0.004    0.000    0.483    0.019 base.py:686(serialize)
2003/1460    0.004    0.000    0.480    0.000 attributes.py:1063(get)
  777/751    0.002    0.000    0.470    0.001 attributes.py:1108(_fire_loader_callables)
      672    0.006    0.000    0.411    0.001 strategies.py:867(_load_for_state)
      337    0.010    0.000    0.348    0.001 strategies.py:994(_emit_lazyload)
      446    0.007    0.000    0.341    0.001 session.py:2134(_execute_internal)
      420    0.001    0.000    0.330    0.001 session.py:2301(execute)
      420    0.002    0.000    0.306    0.001 context.py:295(orm_execute_statement)
      210    0.005    0.000    0.259    0.001 loading.py:526(load_on_pk_identity)
      449    0.001    0.000    0.245    0.001 elements.py:506(_execute_on_connection)
      449    0.004    0.000    0.244    0.001 base.py:1590(_execute_clauseelement)
      423    0.001    0.000    0.236    0.001 base.py:1374(execute)
        1    0.000    0.000    0.171    0.171 history_contents.py:1175(<listcomp>)
      449    0.003    0.000    0.150    0.000 elements.py:668(_compile_w_cache)
       52    0.000    0.000    0.136    0.003 base.py:654(url_for)
       52    0.000    0.000    0.136    0.003 __init__.py:206(__call__)
       52    0.000    0.000    0.136    0.003 applications.py:106(url_path_for)
       26    0.001    0.000    0.136    0.005 __init__.py:6051(to_history_dataset_association)
       52    0.049    0.001    0.135    0.003 routing.py:656(url_path_for)
       26    0.000    0.000    0.115    0.004 datasets.py:731(serialize_creating_job)
      130    0.001    0.000    0.115    0.001 __init__.py:5120(creating_job)
       81    0.001    0.000    0.104    0.001 loading.py:1597(load_scalar_attributes)
       81    0.001    0.000    0.099    0.001 loading.py:487(load_on_ident)
      449    0.004    0.000    0.089    0.000 base.py:1791(_execute_context)
    29432    0.054    0.000    0.086    0.000 routing.py:268(url_path_for)
      420    0.004    0.000    0.081    0.000 context.py:565(orm_setup_cursor_result)
       26    0.000    0.000    0.079    0.003 hdas.py:609(<lambda>)
      461    0.002    0.000    0.078    0.000 langhelpers.py:1265(oneshot)
      449    0.001    0.000    0.076    0.000 cache_key.py:411(_generate_cache_key)
      449    0.003    0.000    0.075    0.000 cache_key.py:347(_generate_cache_key)
 2883/449    0.035    0.000    0.072    0.000 cache_key.py:221(_gen_cache_key)
      420    0.008    0.000    0.072    0.000 loading.py:78(instances)
       26    0.000    0.000    0.066    0.003 hdas.py:617(<lambda>)

sa_session.commit()
target_history.stage_addition(hda)
if flush:
    target_history.add_pending_items()
Member:

Suggested change:
-    target_history.add_pending_items()
+    target_history.add_pending_items()
+    flush = False

You know you're committing in there, so there's no need to do it again on the next line.

Can you check where this method is used and see whether not adding the dataset to a history while flush=True ever makes sense?

@arash77 (Collaborator, Author):

The flush argument for the add_datasets function is always set to False in the current context, which means it won't trigger a commit. Right?
Also, I noticed that in other places, the datasets are committed after using this function, for example, in this part of the code.

Member:

Ah, right, I keep forgetting that. It's fine then. The second part still applies: can you check in which cases we do in fact want to commit?

@arash77 (Collaborator, Author):

I didn't get what I should look for exactly.

Member:

Flushing is different from committing: session.commit() will write to the database AND commit the current transaction, whereas session.flush() will write to the database transaction buffer without committing. What that means is that if we flush after adding or modifying a dataset, and on a subsequent commit there's any kind of db error that causes a rollback, whatever we flushed will be rolled back. If we commit, then the rollback on a subsequent commit won't affect the changes we have committed previously.

We have flush parameters in our code for historical reasons: pre-2.0, SQLAlchemy would commit on flush (with our autocommit=True setting), which is why the terms were used interchangeably. That is no longer the case, so my suggestion would be to use argument names consistent with what we do in the function (i.e., flush vs commit).

And with that said, I am not at all sure we should commit in that method :) The code @arash77 referenced replaced the pre-2.0 session.flush: we did that to ensure there was no behavioral change (a commit stayed a commit). Otherwise with hundreds of session.flush lines no longer committing, we'd never know what broke when moving to 2.0. HOWEVER, now that we have upgraded, we can take a hard look at all those commits and consider if we really need them, and, where possible, replace commits with flushes (unless we're optimizing like here, in which case we still can consider whether it's a commit that we need and not a flush).

Also, the code you referenced uses the transaction context manager to ensure that if there is no active transaction, we begin one before trying to commit. If there is a preceding session.add statement, there's guaranteed to be an active transaction.
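A runnable toy example in plain SQLAlchemy (not Galaxy code; the `Item` model, the in-memory SQLite engine, and the session setup are invented purely for illustration) of the flush-vs-commit behaviour described above:

```python
from sqlalchemy import Integer, String, create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "item"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)


engine = create_engine("sqlite://")  # throwaway in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Item(name="flushed-only"))
    session.flush()      # SQL is emitted into the open transaction...
    session.rollback()   # ...but a rollback discards it: nothing was committed

    session.add(Item(name="committed"))
    session.commit()     # flushes AND commits; a later rollback won't undo this
    session.rollback()   # no-op for the already-committed row

    print(session.scalars(select(Item.name)).all())  # ['committed']
```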

@arash77 (Collaborator, Author):

Thanks for the explanation! Just to clarify, for this case, should I change flush to commit?

Member:

> Thanks for the explanation! Just to clarify, for this case, should I change flush to commit?

Assuming you mean the name of the argument: if you intend to commit, then yes, please. If you think flushing might be sufficient (so no session.commit()), then keep it as flush. In other words, as long as the argument name is consistent with the action.

@arash77 (Collaborator, Author):

Yes, I intend to commit here.
Thank you, then I will just rename the argument.

@mvdbeek (Member) commented Oct 11, 2024

Never seen that before ...
[screenshot attached: 2024-10-11 at 12:41:23]

@arash77 (Collaborator, Author) commented Oct 11, 2024

I updated my branch yesterday but the PR wasn't updated!
[screenshot attached]

@arash77 arash77 closed this Oct 11, 2024
@arash77 arash77 reopened this Oct 11, 2024
@arash77 (Collaborator, Author) commented Oct 11, 2024

I think it is fixed now!

@jdavcs jdavcs merged commit 1eac67a into galaxyproject:dev Oct 11, 2024
52 of 85 checks passed

This PR was merged without a "kind/" label, please correct.

@arash77 arash77 deleted the optimize_history_create branch October 11, 2024 13:26
Labels: area/database (Galaxy's database or data access layer), kind/enhancement