
Optimize to_history_dataset_association in create_datasets_from_library_folder #18970

Merged: 4 commits merged into galaxyproject:dev on Oct 11, 2024

Conversation

@arash77 (Collaborator) commented Oct 10, 2024

This pull request optimizes the to_history_dataset_association method in the __create_datasets_from_library_folder function. The method now includes an additional parameter flush which allows for more efficient handling of the transaction. This optimization improves the performance of creating datasets from a library folder.
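For context, the shape of the optimization looks roughly like the following. This is a hypothetical sketch, not the actual Galaxy diff: the loop, the variable names (`library_folder_datasets`, `ldda`, `history`, `session`), and the exact call site are assumptions based on the description above; only `to_history_dataset_association`, `target_history`, `add_to_history`, and `flush` come from the PR itself.

```python
# Hypothetical sketch of the idea behind this PR, not the real Galaxy code.
# to_history_dataset_association(..., flush=False) is assumed to skip the
# per-dataset session commit so the caller can commit once for the whole batch.
hdas = []
for ldda in library_folder_datasets:  # assumed iterable of library datasets
    hda = ldda.to_history_dataset_association(
        target_history=history,
        add_to_history=True,
        flush=False,  # defer the expensive commit
    )
    hdas.append(hda)

session.commit()  # one commit for all datasets instead of one per dataset
```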

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added the area/database Galaxy's database or data access layer label Oct 10, 2024
@github-actions github-actions bot added this to the 24.2 milestone Oct 10, 2024
@@ -6073,8 +6073,9 @@ def to_history_dataset_association(self, target_history, parent_id=None, add_to_
        hda.metadata = self.metadata
        if add_to_history and target_history:
            target_history.add_dataset(hda)
Member:

add_dataset also causes a flush; the modern way of handling this is history.stage_addition and then, at the end, history.add_pending_items.

With optimizations like this it's usually a good idea to profile the changes so you can see a before/after; https://docs.galaxyproject.org/en/master/dev/finding_and_improving_slow_code.html has some pointers on that.
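A rough sketch of the staging pattern being suggested (the method names `stage_addition` and `add_pending_items` are taken from the comment above; the surrounding loop and the `new_hdas` collection are assumptions for illustration):

```python
# Sketch only: record each dataset on the history without flushing per item,
# then persist everything that was staged in one go at the end.
for hda in new_hdas:                    # assumed collection of new HDAs
    target_history.stage_addition(hda)  # stage the addition, no flush yet

target_history.add_pending_items()      # write all staged items at once
```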

@arash77 (Collaborator, Author), Oct 10, 2024:

Importing a library folder with 26 datasets into the history, here are the profiling results before the optimization:

*** PROFILER RESULTS ***
_create (/home/arash/Galaxy/galaxy-arash/lib/galaxy/webapps/galaxy/api/history_contents.py:726)
function called 1 times

         633147 function calls (621986 primitive calls) in 1.851 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 1643 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.859    1.859 history_contents.py:726(_create)
        1    0.000    0.000    1.859    1.859 history_contents.py:508(create)
        1    0.001    0.001    1.857    1.857 history_contents.py:1136(__create_datasets_from_library_folder)
       26    0.001    0.000    1.051    0.040 __init__.py:6051(to_history_dataset_association)
       26    0.000    0.000    0.753    0.029 base.py:760(serialize_to_view)
       26    0.000    0.000    0.753    0.029 hdas.py:637(serialize)
     5321    0.006    0.000    0.705    0.000 attributes.py:552(__get__)
2455/1851    0.004    0.000    0.701    0.000 attributes.py:1063(get)
  888/837    0.002    0.000    0.689    0.001 attributes.py:1108(_fire_loader_callables)
       26    0.000    0.000    0.604    0.023 datasets.py:777(serialize)
       26    0.004    0.000    0.601    0.023 base.py:686(serialize)
      733    0.006    0.000    0.600    0.001 strategies.py:867(_load_for_state)
 1731/631    0.006    0.000    0.561    0.001 state_changes.py:95(_go)
       27    0.000    0.000    0.550    0.020 session.py:1992(commit)
    53/27    0.000    0.000    0.550    0.020 <string>:1(commit)
    53/27    0.001    0.000    0.550    0.020 session.py:1306(commit)
      603    0.009    0.000    0.512    0.001 session.py:2134(_execute_internal)
      577    0.001    0.000    0.493    0.001 session.py:2301(execute)
      681    0.001    0.000    0.489    0.001 elements.py:506(_execute_on_connection)
      681    0.006    0.000    0.487    0.001 base.py:1590(_execute_clauseelement)
      655    0.001    0.000    0.474    0.001 base.py:1374(execute)
      577    0.002    0.000    0.455    0.001 context.py:295(orm_execute_statement)
       53    0.000    0.000    0.449    0.008 base.py:2614(commit)
       53    0.001    0.000    0.449    0.008 base.py:2731(_do_commit)
       53    0.000    0.000    0.448    0.008 base.py:2706(_connection_commit_impl)
       53    0.000    0.000    0.448    0.008 base.py:1131(_commit_impl)
       53    0.000    0.000    0.448    0.008 default.py:701(do_commit)
       53    0.447    0.008    0.447    0.008 {method 'commit' of 'sqlite3.Connection' objects}
      337    0.007    0.000    0.433    0.001 loading.py:526(load_on_pk_identity)
      394    0.011    0.000    0.404    0.001 strategies.py:994(_emit_lazyload)
       26    0.001    0.000    0.393    0.015 __init__.py:3329(add_dataset)
       26    0.002    0.000    0.342    0.013 __init__.py:3294(_next_hid)
      681    0.006    0.000    0.328    0.000 base.py:1791(_execute_context)
      681    0.007    0.000    0.294    0.000 base.py:1850(_exec_single_context)
     1573    0.005    0.000    0.281    0.000 {built-in method builtins.next}
    53/27    0.000    0.000    0.277    0.010 <string>:1(_prepare_impl)
    53/27    0.000    0.000    0.277    0.010 session.py:1271(_prepare_impl)
       26    0.001    0.000    0.276    0.011 session.py:4322(flush)
       26    0.001    0.000    0.275    0.011 session.py:4371(_flush)
      681    0.001    0.000    0.267    0.000 default.py:940(do_execute)

And here are the profiling results after the optimization:

*** PROFILER RESULTS ***
_create (/home/arash/Galaxy/galaxy-arash/lib/galaxy/webapps/galaxy/api/history_contents.py:726)
function called 1 times

         485515 function calls (477225 primitive calls) in 0.814 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 1734 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.819    0.819 history_contents.py:726(_create)
        1    0.000    0.000    0.819    0.819 history_contents.py:508(create)
        1    0.000    0.000    0.814    0.814 history_contents.py:1136(__create_datasets_from_library_folder)
       26    0.000    0.000    0.537    0.021 base.py:760(serialize_to_view)
       26    0.000    0.000    0.537    0.021 hdas.py:637(serialize)
       26    0.000    0.000    0.486    0.019 datasets.py:777(serialize)
     5139    0.005    0.000    0.484    0.000 attributes.py:552(__get__)
       26    0.004    0.000    0.483    0.019 base.py:686(serialize)
2003/1460    0.004    0.000    0.480    0.000 attributes.py:1063(get)
  777/751    0.002    0.000    0.470    0.001 attributes.py:1108(_fire_loader_callables)
      672    0.006    0.000    0.411    0.001 strategies.py:867(_load_for_state)
      337    0.010    0.000    0.348    0.001 strategies.py:994(_emit_lazyload)
      446    0.007    0.000    0.341    0.001 session.py:2134(_execute_internal)
      420    0.001    0.000    0.330    0.001 session.py:2301(execute)
      420    0.002    0.000    0.306    0.001 context.py:295(orm_execute_statement)
      210    0.005    0.000    0.259    0.001 loading.py:526(load_on_pk_identity)
      449    0.001    0.000    0.245    0.001 elements.py:506(_execute_on_connection)
      449    0.004    0.000    0.244    0.001 base.py:1590(_execute_clauseelement)
      423    0.001    0.000    0.236    0.001 base.py:1374(execute)
        1    0.000    0.000    0.171    0.171 history_contents.py:1175(<listcomp>)
      449    0.003    0.000    0.150    0.000 elements.py:668(_compile_w_cache)
       52    0.000    0.000    0.136    0.003 base.py:654(url_for)
       52    0.000    0.000    0.136    0.003 __init__.py:206(__call__)
       52    0.000    0.000    0.136    0.003 applications.py:106(url_path_for)
       26    0.001    0.000    0.136    0.005 __init__.py:6051(to_history_dataset_association)
       52    0.049    0.001    0.135    0.003 routing.py:656(url_path_for)
       26    0.000    0.000    0.115    0.004 datasets.py:731(serialize_creating_job)
      130    0.001    0.000    0.115    0.001 __init__.py:5120(creating_job)
       81    0.001    0.000    0.104    0.001 loading.py:1597(load_scalar_attributes)
       81    0.001    0.000    0.099    0.001 loading.py:487(load_on_ident)
      449    0.004    0.000    0.089    0.000 base.py:1791(_execute_context)
    29432    0.054    0.000    0.086    0.000 routing.py:268(url_path_for)
      420    0.004    0.000    0.081    0.000 context.py:565(orm_setup_cursor_result)
       26    0.000    0.000    0.079    0.003 hdas.py:609(<lambda>)
      461    0.002    0.000    0.078    0.000 langhelpers.py:1265(oneshot)
      449    0.001    0.000    0.076    0.000 cache_key.py:411(_generate_cache_key)
      449    0.003    0.000    0.075    0.000 cache_key.py:347(_generate_cache_key)
 2883/449    0.035    0.000    0.072    0.000 cache_key.py:221(_gen_cache_key)
      420    0.008    0.000    0.072    0.000 loading.py:78(instances)
       26    0.000    0.000    0.066    0.003 hdas.py:617(<lambda>)

sa_session.commit()
target_history.stage_addition(hda)
if flush:
    target_history.add_pending_items()
Member:

Suggested change:
-    target_history.add_pending_items()
+    target_history.add_pending_items()
+    flush = False

You know you're committing in there, so there's no need to do it again on the next line.

Can you check where this method is used and see whether not adding the dataset to a history while flush=True ever makes sense?

@arash77 (Collaborator, Author):

The flush argument for the add_datasets function is always set to False in the current context, which means it won't trigger a commit. Right?
Also, I noticed that in other places, the datasets are committed after using this function, for example, in this part of the code.

Member:

Ah, right, I keep forgetting that. It's fine then. The second part still applies: can you check in which cases we do in fact want to commit?

@arash77 (Collaborator, Author):

I didn't get what I should look for exactly.

Member:

Flushing is different from committing: session.commit() will write to the database AND commit the current transaction, whereas session.flush() will write to the database transaction buffer without committing. What that means is that if we flush after adding or modifying a dataset, and on a subsequent commit there's any kind of db error that causes a rollback, whatever we flushed will be rolled back. If we commit, then the rollback on a subsequent commit won't affect the changes we have committed previously.

We have flush parameters in our code for historical reasons: pre-2.0, SQLAlchemy would commit on flush (with our autocommit=True setting), which is why the terms were used interchangeably. That is no longer the case, so my suggestion would be to use argument names consistent with what we do in the function (i.e., flush vs commit).

And with that said, I am not at all sure we should commit in that method :) The code @arash77 referenced replaced the pre-2.0 session.flush: we did that to ensure there was no behavioral change (a commit stayed a commit). Otherwise with hundreds of session.flush lines no longer committing, we'd never know what broke when moving to 2.0. HOWEVER, now that we have upgraded, we can take a hard look at all those commits and consider if we really need them, and, where possible, replace commits with flushes (unless we're optimizing like here, in which case we still can consider whether it's a commit that we need and not a flush).

Also, the code you referenced uses the transaction context manager to ensure that if there is no active transaction, we begin one before trying to commit. If there is a preceding session.add statement, there's guaranteed to be an active transaction.
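A runnable toy example in plain SQLAlchemy (not Galaxy code; the `Item` model, the in-memory SQLite engine, and the session setup are invented purely for illustration) of the flush-vs-commit behaviour described above:

```python
from sqlalchemy import Integer, String, create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "item"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)


engine = create_engine("sqlite://")  # throwaway in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Item(name="flushed-only"))
    session.flush()      # SQL is emitted into the open transaction...
    session.rollback()   # ...but a rollback discards it: nothing was committed

    session.add(Item(name="committed"))
    session.commit()     # flushes AND commits; a later rollback won't undo this
    session.rollback()   # no-op for the already-committed row

    print(session.scalars(select(Item.name)).all())  # ['committed']
```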

@arash77 (Collaborator, Author):

Thanks for the explanation! Just to clarify, for this case, should I change flush to commit?

Member:

> Thanks for the explanation! Just to clarify, for this case, should I change flush to commit?

Assuming you mean the name of the argument: if you intend to commit, then yes, please. If you think flushing might be sufficient (so no session.commit()), then keep it as flush. In other words, as long as the argument name is consistent with the action.

@arash77 (Collaborator, Author):

Yes, I intend to commit here.
Thank you, then I will just rename the argument.

@mvdbeek (Member) commented Oct 11, 2024

Never seen that before ...
[screenshot attached: 2024-10-11 at 12:41:23]

@arash77 (Collaborator, Author) commented Oct 11, 2024

I updated my branch yesterday but the PR wasn't updated!
[screenshot attached]

@arash77 arash77 closed this Oct 11, 2024
@arash77 arash77 reopened this Oct 11, 2024
@arash77 (Collaborator, Author) commented Oct 11, 2024

I think it is fixed now!

@jdavcs jdavcs merged commit 1eac67a into galaxyproject:dev Oct 11, 2024
52 of 85 checks passed

This PR was merged without a "kind/" label, please correct.

@arash77 arash77 deleted the optimize_history_create branch October 11, 2024 13:26
Labels: area/database (Galaxy's database or data access layer), kind/enhancement