Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure that AutoGroups work correctly also when concurrent processes are executed #3650

Merged

Conversation

giovannipizzi
Copy link
Member

If many verdi run processes were run at the same time, there were cases in which the automatic group name generated by the AutoGroup was conflicting, resulting in IntegrityErrors from the database.
This issue fixes this problem by automatically generating a new name until this is unique.
Moreover, many bugs and issues found while addressing this issue are solved.

Fixes #997

@giovannipizzi
Copy link
Member Author

I would suggest not to squash the two commits - I am not 100% sure if the first, alone, would pass all tests, but the changes are splitted in two logical units.

If this is an issue, ok to squash, but then please keep both commit messages as they contain relevant information.

@ramirezfranciscof ramirezfranciscof self-assigned this Apr 1, 2020
In particular:

- deprecated use of 'group-name' in favour of 'group-label'
  both in the internal API of AutoGroups and in the CLI parameters
  of `verdi run`
- Improved the Autogroup implementation and API
- add various tests of the functionality of autogroups
- fixed the CLI parameters to include/exclude groups, that were badly
  broken (and untested). Now it accepts any entrypoint string (before
  it was accepting the first part of an node_type string). Also added
  click validation of the parameters.
- fixed the --group flag to activate/deactivate autogrouping (it was
  wrongly defined and thus always True)
- linter fixes
@giovannipizzi giovannipizzi force-pushed the fix_997_parallel_autogroups branch from 8bea787 to aeded55 Compare April 1, 2020 16:19
@giovannipizzi
Copy link
Member Author

I've rebased this, so it's ready to review (hoping the tests didn't break in the meantime...). @ramirezfranciscof do you want to go through this (together, or alone)?

Also, @sphuber, looking back I realised that I improved the performance of storing nodes.
After we merge, could you please re-run the test to see how many simulations you can submit at max rate, so we get a feeling if this change already gave an important performance boost? (it might!)

Also, let me reiterate the comment on not squashing, if possible.

@giovannipizzi giovannipizzi force-pushed the fix_997_parallel_autogroups branch from aeded55 to 825521f Compare April 2, 2020 07:53
@codecov
Copy link

codecov bot commented Apr 2, 2020

Codecov Report

Merging #3650 into develop will increase coverage by 0.82%.
The diff coverage is 84.69%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3650      +/-   ##
===========================================
+ Coverage    77.18%   78.00%   +0.82%     
===========================================
  Files          457      457              
  Lines        33795    33830      +35     
===========================================
+ Hits         26084    26390     +306     
+ Misses        7711     7440     -271     
Flag Coverage Δ
#django 70.05% <84.69%> (+0.83%) ⬆️
#sqlalchemy 70.87% <84.69%> (+0.82%) ⬆️
Impacted Files Coverage Δ
aiida/tools/ipython/ipython_magics.py 0.00% <0.00%> (ø)
aiida/cmdline/commands/cmd_plugin.py 25.00% <66.66%> (ø)
aiida/backends/testbase.py 93.39% <75.00%> (-1.51%) ⬇️
aiida/plugins/entry_point.py 88.18% <80.00%> (+0.10%) ⬆️
aiida/orm/autogroup.py 86.23% <85.00%> (+16.12%) ⬆️
aiida/cmdline/commands/cmd_run.py 89.55% <86.36%> (-3.44%) ⬇️
aiida/engine/processes/calcjobs/calcjob.py 82.26% <100.00%> (+3.39%) ⬆️
aiida/manage/caching.py 99.28% <100.00%> (ø)
aiida/orm/nodes/node.py 91.21% <100.00%> (+0.34%) ⬆️
aiida/orm/utils/node.py 92.07% <100.00%> (ø)
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 161eeae...9a68100. Read the comment docs.

@giovannipizzi giovannipizzi force-pushed the fix_997_parallel_autogroups branch 2 times, most recently from 0264ff7 to 5d9120a Compare April 2, 2020 09:38
@giovannipizzi
Copy link
Member Author

Sorry, it's taking a bit longer than expected as I found a new unexpected problem in the tests for backwards-migrations.
I have identified now the problem (a backward migration wasn't resetting correctly the migration after being run) and I have just force-pushed a commit that should fix this problem and actually also improve the code style of that specific class.
I have also fixed a few more prospector complaints along the way (for some reason unrelated to my changes, but it was complaining locally...).

@greschd
Copy link
Member

greschd commented Apr 2, 2020

Maybe related to the pre-commit and migration issues encountered in #3738?

@giovannipizzi
Copy link
Member Author

Maybe related to the pre-commit and migration issues encountered in #3738?

Most probably - let me fix it here (actually it's now fixed, I'm just fixing another hopefully final pylint complaint, that BTW I see was also addressed in #3738.

I've already force-pushed, if it works I'd suggest to merge this first.

BTW, I hate when you fix all problems and then prospector starts complaining of something new... ;-)

@giovannipizzi
Copy link
Member Author

Tests pass!
success

@sphuber @ramirezfranciscof this is ready for review, and @ltalirz this should also fix the migration test problem (or at least I fixed a problem in the same place you saw problems in #3738)

@sphuber
Copy link
Contributor

sphuber commented Apr 2, 2020

Before I start reviewing, @giovannipizzi I am afraid you may have made a mistake in your rebasing. You now have two commits, but both with the same commit message

@giovannipizzi
Copy link
Member Author

Before I start reviewing, @giovannipizzi I am afraid you may have made a mistake in your rebasing. You now have two commits, but both with the same commit message

Oh no... how was the reflog trick to get back old commit messages? Anybody knows?

@giovannipizzi
Copy link
Member Author

Never mind, I think I found it. I'll force-push with the correct message.

This also remove an overzelous isinstance check, and
moves additional checks in a cached function that is run only when
storing the very first node (that needs to be put in an autogroup),
making storing of nodes faster (even if times oscillates so it's hard
to estimate exactly by how much).

Also, added logic to allow for concurrent creation of multiple groups
(and test). This fixes aiidateam#997
@giovannipizzi giovannipizzi force-pushed the fix_997_parallel_autogroups branch from 972efcd to 43d852e Compare April 2, 2020 11:13
@giovannipizzi
Copy link
Member Author

Ok @sphuber I fixed the commit message.
I hope I didn't mix up the content of the two commits - in the worst case, ok to squash but keeping text from both commit messages.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @giovannipizzi . Have some comments, maybe easiest to discuss in person on skype/zoom?

aiida/cmdline/commands/cmd_run.py Outdated Show resolved Hide resolved
aiida/backends/testbase.py Show resolved Hide resolved
aiida/cmdline/commands/cmd_run.py Outdated Show resolved Hide resolved
aiida/cmdline/commands/cmd_run.py Show resolved Hide resolved
aiida/orm/autogroup.py Outdated Show resolved Hide resolved
aiida/orm/autogroup.py Show resolved Hide resolved
aiida/orm/autogroup.py Show resolved Hide resolved
aiida/orm/autogroup.py Show resolved Hide resolved
aiida/orm/autogroup.py Outdated Show resolved Hide resolved
aiida/orm/autogroup.py Show resolved Hide resolved
@giovannipizzi
Copy link
Member Author

Addressed all comments in a new commit. Let's see how many rounds of testing I'll need to do this time...

@sphuber
Copy link
Contributor

sphuber commented Apr 3, 2020

You should be able to run the docs and pre-commit also locally. For docs make -C docs html and for pre-commit pre-commit run --all. Then you won't have to push multiple times and depend on the CI

@giovannipizzi
Copy link
Member Author

Yes, thanks.
I know these two - for the docs, I was too optimistic and didn't think to have broken it.

For the pre-commit, the problem is that if I check only the problematic files, I get no error... (with pre-commit run --files tests/cmdline/commands/test_run.py).

I could run --all-files but it takes (on my PC) ~2 min spinning all CPUs at 100%.
Most problematic is the fact that if I run on all files I get errors that I don't see if I just check a single file, and even more weirdly I get errors in new parts of the code randomly popping out (e.g., I just run and I got this:

aiida/tools/ipython/ipython_magics.py
  Line: 37
    pylint: no-name-in-module / No name 'version_info' in module 'IPython'
  Line: 38
    pylint: import-error / Unable to import 'IPython.core'
    pylint: no-name-in-module / No name 'core' in module 'IPython'

I'll fix this (and I already fixed something similar earlier) but it's just very confusing.

BTW, a third test (sqlalchemy, py3.5) failed because it couldn't upload to codecov... Did anybody already see it? Mentioning @csadorf for info

In particular, the most important changes include:
- now the auto-group flag is called `--auto-group`, is
  a flag and is False by default
- only kept `--include` and `--exclude` options, checking
  typestrings and allowing to end with `%`. One benefit
  is that while reimplementing I replaced some `isinstance`
  with string comparisons, with potential further benefits.
- `--include` and `--exclude` are now mutually exclusive
- improved documentation of the main point of the issue
@giovannipizzi giovannipizzi force-pushed the fix_997_parallel_autogroups branch from 3ac609e to ba4b217 Compare April 3, 2020 12:55
@sphuber
Copy link
Contributor

sphuber commented Apr 3, 2020

BTW, a third test (sqlalchemy, py3.5) failed because it couldn't upload to codecov... Did anybody already see it? Mentioning @csadorf for info

This I have also already seen multiple times. For me was also py3.5 but probably just coincidence. Seems like a transient connection problem. Would be nice if this could be couched in a retry or if it could not fail the test. But then again, the coverage probably depends on it.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Gio. Unfortunately I still have some questions. Happy to discuss in person

aiida/cmdline/commands/cmd_run.py Outdated Show resolved Hide resolved
aiida/orm/autogroup.py Outdated Show resolved Hide resolved
aiida/orm/autogroup.py Outdated Show resolved Hide resolved
aiida/orm/autogroup.py Outdated Show resolved Hide resolved
aiida/orm/autogroup.py Outdated Show resolved Hide resolved
aiida/orm/autogroup.py Show resolved Hide resolved
aiida/orm/autogroup.py Show resolved Hide resolved
Most notable: now the % sign can be anywhere in the string
of included or excluded classes, and not only at the end.
@giovannipizzi
Copy link
Member Author

All comments addressed and tests pass.

@giovannipizzi
Copy link
Member Author

This PR is cursed ;-) Fixed the problems after the merge, now ready to be merged!

@sphuber sphuber merged commit f40f0d2 into aiidateam:develop Apr 4, 2020
@sphuber
Copy link
Contributor

sphuber commented Apr 4, 2020

Let the curse be gone! Thanks @giovannipizzi !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Running several verdi run processes independently crashes due to UniquenessError when creating the auto groups
4 participants