Refactor encore conformational_distance_matrix #1145

Merged
merged 15 commits into from
Jan 15, 2017

Conversation

kain88-de
Member

@kain88-de kain88-de commented Dec 31, 2016

Fixes #1144, #1114

Changes made in this Pull Request:

  • refactor conformational_distance_matrix to use joblib for parallel processing
  • allows debugging the parallel called functions
  • rename ncores to n_jobs to have scikit-learn semantics
  • n_jobs=-1 uses all available cores
  • general clean ups
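
For illustration, a minimal sketch of the pattern described in the list above, assuming `joblib` and `numpy` are installed. The names `rmsd` and `pairwise_rmsd_matrix` are hypothetical and are not the functions touched by this PR; the sketch only shows how an `n_jobs` argument can be handed to `joblib.Parallel`.

```python
import numpy as np
from joblib import Parallel, delayed


def rmsd(a, b):
    """Plain RMSD between two (n_atoms, 3) coordinate arrays, no fitting."""
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))


def pairwise_rmsd_matrix(frames, n_jobs=1):
    """Fill a symmetric distance matrix; n_jobs=-1 asks joblib for all cores."""
    n = len(frames)
    # Only the upper triangle is computed because the matrix is symmetric.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Parallel returns the results in the same order as the submitted tasks.
    values = Parallel(n_jobs=n_jobs)(
        delayed(rmsd)(frames[i], frames[j]) for i, j in pairs
    )
    matrix = np.zeros((n, n))
    for (i, j), value in zip(pairs, values):
        matrix[i, j] = matrix[j, i] = value
    return matrix
```

Calling `pairwise_rmsd_matrix(frames, n_jobs=-1)` on an array of shape `(n_frames, n_atoms, 3)` would spread the pair calculations over all available cores, while `n_jobs=1` runs them serially, which is convenient for debugging.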

PR Checklist

  • Tests?
  • Docs?
    - [ ] CHANGELOG updated?
  • Issue raised/referenced?

@kain88-de kain88-de requested a review from mtiberti December 31, 2016 14:11
@kain88-de
Member Author

@mtiberti and @wouterboomsma could you have a look over this when you have time?

@kain88-de kain88-de force-pushed the refactor-encore-confdist branch 6 times, most recently from e1d72b9 to 5dcc40a Compare January 5, 2017 08:25
@kain88-de
Member Author

Yes, it finally builds, so this is ready for a review. (Also, removing all that code gives a big bump to coverage.)

self.stdout.write(str(self))
self.stdout.flush()


def trm_indeces(a, b):
Member

Should be indices not indeces, is it worth fixing the typo here?

Member Author

I'll put that on my todo list

@kain88-de kain88-de force-pushed the refactor-encore-confdist branch from 400199d to 7076b4c Compare January 8, 2017 20:27
@kain88-de
Member Author

@richardjgowers can you have a look at why Travis fails? My fix does not seem to have worked.

@richardjgowers
Member

Cool we're up to +0.4%

@kain88-de
Member Author

Yeah removing code always has this nice effect.

@kain88-de kain88-de mentioned this pull request Jan 11, 2017
7 tasks
else:
    a[0] = b[0]
    a[1] = b[1] + 1

Contributor

Was all this code unnecessary, or was it moved somewhere else? ( @mtiberti wrote this code, so I'm not entirely sure about this change )

Member Author

joblib now takes care of generating well-sized batches to work on. It does some automatic adjustment at the beginning, but we can also give it a rough batch size estimate to use. This work-stealing approach will use all available power until all computations are done. In the old code, if a batch finished early, the core would just idle until all batches were finished.
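
A minimal sketch of the batch size hint mentioned above, assuming a standalone `joblib` install. `joblib.Parallel` accepts a `batch_size` argument that defaults to `"auto"`; the worker `square` and the wrapper `run` are hypothetical stand-ins, not code from this PR.

```python
from joblib import Parallel, delayed


def square(x):
    """Stand-in for one distance calculation."""
    return x * x


def run(values, n_jobs=2, batch_size="auto"):
    # batch_size="auto" lets joblib tune how many tasks it hands to a worker
    # at once; an integer fixes the number of tasks per batch instead.
    return Parallel(n_jobs=n_jobs, batch_size=batch_size)(
        delayed(square)(v) for v in values
    )
```

Both `run(range(10))` and `run(range(10), batch_size=4)` return the same list; only the way the tasks are grouped for the workers differs.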

Contributor

I see. Makes sense.

def test_rmsd_matrix_with_superimposition(self):
@dec.skipif(module_not_found('sklearn'),
            "Test skipped because sklearn is not available.")
def test_rmsd_matrix_with_superimposition(self):
Contributor

Why is sklearn required here? In general, most of encore does not require sklearn. sklearn is only needed when selecting particular clustering or dimensionality reduction methods, but the default options for both use a built-in method.

Member Author

Well it does now. I'm using joblib to do the parallel load balancing. This allows me to debug the changes in #1136 since it actually returns the exceptions that are thrown in the multiprocessing.
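
To illustrate the debugging point above, a small sketch assuming a standalone `joblib` install: an exception raised inside a worker surfaces in the calling process, where it can be caught and inspected. `checked_sqrt` and `safe_run` are hypothetical names used only for this example.

```python
import math

from joblib import Parallel, delayed


def checked_sqrt(x):
    """Hypothetical worker that fails loudly on bad input."""
    if x < 0:
        raise ValueError("negative input: {}".format(x))
    return math.sqrt(x)


def safe_run(values, n_jobs=2):
    try:
        return Parallel(n_jobs=n_jobs)(delayed(checked_sqrt)(v) for v in values)
    except ValueError as err:
        # The worker's exception reaches the caller instead of being lost.
        return "worker failed: {}".format(err)
```

The exact exception class and message formatting can differ between joblib versions, so the sketch only relies on the error being a `ValueError`.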

Member Author

We could also add joblib as a dependency to the mdanalysis package. Then the sklearn guards can be removed.

Contributor

Not sure I understand the connection between sklearn and joblib. Are you saying we could have added a guard on joblib here instead of sklearn? - and that the above works because joblib is a dependency of sklearn?

Member Author

@kain88-de kain88-de Jan 12, 2017

sklearn has developed the joblib library for easier parallel programming in Python. They ship that library as an external dependency in sklearn.externals.joblib, but the joblib library can also be used as a separate package. For now I just decided to use the version bundled with sklearn. I could switch to the standalone package. I hope that makes it clearer.

So everything right now that calculates a conformation distance matrix needs to use sklearn.

Member Author

nope. I might have to replace it with @dec.skipif(module_not_found('sklearn.externals.joblib') to detect only the joblib inside of sklearn. But joblib is a pure python package so installing it isn't any problem. I'll have some time on the weekend to do this.
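
A rough sketch of what an importability check like the one used in these guards could look like, using only the standard library. This is an assumption for illustration and not necessarily how `module_not_found` is implemented in the MDAnalysis test suite.

```python
import importlib


def module_not_found(name):
    """Return True if the module `name` cannot be imported (sketch)."""
    try:
        importlib.import_module(name)
    except ImportError:
        return True
    return False
```

Because dotted names are resolved by `importlib.import_module`, the same helper can check a submodule such as a bundled copy of a package, or the top-level standalone package.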

Member Author

So I mean I'll use the joblib package. Then sklearn won't need to be installed.

Contributor

I see. Thanks. I thought maybe sklearn would expose joblib as a global module when importing sklearn, although now that I think about it, that would be a pretty ugly side effect.

I don't want to turn this into a huge deal. I can live with it either way - and you certainly have a better idea of what the overall strategy is for MDAnalysis. So, your call.

Member

I'd also go with explicit joblib dependency. After all, we might use it elsewhere, too, without sklearn. More discussion in #1159.

Member

... ah sorry, late to the party, you already did it #1145 (comment)

@wouterboomsma
Contributor

@kain88-de Thanks for all the effort in refactoring this code. I agree with most of what you did. However, I'm a bit puzzled about all the sklearn guards that have been added to test_encore. The test guards should only be necessary under very specific circumstances (when using non-standard clustering or dimensionality reduction methods).

@mtiberti
Contributor

Hi everyone,
thanks a lot for this work and sorry to have kept you waiting - I recently got back from winter break and had a bit of backlog to clear. Happy to hear that joblib will improve the performance and usability of the code!

@kain88-de kain88-de force-pushed the refactor-encore-confdist branch 2 times, most recently from 41b1bd2 to a5a0548 Compare January 13, 2017 22:24
@kain88-de
Member Author

I switched to using the joblib package now instead of the version shipped in scikit-learn. This follows our current unwritten rule that compiled packages in mda.analysis.* should only be an optional dependency for small parts of the code.

if 'encore' in mod:
    sys.modules.pop(mod, None)

@block_import('sklearn')
Member

I'm a little confused why this still passes, it should warn when joblib is blocked? I'll have to look into this before merging

Member Author

Those tested packages still depend on sklearn. I shortened the list of tested imported modules. The joblib library is now a dependency and always installed.

Member

Ah ok, thanks!
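
For readers unfamiliar with this kind of test helper, a hypothetical sketch of how a decorator such as `block_import` could make a package unimportable for the duration of one test, using only the standard library. This is an assumption for illustration; the actual helper in the MDAnalysis test suite may be implemented differently.

```python
import builtins
import functools
from unittest import mock


def block_import(package):
    """Decorator sketch: make `import package` fail inside the wrapped call."""
    real_import = builtins.__import__

    def fake_import(name, *args, **kwargs):
        # Block the package itself and any of its submodules.
        if name == package or name.startswith(package + "."):
            raise ImportError("import of {} is blocked".format(name))
        return real_import(name, *args, **kwargs)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The patch is undone as soon as the wrapped function returns.
            with mock.patch.object(builtins, "__import__", fake_import):
                return func(*args, **kwargs)
        return wrapper
    return decorator
```

Modules that were already imported stay cached in `sys.modules`, which is why a test of import-time warnings also has to pop the relevant entries first, as the snippet under review does for the encore modules.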

This is included in scikit-learn. It does work-stealing job balancing for us,
which helps to use the full processor power. The biggest advantage of this,
though, is that I can now include print statements and exceptions for debugging
in the `conf_dist_function`.
Now we only rely on numpy functions and joblib.
This gives the initialization more freedom. We can have more types to choose
from and the metadata can be passed in as a dict and still be correctly handled.
@kain88-de kain88-de force-pushed the refactor-encore-confdist branch from a5a0548 to c069958 Compare January 14, 2017 14:30
@kain88-de
Member Author

I removed the merge conflicts. Anything else that needs changing?

@wouterboomsma
Contributor

@kain88-de Thanks. Looks great to me.

@kain88-de kain88-de merged commit 789a96c into MDAnalysis:develop Jan 15, 2017
@kain88-de kain88-de deleted the refactor-encore-confdist branch January 20, 2017 08:50

5 participants