Open PR for Streaming Partitioning #1502

Open
wants to merge 519 commits into
base: feature/assembly/junction_count


Member

camillescott commented Nov 3, 2016

Open PR for working on streaming partitioning. Note that this PR contains the JunctionCountAssembler changes. Some tentative plans...

  • Implement streaming component tracking. This reimplements the functionality of SubsetPartition, though in parallel; it doesn't use the existing tagging system and doesn't attempt to detect lumps. Mostly sketched out now; still needs testing.
  • Component statistics: track coverage information on a per-component basis. A couple of ways to do this, and some considerations:
    • Track component coverage exactly from all k-mers. Probably slow, and unnecessary.
    • Track via tags. Easy to do with the existing tag implementation.
    • Associate a minhash with each component and track coverage via the minhash k-mers. Cool, and might result in a smaller workload. There are other uses for component minhashes too.
    • How to encode coverage efficiently? Store the parameters of a Poisson distribution (or negative binomial)?
  • Use coverage stats to break up components. Either:
    • Track the stream of coverages and detect a significant change in the coverage distribution when merging components (which suggests we might have connected two erroneously).
    • Implement an algorithm: find the least number of samples (k-mers) to remove where the distribution is better fit by a mixture of two distributions than by one.
    • Or, maybe this is a bad idea. Experimentation will illuminate...
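
To make the plan above a bit more concrete, here is a minimal single-threaded sketch of streaming component tracking with per-component coverage summaries. It is not the code in this PR: the class name StreamingComponents, its methods, and the use of plain string k-mer identifiers are all assumptions for illustration, and the parallel aspect is left out. Each component keeps only three numbers (count, sum, sum of squares), which is enough to recover a mean and variance. The mean would be the rate of a Poisson fit, and a variance well above the mean would point toward a negative binomial instead.

```python
class StreamingComponents:
    """Hypothetical sketch: union-find over k-mer ids with coverage summaries."""

    def __init__(self):
        self.parent = {}  # k-mer id -> parent k-mer id
        self.stats = {}   # root id -> [n, sum, sum of squares] of coverages

    def find(self, kmer):
        # Unseen k-mers start out as their own singleton component.
        if kmer not in self.parent:
            self.parent[kmer] = kmer
            self.stats[kmer] = [0, 0, 0]
        # Path halving keeps the trees shallow.
        while self.parent[kmer] != kmer:
            self.parent[kmer] = self.parent[self.parent[kmer]]
            kmer = self.parent[kmer]
        return kmer

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Attach the component with fewer observations to the other one.
        if self.stats[ra][0] < self.stats[rb][0]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        merged = self.stats.pop(rb)
        for i in range(3):
            self.stats[ra][i] += merged[i]
        return ra

    def add_read(self, kmers, coverages):
        """Join all k-mers seen in one read, then record their coverages."""
        root = None
        for kmer in kmers:
            root = self.find(kmer) if root is None else self.union(root, kmer)
        if root is not None:
            s = self.stats[root]
            for c in coverages:
                s[0] += 1
                s[1] += c
                s[2] += c * c
        return root

    def summary(self, kmer):
        """Return (n, mean, variance) for the component containing kmer."""
        n, total, total_sq = self.stats[self.find(kmer)]
        if n == 0:
            return None
        mean = total / n
        return n, mean, total_sq / n - mean * mean
```

Comparing summary() for two components just before union() joins them is one place where the "significant change in coverage distribution" check could hook in; which statistical test to use there is left open, as in the plan.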

  • Is it mergeable?
  • make test: Did it pass the tests?
  • make clean diff-cover: If it introduces new functionality in
    scripts/, is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle: Is it well
    formatted?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

Member

betatim commented Nov 23, 2016

Digging into this to compare to #1538's benchmarks. Mainly interested in the Cython stuff.

Compiling dc0d0a6 fails with:

lib/khmer.hh:43:11: fatal error: 'cstdint' file not found
#       include <cstdint>

on OSX 10.11.6 with:

$ gcc --version                                                                                                  (khmer-py35)
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

<cstdint> isn't part of the stdlib; for some reason it is still in tr1/. After modifying the #ifdef to use cstdint.h for the moment, I get failures about std::function not being defined. Conclusion: for some reason the compiler isn't a C++11 compiler?? My guess is that some flag somewhere got mangled or dropped :-/

setup.py Outdated
@@ -144,7 +166,7 @@ def check_for_openmp():
EXTRA_COMPILE_ARGS = ['-O3', '-std=c++11', '-pedantic']
EXTRA_LINK_ARGS = []

-if sys.platform == 'darwin':
+if sys.platform == 'darwin' and 'clang' in os.getenv('CC', 'cc'):
Member

Removing this new check fixes the compilation problems. I'm not sure what a better way of detecting clang vs gcc is, though, since on OSX the default compiler is clang but it also responds to calls to gcc, etc. The only thing I can think of is looking at the output of gcc --version. Does anyone use actual gcc on OSX?
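
A minimal sketch of the "look at the output of gcc --version" idea, assuming the compiler named in CC (or plain cc) accepts --version and that a clang-based compiler mentions "clang" somewhere in that output, as the Apple LLVM banner quoted earlier in this thread does. The helper names version_text_is_clang and compiler_is_clang are hypothetical and not part of setup.py.

```python
import os
import shlex
import subprocess


def version_text_is_clang(text):
    """Return True if a `--version` banner looks like it came from clang."""
    return 'clang' in text.lower()


def compiler_is_clang(cc=None):
    """Ask the compiler itself, since on OSX `gcc` may really be clang."""
    cc = cc or os.getenv('CC', 'cc')
    try:
        # CC may carry extra flags (e.g. "gcc -m64"), so split it first.
        # Apple's shim prints part of its banner on stderr; capture both.
        out = subprocess.check_output(shlex.split(cc) + ['--version'],
                                      stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        # Compiler missing or it rejected --version: assume not clang.
        return False
    return version_text_is_clang(out.decode('utf-8', 'replace'))
```

With something like this, the platform check could read `if sys.platform == 'darwin' and compiler_is_clang():`, which would no longer depend on what the CC variable happens to be called.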

Member Author

Ah yes... I was trying to get gcc compilation working on OSX, because the gcc-compiled version is much faster than the clang one, for reasons that aren't clear. This is probably what's borking it on @ctb's machine as well.

betatim and others added 24 commits January 9, 2017 16:43
  • enable static building using pkg-config
  • Resolving the pytest-runner dependency sometimes fails when there is no network but not always. This fixes the symptom but probably isn't the correct long term fix.
  • A halfway house between storing counts in a full byte and only tracking presence/absence. Uses four bits to track counts for each kmer.
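
The four-bit counting idea in the last commit message above can be illustrated with a small sketch: two counters packed into each byte, saturating at 15. This is only an illustration of the packing trick, not the implementation in that commit; the class name NibbleCounts and the idea of addressing counters by a precomputed slot index (for example a k-mer hash modulo the table size) are assumptions.

```python
class NibbleCounts:
    """Hypothetical sketch: saturating 4-bit counters, two per byte."""

    MAX_COUNT = 15  # largest value that fits in four bits

    def __init__(self, n_slots):
        self.n_slots = n_slots
        # Round up so an odd number of slots still gets a final byte.
        self._bytes = bytearray((n_slots + 1) // 2)

    def get(self, slot):
        byte = self._bytes[slot // 2]
        # Odd slots live in the high nibble, even slots in the low nibble.
        return (byte >> 4) if slot % 2 else (byte & 0x0F)

    def add(self, slot):
        count = self.get(slot)
        if count >= self.MAX_COUNT:
            return count  # saturate instead of overflowing into the neighbour
        count += 1
        i = slot // 2
        if slot % 2:
            self._bytes[i] = (self._bytes[i] & 0x0F) | (count << 4)
        else:
            self._bytes[i] = (self._bytes[i] & 0xF0) | count
        return count
```

Compared with one byte per counter this halves memory at the cost of capping counts at 15, which is the trade-off the commit message describes.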
6 participants