Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CountMatrix bugfix and new tests #43

Merged
merged 4 commits into from
Jul 25, 2018
Merged

Conversation

mbabadi
Copy link
Contributor

@mbabadi mbabadi commented Jun 27, 2018

This PR fixes some algorithmic issues in the CountMatrix class. It also improves related unit tests. The current counting algorithm implemented in CountMatrix.from_sorted_tagged_bam is expected to be functionally similar to CellRanger methods.

Closes #42.

@mbabadi mbabadi requested a review from ambrosejcarr June 27, 2018 21:01
new tests for count matrix computation from BAM
updated count matrix test documentation
updated count matrix class documentation
basic refactoring and code cleanup
@mbabadi mbabadi force-pushed the mb-cell-ranger-func-equivalence branch from 47eb3c3 to e736196 Compare June 27, 2018 21:03
@codecov-io
Copy link

codecov-io commented Jun 27, 2018

Codecov Report

Merging #43 into master will increase coverage by 0.26%.
The diff coverage is 94.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   94.67%   94.94%   +0.26%     
==========================================
  Files          24       25       +1     
  Lines        1727     1997     +270     
==========================================
+ Hits         1635     1896     +261     
- Misses         92      101       +9
Impacted Files Coverage Δ
src/sctools/test/test_entrypoints.py 100% <100%> (ø) ⬆️
src/sctools/consts.py 100% <100%> (ø)
src/sctools/test/test_bam.py 98.68% <100%> (ø) ⬆️
src/sctools/test/test_gtf.py 100% <100%> (ø) ⬆️
src/sctools/test/test_fastq.py 100% <100%> (ø) ⬆️
src/sctools/encodings.py 96.19% <100%> (ø) ⬆️
src/sctools/__init__.py 100% <100%> (ø) ⬆️
src/sctools/test/test_metrics.py 100% <100%> (ø) ⬆️
src/sctools/test/test_barcode.py 97.64% <100%> (ø) ⬆️
src/sctools/metrics/aggregator.py 91.27% <100%> (+0.05%) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24ed25f...57b9369. Read the comment docs.

Copy link
Member

@ambrosejcarr ambrosejcarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome. It was really fun to read -- you've obviously done a ton of work with bam files and have added some really cool tricks here.

I left you a bunch of questions and requests which are hopefully not too onerous. :)

In case I missed something, I also reviewed the notebook here: HumanCellAtlas/skylab#118

return self._col_index

@staticmethod
def _get_alignments_grouped_by_query_name_generator(bam_file: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like arguments on a newline to reduce leading whitespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""
with pysam.AlignmentFile(bam_file, mode=open_mode) as bam_records:
for alignment in itertools.groupby(bam_records, key=lambda record: record.query_name):
query_name: str = alignment[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this section could be just:

for (query_name, grouper) in itertools.groupby(bam_records, key=lambda record: record.query_name):
    alignments: List[pysam.AlignedSegment] = [a for a in grouper]  # or list(grouper)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

gene_name = record.get_attribute('gene_name')
if gene_name is None:
raise ValueError(
'Malformed GTF file detected. Record is of type gene but does not have a '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a fan of f-style strings everywhere (although the repository has not yet been converted).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While at it, I converted all exception messages to f-stye in the repo.

sp.save_npz(prefix + '.npz', self._matrix, compressed=True)
np.save(prefix + '_row_index.npy', self._row_index)
np.save(prefix + '_col_index.npy', self._col_index)

@classmethod
def load(cls, prefix: str):
def load(cls, prefix: str) -> 'CountMatrix':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍



@pytest.mark.parametrize('query_sort_order', ['CB_UB_GE_n', 'n'])
def test_count_matrix_from_bam(query_sort_order):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the tests, it looks like the same data is used for each test, but then sorted differently. Could you factor out the synthetic data generation into a pytest.fixture(scope='module') such that the data generation is only done once? This could speed up testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion, but let me pass on this one -- it's too much work for little gain. The main speed bottleneck in this test, by and large, is loading chr1 annotations. By the way, it is not uncommon for test suites to take hours to complete! (GATK takes about 1 hour).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, had we used a truncated (or synthetic) set of annotations, we would not have discovered the indexing bug caused by multiple entries per gene!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe pretty strongly in responsive unit testing -- it aids in a fast development cycle. Mocks and fixtures can help with this, and they are not difficult to work with once you get a hang of it. I would bet that @rexwangcc could definitely help with this.

Given what you said about loading the annotations, I'd suggest cutting down on the chr1 annotation sizes, and loading the annotations as a fixture so they only get loaded once for further tests.

Could you use a smaller subset of the annotations but still maintain the same testing quality? This repo is not a great example, as we obviously have testing holes, but the existing tests complete in < 5s, so I think there's a burden to prove that the tests you're adding need to be so much long er.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, truncated chr1 annotations, changed some signatures, added gene names as a fixture. Test is fast now.

incomplete_alignment_tags_list,
multiple_alignment_tags_list)

def generate_synthetic_bam_and_counts_matrix(self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: for python code, I like to read top to bottom. This function calls a bunch of the private methods below. My tendency would be to move this below those. Do you have any strong opinions on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree -- I moved things around for better code readability.

def _generate_incomplete_alignment_tags(self, num_missing_some_tags: int) -> List[AlignmentRecordTags]:
incomplete_alignment_tags_list: List[AlignmentRecordTags] = list()
for _ in range(num_missing_some_tags):
tag_mask = self.rng.randint(low=0, high=7)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I like this way of masking. Very efficient!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

_test_annotation_file,
_test_gene_expression_rate)

with tempfile.TemporaryDirectory() as _test_temp_dir:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a finally clause to explicitly shutil.rmtree the temp dir -- on some operating systems, these files get left around until restart, so it's nice to clean up after ourselves. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it just to be safe, but I have a feeling that this is redundant. There are several instances of improper usage of tempfile.TemporaryDirectory() in sctools and that is likely to be the cause of improper cleanup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, made #45.

@@ -1,13 +1,12 @@
import fileinput
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sorting 👍

assert np.allclose(csr1.indptr, csr2.indptr)
assert np.allclose(csr1.data, csr2.data)
# assert equality
assert np.allclose(sorted_count_matrix_data_from_bam, sorted_count_matrix_data_expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a docstring to this test that's super clear on what we're testing here? My initial concern was that we're generating synthetic bams and count matrices, and using similar code to transition between them, instead of using some kind of external gold standard that we're aiming to hit.

Done well, these tests are probably way better than some small matrix or bam used to smoke test the functionality, but there's also an opportunity to code 2 + 2 = 5 somewhere into both the count matrix generation and the bam_to_count function.

[moonshot] Longer term, when we've figured out all the discrepancies, what do you think about trying to run the cell ranger code in our tests to verify we're equivalent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an extensive preamble documentation block.

gene_id_tag: str='GE',
open_mode: str='rb',
):
gene_name_tag: str='GE',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question I forgot to ask: are these HGNC gene names that are going in the tags, and not the numerical IDs? If so then this is a good change.

If they are IDs, I'd prefer to keep the name as ID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, they actually gene names and not IDs (we reply on dropseq-tools for tagging genes; it uses gene names by default, there is no way to force it to use gene IDs).

@mbabadi
Copy link
Contributor Author

mbabadi commented Jul 6, 2018

@ambrosejcarr I'm done -- thanks for the thorough review! let me rebase and regroups the commits and it will be good to go for your next review.

f-style exceptions
fastq to FASTQ in fast.py docblocks
some code cleanup
@mbabadi mbabadi force-pushed the mb-cell-ranger-func-equivalence branch from 90ae59b to 394fb21 Compare July 9, 2018 20:18
- refactoring, rearrangement, and cleanup of test_count.py
- documentation in test_count.py
- AlignmentSortOrder class in bam.py
- refactoring further instances of constants to consts.py
@mbabadi mbabadi force-pushed the mb-cell-ranger-func-equivalence branch from 394fb21 to 702a2a4 Compare July 9, 2018 20:25
@mbabadi
Copy link
Contributor Author

mbabadi commented Jul 9, 2018

@ambrosejcarr back to you.

@mbabadi
Copy link
Contributor Author

mbabadi commented Jul 10, 2018

@ambrosejcarr please feel free to correct my grammar in the docs (I am very sloppy with plain language, specially while coding!) + stylistic inconsistencies -- I could notice a few typos/grammatical issues while browsing the diff! :)

Copy link
Member

@ambrosejcarr ambrosejcarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly approve of how you extended from my comments to make some really nice changes. :-)

I want to have a bit more conversation about testing duration, and left a few very minor comments. Besides that I have nothing further.

@@ -231,27 +236,27 @@ def split(in_bam, out_prefix, *tags, approx_mb_per_split=1000, raise_missing=Tru
if len(tags) == 0:
raise ValueError('At least one tag must be passed')

def _cleanup(files_to_counts, files_to_names, rm_all=False) -> None:
def _cleanup(_files_to_counts, _files_to_names, rm_all=False) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters should have types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unclear on why you'd have a leading underscore here. It seems obvious that these methods are private since it's a function with no internal closures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files_to_counts and files_to_names shadow from the outer scope. Even though this specific is OK, it is generally not a good practice to let inner classes use the outer scope var names (it is a recipe for hard-to-catch bugs!).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, agree. For some reason I'd extrapolated in my mind such that trailing underscores were used, where as weakly private methods have leading underscores.

single_trailing_underscore_: used by convention to avoid conflicts with Python keyword, e.g.

Tkinter.Toplevel(master, class_='ClassName')

However, I couldn't find it in pep8.

@@ -260,8 +265,8 @@ def _cleanup(files_to_counts, files_to_names, rm_all=False) -> None:
warnings.warn('Number of requested subfiles (%d) exceeds 500; this may cause OS errors by '
'exceeding fid limits' % n_subfiles)
if n_subfiles > 1000:
raise ValueError('Number of requested subfiles (%d) exceeds 1000; this will usually cause '
'OS errors, think about increasing max_mb_per_split.' % n_subfiles)
raise ValueError(f'Number of requested subfiles ({ n_subfiles}) exceeds 1000; this will usually cause '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the leading space before n_subfiles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, extracted constants.

current_tag : str
the molecule barcode that records in the group all share

"""
return iter_tag_groups(tag='UB', bam_iterator=bam_iterator)
return iter_tag_groups(tag=consts.MOLECULE_BARCODE_TAG_KEY, bam_iterator=bam_iterator)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

;-)

@@ -417,9 +421,69 @@ def iter_genes(bam_iterator: Iterator[pysam.AlignedSegment]) -> Generator:
Yields
------
grouped_by_tag : Iterator[pysam.AlignedSegment]
reads sharing a unique gene id (``GE`` tag)
reads sharing a unique gene name tag
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes I thought these were gene ids, not gene names. Good catch.

f'malformed GTF file.')


def extract_gene_names(files='-', mode='r', header_comment_char='#') -> Dict[str, int]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your feeling on types for parameters with default arguments? Mypy suggests it because it enforces static typing on them (in addition to annotating them as strings, which is obvious given the default). I think I like this, but I'm open to opinions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added types.

- faster test_count.py (truncated chr1.gtf.gz, pytest.fixture for gene_name_to_index, changed signature of count.from_sorted_tagged_bam)
- misc refactoring and code cleanup
@mbabadi
Copy link
Contributor Author

mbabadi commented Jul 16, 2018

@ambrosejcarr back to you.

Copy link
Member

@ambrosejcarr ambrosejcarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now getting some frustrating and maybe environment-specific installation errors with pysam, but it looks like the tests pass in under 7 seconds, which is good enough for me. Thanks for putting in the extra effort!

[Edit] I was right about environment-specific -- my computer partially uninstalled xcode. 🙃

@mbabadi mbabadi merged commit 9bff1c6 into master Jul 25, 2018
@mbabadi mbabadi deleted the mb-cell-ranger-func-equivalence branch July 25, 2018 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix and improve the test case for CountMatrix
3 participants