-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CountMatrix bugfix and new tests #43
Conversation
new tests for count matrix computation from BAM updated count matrix test documentation updated count matrix class documentation basic refactoring and code cleanup
47eb3c3
to
e736196
Compare
Codecov Report
@@ Coverage Diff @@
## master #43 +/- ##
==========================================
+ Coverage 94.67% 94.94% +0.26%
==========================================
Files 24 25 +1
Lines 1727 1997 +270
==========================================
+ Hits 1635 1896 +261
- Misses 92 101 +9
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome. It was really fun to read -- you've obviously done a ton of work with bam files and have added some really cool tricks here.
I left you a bunch of questions and requests which are hopefully not too onerous. :)
In case I missed something, I also reviewed the notebook here: HumanCellAtlas/skylab#118
src/sctools/count.py
Outdated
return self._col_index | ||
|
||
@staticmethod | ||
def _get_alignments_grouped_by_query_name_generator(bam_file: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like arguments on a newline to reduce leading whitespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/sctools/count.py
Outdated
""" | ||
with pysam.AlignmentFile(bam_file, mode=open_mode) as bam_records: | ||
for alignment in itertools.groupby(bam_records, key=lambda record: record.query_name): | ||
query_name: str = alignment[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this section could be just:
for (query_name, grouper) in itertools.groupby(bam_records, key=lambda record: record.query_name):
alignments: List[pysam.AlignedSegment] = [a for a in grouper] # or list(grouper)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/sctools/gtf.py
Outdated
gene_name = record.get_attribute('gene_name') | ||
if gene_name is None: | ||
raise ValueError( | ||
'Malformed GTF file detected. Record is of type gene but does not have a ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a fan of f-style strings everywhere (although the repository has not yet been converted).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, and done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While at it, I converted all exception messages to f-stye in the repo.
sp.save_npz(prefix + '.npz', self._matrix, compressed=True) | ||
np.save(prefix + '_row_index.npy', self._row_index) | ||
np.save(prefix + '_col_index.npy', self._col_index) | ||
|
||
@classmethod | ||
def load(cls, prefix: str): | ||
def load(cls, prefix: str) -> 'CountMatrix': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/sctools/test/test_count.py
Outdated
|
||
|
||
@pytest.mark.parametrize('query_sort_order', ['CB_UB_GE_n', 'n']) | ||
def test_count_matrix_from_bam(query_sort_order): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the tests, it looks like the same data is used for each test, but then sorted differently. Could you factor out the synthetic data generation into a pytest.fixture(scope='module')
such that the data generation is only done once? This could speed up testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice suggestion, but let me pass on this one -- it's too much work for little gain. The main speed bottleneck in this test, by and large, is loading chr1 annotations. By the way, it is not uncommon for test suites to take hours to complete! (GATK takes about 1 hour).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, had we used a truncated (or synthetic) set of annotations, we would not have discovered the indexing bug caused by multiple entries per gene!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe pretty strongly in responsive unit testing -- it aids in a fast development cycle. Mocks and fixtures can help with this, and they are not difficult to work with once you get a hang of it. I would bet that @rexwangcc could definitely help with this.
Given what you said about loading the annotations, I'd suggest cutting down on the chr1 annotation sizes, and loading the annotations as a fixture so they only get loaded once for further tests.
Could you use a smaller subset of the annotations but still maintain the same testing quality? This repo is not a great example, as we obviously have testing holes, but the existing tests complete in < 5s, so I think there's a burden to prove that the tests you're adding need to be so much long er.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, truncated chr1 annotations, changed some signatures, added gene names as a fixture. Test is fast now.
src/sctools/test/test_count.py
Outdated
incomplete_alignment_tags_list, | ||
multiple_alignment_tags_list) | ||
|
||
def generate_synthetic_bam_and_counts_matrix(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: for python code, I like to read top to bottom. This function calls a bunch of the private methods below. My tendency would be to move this below those. Do you have any strong opinions on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely agree -- I moved things around for better code readability.
src/sctools/test/test_count.py
Outdated
def _generate_incomplete_alignment_tags(self, num_missing_some_tags: int) -> List[AlignmentRecordTags]: | ||
incomplete_alignment_tags_list: List[AlignmentRecordTags] = list() | ||
for _ in range(num_missing_some_tags): | ||
tag_mask = self.rng.randint(low=0, high=7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I like this way of masking. Very efficient!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
src/sctools/test/test_count.py
Outdated
_test_annotation_file, | ||
_test_gene_expression_rate) | ||
|
||
with tempfile.TemporaryDirectory() as _test_temp_dir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a finally
clause to explicitly shutil.rmtree
the temp dir -- on some operating systems, these files get left around until restart, so it's nice to clean up after ourselves. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it just to be safe, but I have a feeling that this is redundant. There are several instances of improper usage of tempfile.TemporaryDirectory()
in sctools
and that is likely to be the cause of improper cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, made #45.
@@ -1,13 +1,12 @@ | |||
import fileinput |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sorting 👍
src/sctools/test/test_count.py
Outdated
assert np.allclose(csr1.indptr, csr2.indptr) | ||
assert np.allclose(csr1.data, csr2.data) | ||
# assert equality | ||
assert np.allclose(sorted_count_matrix_data_from_bam, sorted_count_matrix_data_expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a docstring to this test that's super clear on what we're testing here? My initial concern was that we're generating synthetic bams and count matrices, and using similar code to transition between them, instead of using some kind of external gold standard that we're aiming to hit.
Done well, these tests are probably way better than some small matrix or bam used to smoke test the functionality, but there's also an opportunity to code 2 + 2 = 5
somewhere into both the count matrix generation and the bam_to_count function.
[moonshot] Longer term, when we've figured out all the discrepancies, what do you think about trying to run the cell ranger code in our tests to verify we're equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an extensive preamble documentation block.
src/sctools/count.py
Outdated
gene_id_tag: str='GE', | ||
open_mode: str='rb', | ||
): | ||
gene_name_tag: str='GE', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question I forgot to ask: are these HGNC gene names that are going in the tags, and not the numerical IDs? If so then this is a good change.
If they are IDs, I'd prefer to keep the name as ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, they actually gene names and not IDs (we reply on dropseq-tools
for tagging genes; it uses gene names by default, there is no way to force it to use gene IDs).
@ambrosejcarr I'm done -- thanks for the thorough review! let me rebase and regroups the commits and it will be good to go for your next review. |
f-style exceptions fastq to FASTQ in fast.py docblocks some code cleanup
90ae59b
to
394fb21
Compare
- refactoring, rearrangement, and cleanup of test_count.py - documentation in test_count.py - AlignmentSortOrder class in bam.py - refactoring further instances of constants to consts.py
394fb21
to
702a2a4
Compare
@ambrosejcarr back to you. |
@ambrosejcarr please feel free to correct my grammar in the docs (I am very sloppy with plain language, specially while coding!) + stylistic inconsistencies -- I could notice a few typos/grammatical issues while browsing the diff! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strongly approve of how you extended from my comments to make some really nice changes. :-)
I want to have a bit more conversation about testing duration, and left a few very minor comments. Besides that I have nothing further.
src/sctools/bam.py
Outdated
@@ -231,27 +236,27 @@ def split(in_bam, out_prefix, *tags, approx_mb_per_split=1000, raise_missing=Tru | |||
if len(tags) == 0: | |||
raise ValueError('At least one tag must be passed') | |||
|
|||
def _cleanup(files_to_counts, files_to_names, rm_all=False) -> None: | |||
def _cleanup(_files_to_counts, _files_to_names, rm_all=False) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These parameters should have types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unclear on why you'd have a leading underscore here. It seems obvious that these methods are private since it's a function with no internal closures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
files_to_counts
and files_to_names
shadow from the outer scope. Even though this specific is OK, it is generally not a good practice to let inner classes use the outer scope var names (it is a recipe for hard-to-catch bugs!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, agree. For some reason I'd extrapolated in my mind such that trailing underscores were used, where as weakly private methods have leading underscores.
single_trailing_underscore_: used by convention to avoid conflicts with Python keyword, e.g.
Tkinter.Toplevel(master, class_='ClassName')
However, I couldn't find it in pep8.
src/sctools/bam.py
Outdated
@@ -260,8 +265,8 @@ def _cleanup(files_to_counts, files_to_names, rm_all=False) -> None: | |||
warnings.warn('Number of requested subfiles (%d) exceeds 500; this may cause OS errors by ' | |||
'exceeding fid limits' % n_subfiles) | |||
if n_subfiles > 1000: | |||
raise ValueError('Number of requested subfiles (%d) exceeds 1000; this will usually cause ' | |||
'OS errors, think about increasing max_mb_per_split.' % n_subfiles) | |||
raise ValueError(f'Number of requested subfiles ({ n_subfiles}) exceeds 1000; this will usually cause ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the leading space before n_subfiles
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, extracted constants.
current_tag : str | ||
the molecule barcode that records in the group all share | ||
|
||
""" | ||
return iter_tag_groups(tag='UB', bam_iterator=bam_iterator) | ||
return iter_tag_groups(tag=consts.MOLECULE_BARCODE_TAG_KEY, bam_iterator=bam_iterator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
;-)
@@ -417,9 +421,69 @@ def iter_genes(bam_iterator: Iterator[pysam.AlignedSegment]) -> Generator: | |||
Yields | |||
------ | |||
grouped_by_tag : Iterator[pysam.AlignedSegment] | |||
reads sharing a unique gene id (``GE`` tag) | |||
reads sharing a unique gene name tag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes I thought these were gene ids, not gene names. Good catch.
src/sctools/gtf.py
Outdated
f'malformed GTF file.') | ||
|
||
|
||
def extract_gene_names(files='-', mode='r', header_comment_char='#') -> Dict[str, int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your feeling on types for parameters with default arguments? Mypy suggests it because it enforces static typing on them (in addition to annotating them as strings, which is obvious given the default). I think I like this, but I'm open to opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added types.
- faster test_count.py (truncated chr1.gtf.gz, pytest.fixture for gene_name_to_index, changed signature of count.from_sorted_tagged_bam) - misc refactoring and code cleanup
@ambrosejcarr back to you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now getting some frustrating and maybe environment-specific installation errors with pysam
, but it looks like the tests pass in under 7 seconds, which is good enough for me. Thanks for putting in the extra effort!
[Edit] I was right about environment-specific -- my computer partially uninstalled xcode. 🙃
This PR fixes some algorithmic issues in the
CountMatrix
class. It also improves related unit tests. The current counting algorithm implemented inCountMatrix.from_sorted_tagged_bam
is expected to be functionally similar to CellRanger methods.Closes #42.