fix: atac fragment processing suggestion #1284
base: main
- fix dask warning.
- run fast anndata tests first.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1284 +/- ##
==========================================
- Coverage 89.70% 89.29% -0.41%
==========================================
Files 20 21 +1
Lines 2341 2373 +32
==========================================
+ Hits 2100 2119 +19
- Misses 241 254 +13
organism_ontology_term_ids = ad.io.read_elem(f["obs"])["organism_ontology_term_id"].unique().astype(str)
if organism_ontology_term_ids.size > 1:
    error_message = (
        "Anndata.obs.organism_ontology_term_id must have a unique value. Found the following values:\n"
nit: if curators are fine with it, np, but this error message reads a little strangely to me. How about "must have exactly 1 unique value"?
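A minimal sketch of how the reworded message could read, assuming the organism values have already been loaded into a plain list. `check_single_organism` is a hypothetical helper for illustration only; the code in this PR reads the values from the h5ad file with anndata.

```python
from typing import Optional


def check_single_organism(term_ids: list[str]) -> Optional[str]:
    """Return an error message if more than one distinct value is present.

    Hypothetical stand-in: the PR reads these values from the h5ad file
    rather than taking a list.
    """
    unique_ids = sorted(set(term_ids))
    if len(unique_ids) > 1:
        return (
            "Anndata.obs.organism_ontology_term_id must have exactly 1 unique value. "
            "Found the following values:\n" + "\n".join(unique_ids)
        )
    return None
```

Returning the message instead of raising keeps the helper easy to collect into a list of errors alongside other checks.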
@@ -143,7 +143,7 @@ def check_anndata_requires_fragment(anndata_file: str) -> bool:
    """
    onto_parser = OntologyParser()
nit: pin to a schema_version? anndata validation is doing so; we should do the same to avoid potential mismatches
could you import the existing instance from validate.py? It would keep the versioned instances in sync.
moved ONTOLOGY_PARSER to its own file to share across modules
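The shared-instance idea can be sketched roughly as below. Everything here is an assumption made so the snippet runs on its own: the `OntologyParser` class is a stand-in for the real parser, `SCHEMA_VERSION` is a placeholder string, and `get_parser` is a hypothetical accessor, not something taken from this PR.

```python
# Stand-in for the real parser class so this sketch is self-contained.
class OntologyParser:
    def __init__(self, schema_version: str):
        self.schema_version = schema_version


# Placeholder value (assumption); the real version would come from the schema package.
SCHEMA_VERSION = "v0.0.0"

# Built once at import time and pinned to a single schema version.
# Other modules import this name instead of constructing their own parser.
ONTOLOGY_PARSER = OntologyParser(schema_version=SCHEMA_VERSION)


def get_parser() -> OntologyParser:
    """Hypothetical accessor returning the one shared instance."""
    return ONTOLOGY_PARSER
```

With a single module owning the instance, the anndata and fragment validators cannot drift onto different schema versions.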
@@ -19,7 +19,7 @@

 from .utils import is_ontological_descendant_of

-logger = logging.getLogger(__name__)
+logger = logging.getLogger("cellxgene-schema")

 # TODO: these chromosome tables should be calculated from the fasta file?
Note: spoke to Trent about this in irl sync. Agreed that this issue should be tracked as a fast-follow for atac-seq validation, as this table should be aligned to the GENCODE version we're using for each pertinent species
definitely agree here! It's just waiting to get out of sync.
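For the fast-follow, deriving the table from the reference could look roughly like this. It is a sketch under assumptions: `chromosome_lengths` is a hypothetical name, and it expects a standard FASTA file (optionally gzipped) where the sequence name is the first token of each `>` header line.

```python
import gzip
from pathlib import Path


def chromosome_lengths(fasta_path: str) -> dict[str, int]:
    """Return {sequence name: length} for a plain or gzipped FASTA file."""
    path = Path(fasta_path)
    opener = gzip.open if path.suffix == ".gz" else open
    lengths: dict[str, int] = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: the name is the first whitespace-delimited token.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence line: add its length to the current record.
                lengths[name] += len(line)
    return lengths
```

Generating the table this way from the same FASTA that matches the GENCODE release would remove the hand-maintained copy that can fall out of sync.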
def validate_anndata(anndata_file: str) -> list[str]:
    errors = [validate_anndata_organism_ontology_term_id(anndata_file), validate_anndata_is_primary_data(anndata_file)]
    return report_errors("Errors found in Anndata file", errors)
Maybe add a warning / note that because the anndata failed these basic checks, we could not validate the fragment-based rules (to account for someone seeing these, fixing them, then being surprised when they get new errors)
changed to "Errors found in Anndata file. Skipping fragment validation."
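The resulting control flow can be sketched as follows. This is an illustration only: `Check`, `run_checks`, and `validate` are hypothetical names, and each check is assumed to be a zero-argument callable that returns an error string or `None`.

```python
from typing import Callable, Optional

# Assumed shape of a check: returns an error message, or None when it passes.
Check = Callable[[], Optional[str]]


def run_checks(checks: list[Check]) -> list[str]:
    """Run each check and keep only the messages that are not None."""
    return [msg for msg in (check() for check in checks) if msg is not None]


def validate(anndata_checks: list[Check], fragment_checks: list[Check]) -> list[str]:
    anndata_errors = run_checks(anndata_checks)
    if anndata_errors:
        # Stop early and say so, so that fragment errors appearing after a
        # fix do not come as a surprise.
        return ["Errors found in Anndata file. Skipping fragment validation.", *anndata_errors]
    return run_checks(fragment_checks)
```

The header makes it explicit that a clean fragment result has not been established yet, only that it was not attempted.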
try:
    fragment_required = check_anndata_requires_fragment(h5ad_file)
    if fragment_required:
        logger.info("Andata requires an ATAC fragment file.")
super nit - Andata typo?
- logger.info("Andata requires an ATAC fragment file.")
+ logger.info("Anndata requires an ATAC fragment file.")
if fragment_required:
    logger.info("Andata requires an ATAC fragment file.")
else:
    logger.info("Andata does not require an ATAC fragment file.")
same typo
    else:
        logger.info("Andata does not require an ATAC fragment file.")
except Exception as e:
    report_errors("Andata does not support ATAC fragment files for the follow reason", [str(e)])
same typo
also follow reason --> following reason
# convert the fragment to a parquet file for faster processing
try:
    parquet_file = convert_to_parquet(fragment_file, tempdir)
except Exception:
Is it possible to propagate the Exception up to the error message?
It looks like for the int columns in the fragment file, I will get a pandas parse exception when the dtypes differ.
For the chromosome column (col 1), there's this exception if there's a value that's not part of the expected categories:
pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type
And for the barcode column, I think any given value is coerced to a string, so then I get the validation error "Barcodes don't match anndata.obs.index".
If it's too much to have more specific error messages for why the conversion failed, then maybe a message like: "Error converting fragment to parquet, check that dtypes for fragment file columns are consistent/match the schema"
The errors should appear that way now
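One way to carry the underlying exception into the reported message is sketched below. `convert_or_report` is a hypothetical wrapper and the conversion step is passed in as a callable so the snippet runs on its own; it is not the code that landed in this PR.

```python
from typing import Callable, Optional


def convert_or_report(convert: Callable[[], str]) -> tuple[Optional[str], Optional[str]]:
    """Return (result, None) on success or (None, error message) on failure."""
    try:
        return convert(), None
    except Exception as exc:
        # Keep the exception type and text so the user can see what failed,
        # for example a dtype parse error versus an unknown chromosome.
        return None, f"Error converting fragment to parquet: {type(exc).__name__}: {exc}"
```

Including the exception class name helps distinguish a pandas parse failure from a pyarrow type-inference failure without needing a separate message per column.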
def report_errors(header: str, errors: list[str]) -> list[str]:
    if any(errors):
        errors = [f"{i}: {e})" for i, e in enumerate(errors) if e is not None]
Enumerate is a nice touch but I'd say not needed as long as each error is its own string/prints on a new line. Makes it easier to just check against the error instead of an error string with the enumeration plus the error message.
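A version without the numbering might look like the sketch below. The name `report_errors_plain` and the return shape (header followed by the messages, or an empty list) are assumptions, since the rest of the original function is not shown here.

```python
from typing import Optional


def report_errors_plain(header: str, errors: list[Optional[str]]) -> list[str]:
    """Drop None entries and, if anything is left, prepend the header.

    Each error stays an unmodified string, so tests can compare against the
    exact message instead of a numbered variant.
    """
    messages = [e for e in errors if e is not None]
    if messages:
        return [header, *messages]
    return []
```

A test can then assert `"some message" in result` directly, with no index prefix to account for.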
left some more notes, but largely looks good. thanks!
Reason for Change
Changes
read support
is not <= 0
Testing
version to PyPI without explicit QA + sign-off from Lattice on all functional CLI changes. They may install the package
version at HEAD of main with
Notes for Reviewer