Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: atac fragment processing suggestion #1284

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Bento007
Copy link
Contributor

@Bento007 Bento007 commented Mar 4, 2025

Reason for Change

Changes

  • silencing dask distributed log messages to make output easier to understand.
  • reformat error message to make them easier to find in the output.
  • check if anndata is atac before any other checks.
  • perform anndata checks before fragments checks to speed of error feedback loop.
  • check that read support is not <= 0
  • check for mismatch chromosome with organism
  • return organisms that are not allowed in the error response.
  • catch pandas error when converting to parquet.

Testing

  • Either list QA steps or reasoning you feel QA is unnecessary
  • Reminder For CLI changes: upon merge, contact Lattice for final sign-off. Do not release a new cellxgene-schema
    version to PyPI without explicit QA + sign-off from Lattice on all functional CLI changes. They may install the package
    version at HEAD of main with
pip install git+https://github.com/chanzuckerberg/single-cell-curation/@main#subdirectory=cellxgene_schema_cli

Notes for Reviewer

Bento007 added 2 commits March 4, 2025 14:45
- fix dask warning.
- run fast anndata tests first.
Copy link

codecov bot commented Mar 4, 2025

Codecov Report

Attention: Patch coverage is 71.42857% with 22 lines in your changes missing coverage. Please review.

Project coverage is 89.29%. Comparing base (ded6f4c) to head (912e5c6).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1284      +/-   ##
==========================================
- Coverage   89.70%   89.29%   -0.41%     
==========================================
  Files          20       21       +1     
  Lines        2341     2373      +32     
==========================================
+ Hits         2100     2119      +19     
- Misses        241      254      +13     
Components Coverage Δ
cellxgene_schema_cli 89.93% <71.42%> (-0.58%) ⬇️
migration_assistant 91.26% <ø> (ø)
schema_bump_dry_run_genes 79.74% <ø> (ø)
schema_bump_dry_run_ontologies 99.53% <ø> (ø)
🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

organism_ontology_term_ids = ad.io.read_elem(f["obs"])["organism_ontology_term_id"].unique().astype(str)
if organism_ontology_term_ids.size > 1:
error_message = (
"Anndata.obs.organism_ontology_term_id must have a unique value. Found the following values:\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if curators are fine with it, np, but this error message reads a little strangely to me. How about 'must have exactly 1 unique value."?

@@ -143,7 +143,7 @@ def check_anndata_requires_fragment(anndata_file: str) -> bool:
"""
onto_parser = OntologyParser()
Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pin to a schema_version? anndata validation is doing so, we should do this to avoid potential mismatches

Copy link
Contributor

@ejmolinelli ejmolinelli Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you import the existing instance from validate.py? It would keep the versioned instances in sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved ONTOLOGY_PARSER to it's own files to share across modules

@@ -19,7 +19,7 @@

from .utils import is_ontological_descendant_of

logger = logging.getLogger(__name__)
logger = logging.getLogger("cellxgene-schema")

# TODO: these chromosome tables should be calculated from the fasta file?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: spoke to Trent about this in irl sync. Agreed that this issue should be tracked as a fast-follow for atac-seq validation, as this table should be aligned to the GENCODE version we're using for each pertinent species

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely agree here! It's just waiting to get out of sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


def validate_anndata(anndata_file: str) -> list[str]:
errors = [validate_anndata_organism_ontology_term_id(anndata_file), validate_anndata_is_primary_data(anndata_file)]
return report_errors("Errors found in Anndata file", errors)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a warning / note that because the anndata failed these basic checks, we could not validate the fragment-based rules (to account for someone seeing these, fixing them, then being surprised when they get new errors)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed do "Errors found in Anndata file. Skipping fragment validation."

try:
fragment_required = check_anndata_requires_fragment(h5ad_file)
if fragment_required:
logger.info("Andata requires an ATAC fragment file.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit - Andata typo?

try:
fragment_required = check_anndata_requires_fragment(h5ad_file)
if fragment_required:
logger.info("Andata requires an ATAC fragment file.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.info("Andata requires an ATAC fragment file.")
logger.info("Anndata requires an ATAC fragment file.")

if fragment_required:
logger.info("Andata requires an ATAC fragment file.")
else:
logger.info("Andata does not require an ATAC fragment file.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same typo

else:
logger.info("Andata does not require an ATAC fragment file.")
except Exception as e:
report_errors("Andata does not support ATAC fragment files for the follow reason", [str(e)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same typo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also follow reason --> following reason

# convert the fragment to a parquet file for faster processing
try:
parquet_file = convert_to_parquet(fragment_file, tempdir)
except Exception:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to propagate the Exception up the to the error message?

It looks like for the int columns in the fragment file, I will get a pandas parse exception with different dtypes.

For the chromosome column (col 1), there's this exception if there's a value that's not part of the expected categories:
pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type

And for the the barcode column, I think any given value is coerced to a string, so then the I get the validation error 'Barcodes don't match anndata.obs.index'

If it's too much to have more specific error messages for why the conversion failed, then maybe a message like : "Error converting fragment to parquet, check that dtypes for fragment file columns are consistent/match the schema"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There errors should appear that way now


def report_errors(header: str, errors: list[str]) -> list[str]:
if any(errors):
errors = [f"{i}: {e})" for i, e in enumerate(errors) if e is not None]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerate is a nice touch but I'd say not needed as long as each error is its own string/prints on a new line. Makes it easier to just check against the error instead of an error string with the enumeration plus the error message.

Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some more notes, but largely looks good. thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants