add clarity to document and tests to enforce categorical datatypes with string/object values. #1146

Open
ejmolinelli opened this issue Dec 3, 2024 · 0 comments
Labels
curation software tech Tech issues that do not require product prioritization. Tech debt, tooling, ops, etc.

Comments

Contributor

ejmolinelli commented Dec 3, 2024

Motivation

There is some ambiguity in the data specification and the corresponding test suite regarding string-encoded categorical datatypes, that is, columns of type CategoricalDtype(..., categories_dtype=string) or CategoricalDtype(..., categories_dtype=object).
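
To make the distinction concrete, here is a minimal sketch of how the two encodings can be built and told apart in pandas. It assumes pandas 2.x; `describe_column` is a hypothetical helper used only for illustration, and the ontology IDs are example values. The `categories_dtype=...` notation above is shorthand, so the sketch sets the dtype of the categories through the `categories` index instead.

```python
import pandas as pd

values = ["EFO:0009899", "EFO:0009922", "EFO:0009899"]  # example values only

# Categorical whose categories are held in an object-dtype index.
obj_dtype = pd.CategoricalDtype(categories=pd.Index(sorted(set(values)), dtype=object))
obj_cat = pd.Series(values, dtype=obj_dtype)

# Categorical whose categories are held in a pandas string-dtype index.
str_dtype = pd.CategoricalDtype(categories=pd.Index(sorted(set(values)), dtype="string"))
str_cat = pd.Series(values, dtype=str_dtype)

# Plain, non-categorical column holding the same strings.
plain = pd.Series(values, dtype=object)


def describe_column(col: pd.Series) -> str:
    """Hypothetical helper: report whether a column is categorical and, if so,
    which dtype its categories have."""
    if not isinstance(col.dtype, pd.CategoricalDtype):
        return f"not categorical ({col.dtype})"
    return f"categorical with {col.cat.categories.dtype} categories"


for name, col in [("obj_cat", obj_cat), ("str_cat", str_cat), ("plain", plain)]:
    print(name, "->", describe_column(col))
```

All three columns hold the same strings, so a check that only looks at the values cannot tell them apart; the column dtype and the dtype of its categories have to be inspected separately.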

Without clarity, idiosyncrasies and edge cases arise both in the internal library and for third-party integration testers (i.e. lattice team). See PR #1142 for an example of an issue that emerged.

Additional Details

The specification for the 5.3 assay_ontology_term_id field reads:

categorical with `str` categories. This MUST be an EFO term and either:

... meanwhile, for the {column}_colors field, it reads:

{column}_colors where {column} MUST be the name of a `category` data type column in obs that
is annotated by the data submitter or curator.

Definition of Done

All dataframe columns in the schema specification document that are expected to be categorical are clearly identified as such. Test fixtures in the test suite are updated accordingly, and tests are introduced that distinguish categorical columns from string-encoded columns.
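
As a rough sketch of the kind of check such tests could exercise (not the validator's actual implementation), the function below flags a column that is not categorical, or that is categorical with non-string categories. The name `categorical_str_errors`, the error messages, and the fixture dataframes are all assumptions made for illustration.

```python
import pandas as pd


def categorical_str_errors(df: pd.DataFrame, column: str) -> list[str]:
    """Hypothetical check: return error messages if `column` is not a
    categorical column whose categories are all Python strings."""
    col = df[column]
    if not isinstance(col.dtype, pd.CategoricalDtype):
        return [f"Column '{column}' must be categorical, found dtype '{col.dtype}'."]
    # Inspect the category values themselves, so object-dtype and
    # string-dtype categories are treated the same way.
    non_str = [c for c in col.cat.categories if not isinstance(c, str)]
    if non_str:
        return [f"Column '{column}' has non-string categories: {non_str}."]
    return []


# Illustrative fixtures: one valid column and two that should be rejected.
valid = pd.DataFrame(
    {"assay_ontology_term_id": pd.Categorical(["EFO:0009899", "EFO:0009922"])}
)
plain_strings = pd.DataFrame({"assay_ontology_term_id": ["EFO:0009899", "EFO:0009922"]})
numeric_categories = pd.DataFrame({"assay_ontology_term_id": pd.Categorical([1, 2])})

assert categorical_str_errors(valid, "assay_ontology_term_id") == []
assert len(categorical_str_errors(plain_strings, "assay_ontology_term_id")) == 1
assert len(categorical_str_errors(numeric_categories, "assay_ontology_term_id")) == 1
```

Whether object-dtype and string-dtype categories should both be accepted is exactly the question the specification needs to settle; this sketch accepts both.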

Tasks

@ejmolinelli ejmolinelli added the tech Tech issues that do not require product prioritization. Tech debt, tooling, ops, etc. label Dec 3, 2024