add clarity to document and tests to enforce categorical datatypes with string/object values. #1146
Labels
curation software
tech
Tech issues that do not require product prioritization. Tech debt, tooling, ops, etc.
Motivation
There is some ambiguity in the data specification and the corresponding test suite regarding string-encoded categorical datatypes, i.e. columns of type CategoricalDtype(..., categories_dtype=string) or CategoricalDtype(..., categories_dtype=object). Without clarity, idiosyncrasies and edge cases exist both in the internal library and with third-party integration testers (i.e. lattice team). See PR #1142 for an example of an issue that emerged.
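To make the two encodings concrete, here is a minimal pandas sketch. The helper name describe_categories and the sample values are hypothetical and are not part of the validator; the point is only that two columns can both be categorical while their categories are stored with different dtypes. Which dtype pandas infers for string categories by default depends on the pandas version, so the sketch builds both variants explicitly.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def describe_categories(col: pd.Series) -> str:
    """Hypothetical helper: report how a column's categories are stored."""
    if not isinstance(col.dtype, pd.CategoricalDtype):
        return "not categorical"
    cat_dtype = col.dtype.categories.dtype
    # Check object first: is_string_dtype can also accept a bare object dtype.
    if is_object_dtype(cat_dtype):
        return "categorical[object]"
    if is_string_dtype(cat_dtype):
        return "categorical[string]"
    return f"categorical[{cat_dtype}]"


codes = [0, 1, 0]

# Categories stored as Python objects.
object_backed = pd.Series(
    pd.Categorical.from_codes(codes, categories=pd.Index(["a", "b"], dtype=object))
)

# Categories stored with the pandas string extension dtype.
string_backed = pd.Series(
    pd.Categorical.from_codes(codes, categories=pd.Index(["a", "b"], dtype="string"))
)

# Same values, but not a categorical column at all.
plain = pd.Series(["a", "b", "a"])

for name, col in [("object_backed", object_backed),
                  ("string_backed", string_backed),
                  ("plain", plain)]:
    print(name, "->", describe_categories(col))
```

All three columns hold the same values, so a check that only compares values would not tell them apart; the dtype has to be inspected explicitly.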
Additional Details
In the specification for 5.3 assay_ontology_term_id field, it reads: ... Meanwhile, for the {column}_colors field, it reads:
Definition of Done
All dataframe columns in the schema specification document that are expected to be categorical are clearly identified as such. Test fixtures in the testing suite should be updated accordingly and tests should be introduced to test for categorical vs string encoded columns.
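A test along these lines could look like the sketch below. It is an assumption about shape only, not the existing test suite: check_categorical_columns is a hypothetical helper, the ontology term is an illustrative value, and the tests are plain functions with assert so they can be collected by pytest or called directly.

```python
import pandas as pd


def check_categorical_columns(df: pd.DataFrame, expected: list) -> list:
    """Hypothetical check: return a list of problems; empty means the frame passes."""
    problems = []
    for name in expected:
        if name not in df.columns:
            problems.append(f"{name}: column is missing")
            continue
        dtype = df[name].dtype
        if not isinstance(dtype, pd.CategoricalDtype):
            problems.append(f"{name}: expected categorical, found {dtype}")
            continue
        # Accept object-backed and string-backed categories alike,
        # as long as every category value is a str.
        bad = [c for c in dtype.categories if not isinstance(c, str)]
        if bad:
            problems.append(f"{name}: non-string categories {bad}")
    return problems


def test_categorical_column_passes():
    df = pd.DataFrame(
        {"assay_ontology_term_id": pd.Categorical(["EFO:0009922", "EFO:0009922"])}
    )
    assert check_categorical_columns(df, ["assay_ontology_term_id"]) == []


def test_string_encoded_column_is_rejected():
    # Same values, but stored as a plain string/object column.
    df = pd.DataFrame({"assay_ontology_term_id": ["EFO:0009922", "EFO:0009922"]})
    problems = check_categorical_columns(df, ["assay_ontology_term_id"])
    assert len(problems) == 1
    assert "expected categorical" in problems[0]


def test_non_string_categories_are_rejected():
    df = pd.DataFrame({"assay_ontology_term_id": pd.Categorical([1, 2])})
    problems = check_categorical_columns(df, ["assay_ontology_term_id"])
    assert len(problems) == 1
    assert "non-string categories" in problems[0]
```

Whether object-backed and string-backed categories should both be accepted is exactly the question the specification needs to settle; the sketch accepts both only to keep the example small.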
Tasks