This program converts a simple CSV file into a HuBMAP CCF ASCT+B table.
The sampledata directory includes the file "demo-input.txt", which was exported from "demo-input.xlsx" using Excel (Save As "Tab delimited Text"). The program's output is a CSV file; the example file "demo-output.csv" was generated by the program.
See the ChangeLog for the latest developments.
See the Issue Tracker for known issues.
The following assumptions are built into the program.
- The ASCT+B table format allows anatomical structures that are not "leaves" to contain biomarkers or references.
- All anatomical structures must be uniquely named. For example, there cannot be two structures called "ovary", but there can be a "left ovary" and a "right ovary".
- Cell type is only one level.
- Commas cannot be used in names for anatomical structures, cells, or features.
- It is assumed that the "author preferred name" is unique across anatomical structures and ontology IDs.
The program performs the following data validation checks.
- Check that there is only one root to the anatomical structure.
- Enforce the parent requirements for anatomical structure. By default an anatomical structure can have multiple parents. For example, the primary ovarian follicle and the primordial ovarian follicle both have a granulosa cell layer. A command line argument can change this behavior such that anatomical structures can have only one parent.
- Check that anatomical structures, cells, biomarkers and references are appropriately defined. By default the program requires that all features be explicitly defined, although a command line argument can disable this requirement.
- Check that anatomical structures, cell types, biomarkers and references all have unique names.
- Check that names do not contain commas.
- Check that biomarkers and references are only applied to anatomical structures and cell types.
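The root, parent, uniqueness, and comma checks listed above can be illustrated with a short Python sketch. This is not the code from process.py; it is a minimal illustration using only the standard library, and the names validate, rows, and unique_parent are hypothetical. It assumes each anatomical structure is given as a (name, list of child names) pair.

```python
from collections import Counter, defaultdict


def validate(rows, unique_parent=False):
    """Return a list of error messages for (name, children) pairs.

    Hypothetical helper for illustration; not taken from process.py.
    """
    errors = []
    names = [name for name, _ in rows]

    # Every name must be unique and must not contain a comma.
    for name, count in Counter(names).items():
        if count > 1:
            errors.append(f"duplicate name: {name!r}")
        if "," in name:
            errors.append(f"name contains a comma: {name!r}")

    # Record the set of parents for every child that is referenced.
    parents = defaultdict(set)
    for parent, children in rows:
        for child in children:
            parents[child].add(parent)

    # Every referenced child must itself be defined.
    defined = set(names)
    for child in sorted(parents):
        if child not in defined:
            errors.append(f"undefined child: {child!r}")

    # Exactly one structure may have no parent: the root.
    roots = sorted(name for name in defined if name not in parents)
    if len(roots) != 1:
        errors.append(f"expected one root, found {len(roots)}: {roots}")

    # Optional stricter mode, similar in spirit to the single-parent option.
    if unique_parent:
        for child in sorted(parents):
            if len(parents[child]) > 1:
                errors.append(f"{child!r} has more than one parent")

    return errors


rows = [
    ("ovary", ["primary ovarian follicle", "primordial ovarian follicle"]),
    ("primary ovarian follicle", ["granulosa cell layer"]),
    ("primordial ovarian follicle", ["granulosa cell layer"]),
    ("granulosa cell layer", []),
]
print(validate(rows))
print(validate(rows, unique_parent=True))
```

With the default settings the shared granulosa cell layer is accepted; with unique_parent=True the same data produces one error, which mirrors the two behaviors described above.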
This program has only been tested on macOS using Python 3, although it should also work on Linux.
The program requires the anytree Python package.
https://pypi.org/project/anytree/
The anytree package can be installed as follows.
python3 -m pip install anytree --user
usage: process.py [-h] [-m] [-u] [-d] [-v] input output
Generate ASCT+B table.
positional arguments:
input Input file
output Output file (CSV)
optional arguments:
-h, --help show this help message and exit
-m, --missing Ignore missing cell types, biomarkers and references. For example, if a cell type is marked as containing a biomarker that wasn't defined, this flag would prevent the program from exiting with an error and instead the ASCT+B table would be generated. When the flag isn't used, all features must be defined.
-u, --unique Make sure all anatomical structures have one and only one parent.
-d, --dot Output tree as a DOT file for plotting with Graphviz.
-v, --verbose Print the tree to the terminal.
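The help text above has the layout produced by Python's argparse module. As a rough sketch, a parser with the same options could be declared as follows. That process.py is actually written this way is an assumption, and build_parser is a hypothetical name; the flag names and help strings are taken from the usage message above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="process.py", description="Generate ASCT+B table."
    )
    parser.add_argument("input", help="Input file")
    parser.add_argument("output", help="Output file (CSV)")
    # Each option is a simple on/off switch, matching the usage line.
    parser.add_argument("-m", "--missing", action="store_true",
                        help="Ignore missing cell types, biomarkers and references.")
    parser.add_argument("-u", "--unique", action="store_true",
                        help="Make sure all anatomical structures have one and only one parent.")
    parser.add_argument("-d", "--dot", action="store_true",
                        help="Output tree as a DOT file for plotting with Graphviz.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print the tree to the terminal.")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-u", "demo-input.txt", "demo-output.csv"])
print(args.unique, args.missing, args.input, args.output)
```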
To process the demo input file and generate a CSV file that can be opened by Excel:
process.py <input CSV file> <output CSV file>
process.py demo-input.txt demo-output.csv
The comma delimited file (tab separated is also supported) must contain a header line and the following fifteen columns:
NAME (REF DOI), LABEL (REF DETAILS), ID (REF NOTES), NOTE, ABBR, TYPE, CHILDREN, CELLS, GENES, PROTEINS, PROTEOFORMS, LIPIDS, METABOLITES, FTUs, REFERENCES
The Type value needs to be "AS" for anatomical structures and "CT" for cell types. It doesn't matter what type values are used for the other items, as long as the value is neither AS nor CT.
Children is a comma separated list of child anatomical structure (AS) objects. These children must be anatomical structures (AS). The Cells, Genes, Proteins, Proteoforms, etc. fields should be comma separated lists of the appropriate objects (e.g., Cells should be a comma separated list of relevant cells). In all cases the object's Name or Ref DOI should be used.
The first line in the input file is assumed to contain a header and is ignored.
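To make the input conventions concrete, the following sketch reads a tab separated file with Python's csv module, skips the header line, and splits the comma separated Children and Cells fields. It is an illustration only, not the parser used by process.py. The column positions are assumed from the header order given above, and the names read_rows, split_list, and the index constants are hypothetical.

```python
import csv
import io

# Column positions assumed from the documented header order (0-based).
NAME, TYPE, CHILDREN, CELLS = 0, 5, 6, 7


def split_list(field):
    """Split a comma separated field into trimmed, non-empty items."""
    return [item.strip() for item in field.split(",") if item.strip()]


def read_rows(handle, delimiter="\t"):
    """Read rows from an open text file, ignoring the header line."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # the first line is a header and is ignored
    rows = []
    for record in reader:
        # Skip blank lines and rows without a name.
        if not record or not record[NAME].strip():
            continue
        # Pad short rows so the columns used here can always be indexed.
        record = record + [""] * (CELLS + 1 - len(record))
        rows.append({
            "name": record[NAME].strip(),
            "type": record[TYPE].strip(),
            "children": split_list(record[CHILDREN]),
            "cells": split_list(record[CELLS]),
        })
    return rows


# Small in-memory example; the placement of the ontology ID is an assumption.
demo = (
    "NAME\tLABEL\tID\tNOTE\tABBR\tTYPE\tCHILDREN\tCELLS\n"
    "ovary\t\tUBERON:0000992\t\t\tAS\tcentral ovary, lateral ovary\t\n"
    "hilar cell\t\tCL:0002095\t\t\tCT\n"
)
for row in read_rows(io.StringIO(demo)):
    print(row)
```

For a comma delimited input, passing delimiter="," lets the csv module handle quoted fields, which matters because the list fields themselves contain commas.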
The following example is incomplete and just included to exemplify the field values and usage:
NAME (REF DOI) LABEL (REF DETAILS) ID (REF NOTES) NOTE ABBR TYPE CHILDREN CELLS GENES PROTEINS PROTEOFORMS LIPIDS METABOLITES FTU REFERENCES (NAME/DOI)
ovary UBERON:0000992 AS central ovary, lateral ovary, medial ovary, mesovarium, ovarian ligament hilum of ovary
central ovary AS central inferior ovary, central superior ovary
lateral ovary AS lateral inferior ovary, lateral superior ovary
medial ovary AS medial inferior ovary, medial superior ovary
mesovarium UBERON:0001342 AS
ovarian ligament UBERON:0008847 AS
hilum of ovary AS ovarian artery, ovarian vein, pampiniform plexus, rete ovarii hilar cell
corona radiata CL:0000713 CT doi:10.1093/oxfordjournals.humrep.a136365
hilar cell CL:0002095 CT alkaline phosphatase, acid phosphatase, non-specific esterase, inhibin, calretinin, melan-A, cholesterol esters McKay et al 1961, Boss et al 1965, Mills et al 2020, Jungbluth et al 1998, Pelkey et al 1998
mural granulosa cell CT doi:10.1093/oxfordjournals.humrep.a136365
primary oocyte CL:0000654 CT doi:10.1093/oxfordjournals.humrep.a136365
secondary oocyte CL:0000655 CT doi:10.1093/oxfordjournals.humrep.a136365
columnar ovarian surface epithelial columnar cell CT calretinin, mesothelin Mills et al 2020, Reeves et al 1971, Hummitzsch et al 2013, Blaustein et al 1979, McKay et al 1961
flattened cuboidal ovarian surface epithelial cell CT oviduct-specific glycoprotein-1, E-cadherin Mills et al 2020, Reeves et al 1971, Hummitzsch et al 2013, Blaustein et al 1979, McKay et al 1961
oviduct-specific glycoprotein-1 Protein
mesothelin Protein
E-cadherin Protein
doi:10.1093/oxfordjournals.humrep.a136365 PMID: 3558758 Reference
McKay et al 1961 McKay, D., Pinkerton, J., Hertig, A. & Danziger, S. (1961). The Adult Human Ovary: A Histochemical Study. Obstetrics & Gynecology, 18(1), 13-39. Reference