Releases: dcppc/crosscut-metadata
Releases · dcppc/crosscut-metadata
v0.6 release of crosscut metadata model
- v0.6 release of KC7 crosscut metadata model instance for the following datasets:
- GTEx v7 public RNA-Seq and WGS metadata from the GTEx Portal and dbGaP study phs000424.v7.p2
- TOPMed public metadata for dbGaP studies phs001024.v3.p1, phs000951.v2.p2, and phs000179.v5.p2
- Alliance/AGR gene location and ortholog data for mouse and rat reference genomes
- Updated AGR ER diagram to reflect new AGR-specific DATS schema extensions
v0.5 release of crosscut metadata model
- v0.5 release of KC7 crosscut metadata model instance for the following datasets:
- GTEx v7 public RNA-Seq metadata from the GTEx Portal and dbGaP study phs000424.v7.p2 (single file)
- TOPMed public metadata for dbGaP studies phs001024.v3.p1, phs000951.v2.p2, and phs000179.v5.p2
- Combined two public GTEx JSON-LD files into a single file with subjects linked to StudyGroup
and sample-level WGS and RNA-Seq CRAM files represented as individual Datasets. - Added scripts to emulate SPARQL queries directly with RDFLib API calls, to address
SPARQL query performance issues (at least in RDFLib.) - Added example RDFLib script to parse GTEx JSON-LD document and generate tab-delimited summary file.
- Identified and added workarounds for limitations of the DATS validator.
- Updated ER diagram with v0.5 StudyGroup and file/sample-level Dataset encoding.
v0.4 release of crosscut metadata model
- v0.4 release of KC7 crosscut metadata model instance for the following datasets:
- GTEx v7 public RNA-Seq metadata from the GTEx Portal
- GTEx v7 public metadata from dbGaP study phs000424.v7.p2
- TOPMed public metadata for dbGaP studies phs001024.v3.p1, phs000951.v2.p2, and phs000179.v5.p2
- Added context files for all DATS types that did not already have them.
- Fixed broken JSON-LD context file references (issue #16, issue #17).
- Added OBO Foundry context files (issue #23).
- Added missing OBO IRIs for several terms.
- Added bearerOfDisease for hypertension status, using new diseaseStatus property.
- Moved GTEx subject/sample Materials from top-level Dataset to 2nd-level Datasets for consistency with TOPMed DATS JSON encoding and published ER diagrams (issue #26).
- Started using JSON-LD id references to reduce redundancy in the GTEx instance.
- Added a set of example Python RDFLib SPARQL queries showing how to extract study, subject, and sample metadata from the DATS instances (issue #27.)
- Assigned a random uuid to any DATS entity without an existing JSON-LD identifier.
- Added Dataset level dimensions for each of the variables declared in a study.
- Converted subject and sample Material extraProperties to characteristics (issue #24, issue #29).
- Modified conversion code to handle the 8 saliva samples in TOPMed.
- Added proposed pattern for representing the links between individual samples and sequence (or other types of) data files.
- Added --max_output_samples option to bin/gtex_v7_to_dats.py for testing purposes.
v0.3 release of crosscut metadata model
- v0.3 release of KC7 crosscut metadata model instance for the following datasets:
- GTEx v7 public RNA-Seq metadata
- TOPMed public metadata for a single study, phs000946.v3
- MGI C57BL/6J reference genome, annotation, and human orthologs
- v0.3 release of KC7 crosscut metadata model for all of the above, plus:
- TOPMed/dbGaP access-restricted metadata for a single study, phs000946.v3
- Incorporated KC2-provided GUIDs for public GTEx v7 files
- Added new script and preliminary DATS encoding for MGI mouse reference genome, including mouse genes and links to HomoloGene-derived human homologs.
- Fixed broken alternateIdentifiers in TOPMed instance.
v0.2 release of crosscut metadata model
- v0.2 release of KC7 crosscut metadata model for GTEx v7 public RNA-Seq metadata and TOPMed public metadata for a single sample study, phs000946.v3
- RNA extract
Material
s are now linked to the top-levelDataset
via theisAbout
property. - New
DataAnalysis
objects are now linked toDataset
s via theproducedBy
property. - All applicable object types now have
@id
and@context
properties. - JSON-LD contexts have only been defined for a subset of the DATS object types.
- See https://github.com/datatagsuite/schema and https://github.com/datatagsuite/context
- GTEx subject
Material
s now usecharacteristics
, notextraProperties
, to store gender, age range, etc. - The corresponding CV definitions, which used to be in
characteristics
, have been removed, as they are arguably part of the crosscut metadata model, but not the metadata instance per se. - The
extraProperties
property is now used to store the original, un-harmonized metadata. - The public DATS JSON for TOPMed study phs000946.v3 was constructed by picking values from the publicly-available dbGaP variable summaries to produce a single "dummy" subject/sample entry.
- For numeric values the median value was picked.
- For enumerated types the most common value was picked, with ties broken alphanumerically.
- The Python script that produces the TOPMed DATS JSON can also be run on the access restricted/protected metadata (not included in this distribution), in which case it will produce DATS JSON for the actual metadata, and not dummy values based on the variable summaries.
v0.1 KC7 internal-only release of crosscut metadata model
- Initial KC7 internal-only release of KC7 crosscut metadata model for GTEx v7 public RNA-Seq metadata.
- DATS JSON files validate against https://github.com/biocaddie/WG3-MetadataSpecifications master branch
- DATS JSON files do NOT validate against current (v2.2) DATS release
- This is a preview release for internal consumption only and it is subject to change.