update: read me
mrJeppard authored Jul 3, 2019
1 parent baaaaa1 commit 4ed36b3
Showing 1 changed file with 7 additions and 8 deletions.
15 changes: 7 additions & 8 deletions README.md
@@ -1,7 +1,7 @@
# cluster-solution-format
This repo contains a draft specification of a [json schema](https://github.com/Stuartlab-UCSC/cluster-solution-format/blob/master/cluster_solution_schema.json) for cluster solutions created from single cell mRNA seq data.
A cluster solution is loosely defined as the output of any clustering algorithm, e.g. louvain, k-means, or more domain
-specific algorithms such as SC3.
+specific algorithms such as [SC3](https://www.nature.com/articles/nmeth.4236).

The schema mirrors this data model:

@@ -10,25 +10,24 @@ The schema mirrors this data model:

Along with metadata about the cluster solution, 3 major types of information can detail each cluster.

-* cell assignments: The categorical assignment of cells to clusters, i.e. the output of a cluster algorithm.
-* markers: Gene markers for identifying individual clusters.
+* cell assignments: The categorical (or probabilistic) assignment of cells to clusters, i.e. the output of a cluster algorithm.
+* markers: Arbitrary metrics for genes as markers that identify individual clusters.
* cell type annotation: A label that maps a cluster to a cell type.

-Cell assignments is the only type of information required for a valid cluster solution json object.
+The cell assignments field is the only information required for a valid cluster solution json object.

#### Some thoughts on other options for formats:

Three classes of formats that spring to mind are a json specification as represented here, .csv or .tsv files, and loom
and other matrix annotation files.

-The advantage of tab or comma separated value files seems to be their ease and readiness to be pulled into analysis
-pipelines. Where these formats fail is the ability to track important metadata about the cluster solution, e.g. the
-algorithm that produced the cluster solution.
+The advantage of tab- or comma-separated value files seems to be their simplicity: they are easy to pull into analysis
+pipelines and are language-agnostic. Where these formats fall short is in tracking important metadata about the cluster solution, e.g. the algorithm that produced the cluster solution.

Other matrix file formats such as loom or more language specific formats such as Seurat and Scanpy objects are also
easily integrated into analysis. While these formats are more general and can carry metadata for cluster solutions they
are currently lacking standards to document and validate their schemas. Understanding the schema of Loom, Scanpy, or
-Seurat requires reading objects into memory and manually investigating their annotation fields. These types of objects
+Seurat often requires reading objects into memory and manually investigating their annotation fields. These types of objects
are very good at "it just works" analysis pipelines, but problems with interoperability commonly arise due to a lack of
standardization in their structure.
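
The data model and the TSV trade-off described in the README can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: every field name used here (`method`, `clusters`, `name`, `cells`, `markers`, `cell_type`) is a placeholder and is not taken from `cluster_solution_schema.json`, and the sample cells, marker gene, and score are made up. The sketch groups a two-column TSV of cell assignments into a JSON-style object, attaches the algorithm name as metadata, and checks the one rule the README states: each cluster needs cell assignments, while markers and a cell type annotation are optional.

```python
import csv
import io
import json

# NOTE: all field names below are hypothetical placeholders for illustration;
# the actual names are defined in cluster_solution_schema.json.


def solution_from_tsv(tsv_text, method):
    """Group (cell, cluster) rows of a two-column TSV into a cluster solution dict."""
    cells_by_cluster = {}
    for cell, cluster in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        cells_by_cluster.setdefault(cluster, []).append(cell)
    return {
        "method": method,  # metadata that a bare TSV has no place for
        "clusters": [
            {"name": name, "cells": cells}
            for name, cells in cells_by_cluster.items()
        ],
    }


def problems(solution):
    """Return a list of human-readable problems; an empty list means none found."""
    clusters = solution.get("clusters")
    if not clusters:
        return ["solution has no clusters"]
    found = []
    for i, cluster in enumerate(clusters):
        # Cell assignments are the only required information per cluster.
        if not cluster.get("cells"):
            found.append(f"cluster {i} has no cell assignments")
    return found


# Made-up example data: three cells assigned to two clusters.
tsv = "cell_1\tA\ncell_2\tA\ncell_3\tB\n"
solution = solution_from_tsv(tsv, method="louvain")

# Optional extras on a cluster: marker metrics and a cell type label.
solution["clusters"][0]["markers"] = [{"gene": "CD3E", "score": 2.5}]
solution["clusters"][0]["cell_type"] = "T cell"

print(json.dumps(solution, indent=2))
print(problems(solution))
```

For real use, validating against the published schema with a JSON Schema validator would replace the hand-written `problems` check.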
