update: read me

Stuartlab-UCSC · Jul 3, 2019 · 4ed36b3
1 parent baaaaa1
Showing 1 changed file with 7 additions and 8 deletions.
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # cluster-solution-format
 This repo contains a draft specification of a [json schema](https://github.com/Stuartlab-UCSC/cluster-solution-format/blob/master/cluster_solution_schema.json) for cluster solutions created from single cell mRNA seq data. 
 A cluster solution is loosely defined as the output of any clustering algorithm, e.g. louvain, k-means, or more domain 
-specific algorithms such as SC3.  
+specific algorithms such as [SC3](https://www.nature.com/articles/nmeth.4236).  
 
 The schema mirrors this data model:
 
@@ -10,25 +10,24 @@ The schema mirrors this data model:
 
 Along with metadata about the cluster solution, 3 major types of information can detail each cluster.
 
-  * cell assignments: The categorical assignment of cells to clusters, i.e. the output of a cluster algorithm.
-  *	markers: Gene markers for identifying individual clusters.
+  * cell assignments: The categorical (or probabilistic) assignment of cells to clusters, i.e. the output of a cluster algorithm.
+  *	markers: Arbitrary metrics for genes as markers that identify individual clusters.
   *	cell type annotation: A label that maps a cluster to a cell type.  
 
-Cell assignments is the only type of information required for a valid cluster solution json object.
+The cell assignments field is the only information required for a valid cluster solution json object.
 
 #### Some thoughts on other options for formats:
 
 Three classes of formats that spring to mind are a json specification as represented here, .csv or .tsv files, and loom 
 and other matrix annotation files.
 
-The advantage of tab or comma separated value files seems to be their ease and readiness to be pulled into analysis 
-pipelines. Where these formats fail is the ability to track important metadata about the cluster solution, e.g. the 
-algorithm that produced the cluster solution.
+The advantage of tab- or comma-separated value files seems to be their simplicity: they are easy to pull into analysis 
+pipelines and are language-agnostic. Where these formats fail is in tracking important metadata about the cluster solution, e.g. the algorithm that produced the cluster solution.
 
 Other matrix file formats such as loom or more language specific formats such as Seurat and Scanpy objects are also 
 easily integrated into analysis. While these formats are more general and can carry metadata for cluster solutions they 
 are currently lacking standards to document and validate their schemas. Understanding the schema of Loom, Scanpy, or 
-Suerat requires reading objects into memory and manually investigating their annotation fields. These types of objects 
+Seurat often requires reading objects into memory and manually investigating their annotation fields. These types of objects 
 are very good at "it just works" analysis pipelines, but problems with interoperability commonly arise due to a lack of 
 standardization in their structure.
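
The README's point about metadata in tab-separated files versus json can be illustrated with a minimal Python sketch. The keys used here (`method`, `clusters`, `name`, `cells`) are hypothetical and are not taken from `cluster_solution_schema.json`; the helpers `to_tsv` and `from_tsv` are likewise invented for this illustration.

```python
import csv
import io
import json

# Hypothetical cluster solution. These keys are illustrative only and are
# NOT taken from cluster_solution_schema.json.
solution = {
    "method": "louvain",
    "clusters": [
        {"name": "0", "cells": ["cell_a", "cell_b"]},
        {"name": "1", "cells": ["cell_c"]},
    ],
}


def to_tsv(sol):
    """Flatten cell assignments to TSV text; solution-level metadata is dropped."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["cell", "cluster"])
    for cluster in sol["clusters"]:
        for cell in cluster["cells"]:
            writer.writerow([cell, cluster["name"]])
    return buf.getvalue()


def from_tsv(text):
    """Rebuild the cell assignments from TSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    clusters = {}
    for row in reader:
        clusters.setdefault(row["cluster"], []).append(row["cell"])
    return {"clusters": [{"name": n, "cells": c} for n, c in clusters.items()]}


tsv_text = to_tsv(solution)
round_tripped_tsv = from_tsv(tsv_text)
round_tripped_json = json.loads(json.dumps(solution))
```

In this sketch the json round trip returns the whole object, including the `method` entry, while the two-column TSV round trip recovers only the cell assignments. A TSV file could carry the algorithm name in a comment line or a side file, but that would be a convention outside the format itself.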