From 1bcefeee8363c5b6916a76a4803a11dd5a002f93 Mon Sep 17 00:00:00 2001
From: Brian Raymor
Date: Tue, 28 Jan 2025 11:04:29 -0800
Subject: [PATCH 1/2] Updates for fragments

---
 schema/drafts/5.3.0.md | 479 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 477 insertions(+), 2 deletions(-)

diff --git a/schema/drafts/5.3.0.md b/schema/drafts/5.3.0.md
index 33044139..ed744136 100644
--- a/schema/drafts/5.3.0.md
+++ b/schema/drafts/5.3.0.md
@@ -412,7 +412,7 @@ Curators MUST annotate the following columns in the `obs` dataframe:
 the most accurate descendant of "EFO:0010183" for single cell library construction excluding "EFO:0010961" for Visium Spatial Gene Expression while allowing its descendants
 If assay_ontology_term_id is either a descendant of "EFO:0010961" for Visium Spatial Gene Expression or "EFO:0030062" for Slide-seqV2 then all observations MUST contain the same value.

- If assay_ontology_term_id is either "EFO:0010891" for scATAC-seq or a descendant, there are additional requirements for separate fragment file assets documented in scATAC-seq assets.

+ If assay_ontology_term_id is either "EFO:0010891" for scATAC-seq or a descendant, there are additional requirements for separate fragments file assets documented in scATAC-seq assets.

An assay based on 10X Genomics products SHOULD be the most accurate descendant of "EFO:0008995" for 10x technology. An assay based on SMART (Switching Mechanism at the 5' end of the RNA Template) or SMARTer technology SHOULD either be "EFO:0010184" for Smart-like or preferably its most accurate descendant.


Recommended values for specific assays:

@@ -2360,7 +2360,482 @@ When a dataset is uploaded, CELLxGENE Discover MUST automatically add the `schem
 paired assay. `obs['assay_ontology_term_id']` is a descendant of both "EFO:0010891" for scATAC-seq and "EFO:0008913" for single-cell RNA sequencing
-unpaired assay. `obs['assay_ontology_term_id']` is "EFO:0010891" for scATAC-seq or a descendant but is not a descendant of "EFO:0008913" for single-cell RNA sequencing
+unpaired assay. `obs['assay_ontology_term_id']` is "EFO:0010891" for scATAC-seq or a descendant and is not a descendant of "EFO:0008913" for single-cell RNA sequencing
+
+### Requirements
+
+A Dataset MUST meet all of the following requirements to be eligible for scATAC-seq assets:
+* obs['assay_ontology_term_id'] values MUST all be either paired assays or unpaired assays
+* obs['is_primary_data'] values MUST be all True
+* var['feature_reference'] values MUST include one of "NCBITaxon:9606" for Homo sapiens or "NCBITaxon:10090" for Mus musculus, but not both. The value determines the required Chromosome Table.
+
+If the obs['assay_ontology_term_id'] values are all paired assays then the Dataset MAY have a fragments file asset.
+
+If the obs['assay_ontology_term_id'] values are all unpaired assays then the Dataset MUST have a fragments file asset.
+
+## scATAC-seq Asset: Submitted Fragment File
+
+This MUST be a gzipped tab-separated values (TSV) file.
+
+The curator MUST annotate the following header-less columns. Additional columns and header lines beginning with `#` MUST NOT be included.
+
+### first column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>
+        str. This MUST be the reference genome chromosome the fragment is located on.<br><br>If the values of var['feature_reference'] in the associated Dataset include "NCBITaxon:9606" for Homo sapiens then the first column value MUST be a value from the Chromosome column in the Human Chromosome Table.<br><br>
+        If the values of var['feature_reference'] in the associated Dataset include "NCBITaxon:10090" for Mus musculus then the first column value MUST be a value from the Chromosome columnin the Mouse Chromosome Table.
+      </td>
+    </tr>
+</tbody></table>
+
+### second column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>int. This MUST be the 0-based start coordinate of the fragment.
+      </td>
+    </tr>
+</tbody></table>
+
+### third column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>int. This MUST be the 0-based end coordinate of the fragment. The end position is exclusive, representing the position immediately following the fragment interval. The value MUST be greater than the start coordinate specified in the second column and less than or equal to the Length of the Chromosome specified in the first column, as specified in the appropriate Chromosome Table.
+      </td>
+    </tr>
+</tbody></table>
+
+### fourth column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>str. This MUST be an observation identifier from the obs index of the associated Dataset. Every obs index value of the associated Dataset MUST appear at least once in this column.
+      </td>
+    </tr>
+</tbody></table>
+
+### fifth column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>int. This MUST be the total number of read pairs associated with this fragment. The value MUST be 1 or greater.
+      </td>
+    </tr>
+</tbody></table>
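Reviewer note: the five column rules above can be checked mechanically. The following is a minimal Python sketch, not part of the patch and not the CELLxGENE validator. The function names, the `HUMAN_SUBSET` mapping (a two-entry excerpt of the Human Chromosome Table), and the non-negative start check are assumptions made for illustration.

```python
import gzip
import io

# Illustrative two-entry excerpt of the Human (GRCh38.p14) Chromosome Table.
HUMAN_SUBSET = {"chr1": 248956422, "chrM": 16569}


def validate_fragment_row(fields, chromosome_lengths, obs_index):
    """Return a list of problems for one row; an empty list means the row passes."""
    if len(fields) != 5:
        return [f"expected 5 columns, found {len(fields)}"]
    chrom, start_text, end_text, obs_id, count_text = fields
    try:
        # int() is more lenient than a strict parser (it accepts "+5" or " 5").
        start, end, count = int(start_text), int(end_text), int(count_text)
    except ValueError:
        return ["columns 2, 3 and 5 must be integers"]
    problems = []
    length = chromosome_lengths.get(chrom)
    if length is None:
        problems.append(f"chromosome {chrom!r} is not in the Chromosome Table")
    elif end > length:
        problems.append(f"end {end} exceeds length {length} of {chrom}")
    if start < 0:  # assumption: implied by "0-based", not stated explicitly
        problems.append(f"start {start} must not be negative")
    if end <= start:
        problems.append(f"end {end} must be greater than start {start}")
    if obs_id not in obs_index:
        problems.append(f"{obs_id!r} is not in the obs index")
    if count < 1:
        problems.append(f"read pair count {count} must be 1 or greater")
    return problems


def validate_fragments(path_or_file, chromosome_lengths, obs_index):
    """Validate a gzipped, header-less TSV fragments file and return all problems."""
    obs_index = set(obs_index)
    seen = set()
    problems = []
    with gzip.open(path_or_file, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith("#"):
                problems.append(f"line {line_number}: header lines are not allowed")
                continue
            fields = line.rstrip("\r\n").split("\t")
            for problem in validate_fragment_row(fields, chromosome_lengths, obs_index):
                problems.append(f"line {line_number}: {problem}")
            if len(fields) == 5:
                seen.add(fields[3])
    # Every obs index value must appear at least once in the fourth column.
    missing = sorted(obs_index - seen)
    if missing:
        problems.append(f"obs index values with no fragments: {missing}")
    return problems


# Usage with an in-memory file: "cell_c" never appears and chrM is only 16569 long.
demo = gzip.compress(b"chr1\t0\t100\tcell_a\t2\nchrM\t16000\t17000\tcell_b\t1\n")
for problem in validate_fragments(io.BytesIO(demo), HUMAN_SUBSET, ["cell_a", "cell_b", "cell_c"]):
    print(problem)
```

Collecting problems in a list rather than raising on the first one is a design choice for this sketch, so a curator would see every violation in a single pass.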
+
+## scATAC-seq Asset: Processed Fragments File
+
+From every submitted fragments file asset, CELLxGENE Discover MUST generate {dataset_version_id}-fragments.tsv.gz, a tab-separated values (TSV) file position-sorted and compressed by bgzip.
+
+## scATAC-seq Asset: Fragments File index
+
+From every processed fragments file asset, CELLxGENE Discover MUST generate {dataset_version_id}-fragments.tsv.gz.tbi, a tabix index of the fragment intervals from the fragments file.
+
+## Chromosome Tables
+
+Chromosome Tables are determined by the reference assembly for the gene annotation versions pinned in this version of the schema. Only chromosomes or scaffolds that have at least one gene feature present are included.
+
+### Human (GRCh38.p14)
+
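+<table>
+  <thead>
+    <tr><th>Chromosome</th><th>Length</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>chr1</td><td>248956422</td></tr>
+    <tr><td>chr2</td><td>242193529</td></tr>
+    <tr><td>chr3</td><td>198295559</td></tr>
+    <tr><td>chr4</td><td>190214555</td></tr>
+    <tr><td>chr5</td><td>181538259</td></tr>
+    <tr><td>chr6</td><td>170805979</td></tr>
+    <tr><td>chr7</td><td>159345973</td></tr>
+    <tr><td>chr8</td><td>145138636</td></tr>
+    <tr><td>chr9</td><td>138394717</td></tr>
+    <tr><td>chr10</td><td>133797422</td></tr>
+    <tr><td>chr11</td><td>135086622</td></tr>
+    <tr><td>chr12</td><td>133275309</td></tr>
+    <tr><td>chr13</td><td>114364328</td></tr>
+    <tr><td>chr14</td><td>107043718</td></tr>
+    <tr><td>chr15</td><td>101991189</td></tr>
+    <tr><td>chr16</td><td>90338345</td></tr>
+    <tr><td>chr17</td><td>83257441</td></tr>
+    <tr><td>chr18</td><td>80373285</td></tr>
+    <tr><td>chr19</td><td>58617616</td></tr>
+    <tr><td>chr20</td><td>64444167</td></tr>
+    <tr><td>chr21</td><td>46709983</td></tr>
+    <tr><td>chr22</td><td>50818468</td></tr>
+    <tr><td>chrX</td><td>156040895</td></tr>
+    <tr><td>chrY</td><td>57227415</td></tr>
+    <tr><td>chrM</td><td>16569</td></tr>
+    <tr><td>GL000009.2</td><td>201709</td></tr>
+    <tr><td>GL000194.1</td><td>191469</td></tr>
+    <tr><td>GL000195.1</td><td>182896</td></tr>
+    <tr><td>GL000205.2</td><td>185591</td></tr>
+    <tr><td>GL000213.1</td><td>164239</td></tr>
+    <tr><td>GL000216.2</td><td>176608</td></tr>
+    <tr><td>GL000218.1</td><td>161147</td></tr>
+    <tr><td>GL000219.1</td><td>179198</td></tr>
+    <tr><td>GL000220.1</td><td>161802</td></tr>
+    <tr><td>GL000225.1</td><td>211173</td></tr>
+    <tr><td>KI270442.1</td><td>392061</td></tr>
+    <tr><td>KI270711.1</td><td>42210</td></tr>
+    <tr><td>KI270713.1</td><td>40745</td></tr>
+    <tr><td>KI270721.1</td><td>100316</td></tr>
+    <tr><td>KI270726.1</td><td>43739</td></tr>
+    <tr><td>KI270727.1</td><td>448248</td></tr>
+    <tr><td>KI270728.1</td><td>1872759</td></tr>
+    <tr><td>KI270731.1</td><td>150754</td></tr>
+    <tr><td>KI270733.1</td><td>179772</td></tr>
+    <tr><td>KI270734.1</td><td>165050</td></tr>
+    <tr><td>KI270744.1</td><td>168472</td></tr>
+    <tr><td>KI270750.1</td><td>148850</td></tr>
+  </tbody>
+</table>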
+
+### Mouse (GRCm39)
+
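+<table>
+  <thead>
+    <tr><th>Chromosome</th><th>Length</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>chr1</td><td>195154279</td></tr>
+    <tr><td>chr2</td><td>181755017</td></tr>
+    <tr><td>chr3</td><td>159745316</td></tr>
+    <tr><td>chr4</td><td>156860686</td></tr>
+    <tr><td>chr5</td><td>151758149</td></tr>
+    <tr><td>chr6</td><td>149588044</td></tr>
+    <tr><td>chr7</td><td>144995196</td></tr>
+    <tr><td>chr8</td><td>130127694</td></tr>
+    <tr><td>chr9</td><td>124359700</td></tr>
+    <tr><td>chr10</td><td>130530862</td></tr>
+    <tr><td>chr11</td><td>121973369</td></tr>
+    <tr><td>chr12</td><td>120092757</td></tr>
+    <tr><td>chr13</td><td>120883175</td></tr>
+    <tr><td>chr14</td><td>125139656</td></tr>
+    <tr><td>chr15</td><td>104073951</td></tr>
+    <tr><td>chr16</td><td>98008968</td></tr>
+    <tr><td>chr17</td><td>95294699</td></tr>
+    <tr><td>chr18</td><td>90720763</td></tr>
+    <tr><td>chr19</td><td>61420004</td></tr>
+    <tr><td>chrX</td><td>169476592</td></tr>
+    <tr><td>chrY</td><td>91455967</td></tr>
+    <tr><td>chrM</td><td>16299</td></tr>
+    <tr><td>GL456210.1</td><td>169725</td></tr>
+    <tr><td>GL456211.1</td><td>241735</td></tr>
+    <tr><td>GL456212.1</td><td>153618</td></tr>
+    <tr><td>GL456219.1</td><td>175968</td></tr>
+    <tr><td>GL456221.1</td><td>206961</td></tr>
+    <tr><td>GL456239.1</td><td>40056</td></tr>
+    <tr><td>GL456354.1</td><td>195993</td></tr>
+    <tr><td>GL456372.1</td><td>28664</td></tr>
+    <tr><td>GL456381.1</td><td>25871</td></tr>
+    <tr><td>GL456385.1</td><td>35240</td></tr>
+    <tr><td>JH584295.1</td><td>1976</td></tr>
+    <tr><td>JH584296.1</td><td>199368</td></tr>
+    <tr><td>JH584297.1</td><td>205776</td></tr>
+    <tr><td>JH584298.1</td><td>184189</td></tr>
+    <tr><td>JH584299.1</td><td>953012</td></tr>
+    <tr><td>JH584303.1</td><td>158099</td></tr>
+    <tr><td>JH584304.1</td><td>114452</td></tr>
+  </tbody>
+</table>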
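Reviewer note: the Requirements section says the `var['feature_reference']` values must include exactly one of the two organisms, and that this value determines the Chromosome Table. A minimal Python sketch of that selection follows. The function name and the `TABLE_BY_ORGANISM` mapping are hypothetical and used only for illustration; the table names are taken from the headings in this patch.

```python
HUMAN = "NCBITaxon:9606"
MOUSE = "NCBITaxon:10090"

# Maps each supported organism to the Chromosome Table heading used in this patch.
TABLE_BY_ORGANISM = {HUMAN: "Human (GRCh38.p14)", MOUSE: "Mouse (GRCm39)"}


def select_chromosome_table(feature_reference_values):
    """Return the Chromosome Table name implied by var['feature_reference'] values.

    Other taxa may be present in the values; only human and mouse are considered.
    """
    present = set(feature_reference_values) & set(TABLE_BY_ORGANISM)
    if len(present) != 1:
        raise ValueError(
            "feature_reference must include exactly one of "
            f"{HUMAN} or {MOUSE}, found {sorted(present)}"
        )
    return TABLE_BY_ORGANISM[present.pop()]


# Usage: a mouse dataset that also carries features from another taxon.
print(select_chromosome_table([MOUSE, MOUSE, "NCBITaxon:2697049"]))
```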
---

From a9a25a4d6cf040bed52cfabd52dd710bb987870a Mon Sep 17 00:00:00 2001
From: Brian Raymor
Date: Tue, 28 Jan 2025 11:09:41 -0800
Subject: [PATCH 2/2] Added table links

---
 schema/drafts/5.3.0.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/schema/drafts/5.3.0.md b/schema/drafts/5.3.0.md
index ed744136..d8a35c36 100644
--- a/schema/drafts/5.3.0.md
+++ b/schema/drafts/5.3.0.md
@@ -2389,8 +2389,8 @@ The curator MUST annotate the following header-less columns. Additional columns
       <th>Value</th>
-        str. This MUST be the reference genome chromosome the fragment is located on.<br><br>If the values of var['feature_reference'] in the associated Dataset include "NCBITaxon:9606" for Homo sapiens then the first column value MUST be a value from the Chromosome column in the Human Chromosome Table.<br><br>
-        If the values of var['feature_reference'] in the associated Dataset include "NCBITaxon:10090" for Mus musculus then the first column value MUST be a value from the Chromosome columnin the Mouse Chromosome Table.
+        str. This MUST be the reference genome chromosome the fragment is located on.<br><br>If the values of var['feature_reference'] in the associated Dataset include "NCBITaxon:9606" for Homo sapiens then the first column value MUST be a value from the Chromosome column in the Human Chromosome Table.<br><br>
+        If the values of var['feature_reference'] in the associated Dataset include "NCBITaxon:10090" for Mus musculus then the first column value MUST be a value from the Chromosome column in the Mouse Chromosome Table.