Genes Manifests Archive for RNA Analysis
This repository holds lists of genes for use in scRNA-seq analysis. They are stored in machine-readable CSV.TSV so they can be used by software, directly from this repository, optionally with commit tags to ensure reproducibility.
Specifically, the files called genes/
species/lists/
list/names/
namespace.tsv
contain, in the first column
(called name
), a list of names of genes of the specified namespace and species. For example, the file
genes/human/lists/transcription_factor/names/GeneSymbol.tsv
contains a list of symbols of human transcription factor genes. You can just access these files directly from github
(using explicit commit tags if you want reproducible results).
The files also contain a additional columns which can help in tracking why the gene name is included in the list, by referring to source files of the list and/or the gene namespaces. These columns and files are described below, but you do not have to deal with this to just use the list.
In general, this repository is meant to be used as a convenient initial starting point for analysis, rather than serve as a "source of truth". It is designed to make it easy to apply lists to to arbitrary data sets regardless of the version of the genes names was used, so the lists include retired/deprecated/aliased/renamed genes. We try to construct the lists such that if a name does not appear in a list, you can be fairly certain that what you are looking up doesn't belong in it (according to any of our sources). If the name does appear in the list, then probably what you are looking up belongs in it, but there's no guarantee.
The analyst is responsible for exercising judgment and common sense when using these lists.
The repository currently holds the following:
Human gene lists:
Namespaces: EnsemblGene, EnsemblProtein, EnsemblTranscript, GeneSymbol, HGNC, RefSeq, UCSC.
Lists:
- Transcription factors: EnsemblGene, EnsemblProtein, EnsemblTranscript, GeneSymbol, HGNC, RefSeq, UCSC.
Mouse gene lists:
Namespaces: EnsemblGene, EnsemblProtein, EnsemblTranscript, GeneSymbol, MGI, RefSeq, UCSC.
Lists:
-
Transcription factors:
EnsemblGene, EnsemblProtein, EnsemblTranscript, GeneSymbol, HGNC, RefSeq, UCSC.
The lists here track only "known" genes and explicitly ignore clones and fragments (e.g. "AC005041.3"). That is, we
assume that to be note-worthy enough to be included in a list (e.g., transcription factors), the gene should be
sufficiently known to be listed in the genes databases, as opposed to being a numbered fragment in some assembly. This
assumption holds well for the well-studied genomes (e.g. human
and mouse
) we are focusing on. It will not hold for
less-studied genomes.
There are too many ways to uniquely identify a gene. We use
Ensembl as our "source of truth" for "what is a gene".
That is, we assume that "different" EnsemblGene identifiers refer to different "genes" (and allow for some EnsemblGene
identifiers to be mapped to others, e.g. when they are retired). For each namespace other than EnsemblGene, we keep
track of the (possibly one-to-many) mapping between its names and EnsemblGene. For example, each GeneSymbol
may map to
one of several EnsemblGene
due to changes in our understanding over time. When other namespace identifiers don't map
to any EnsemblGene identifier, we still accept them, and assume all such identifiers that map to each other refer to the
same "gene" (e.g., HGNC:7424
, MT-CSB1
, MTCSB1
and CSB-I
are all considered to be equivalent, even though there's
no EnsemblGene for it; instead we assign these genes a unique unstable non-Ensembl identifier, e.g.
ENS!000000946
).
The cost of Ensembl identifiers stability is that they aren't human readable. Data sets therefore often use GeneSymbol identifiers instead. We take this mapping from HGNC for human genes and from MGI for mouse genes, which make a valiant effort to put some order into this constantly evolving namespaces.
In general we recommend that for the most reliable results, data sets use EnsemblGene identifiers to uniquely identify genes and look them up in lists, and only use GeneSymbol identifiers for display in graphs and figures for readability.
We also map other stable gene identifiers to EnsemblGene identifiers, for example HGNC and UCSC stable identifiers, based on the mapping specified by the Ensembl database.
For convenience, we also keep track of the mapping from transcript and protein identifiers (EnsemblTranscript
and
EnsemblProtein
namespaces), as well as mapping from RefSeq sequence
identifiers to Ensembl gene identifiers. However, this repository is only concerned with genes; the semantics of a
list of, say, EnsemblProtein identifiers is still "does the gene associated with the protein belong in a list" rather
than "does the specific protein belong in the list". We basically map whatever-it-is to (one or more possible)
EnsemblGene identifier(s) and look that up in the "master" list containing EnsemblGene identifiers.
We provide a version of each list for each of the supported namespaces, based on the mapping we collected between them and EnsemblGene identifiers. To compute each list, we try to isolate the specific EnsemblGene(s) identified by each entry. In general when a list provides multiple names from multiple namespaces for some entry, we intersect the sets of possible EnsemblGenes for these names. If the result is empty, then something is very wrong. If it contains more than one possible EnsemblGene, it is noteworthy, but possible; we include in the list all the EnsemblGenes that map to all the identifiers given for the list entry.
This repository is still in its "alpha" phase - anything may change without notice. This will be updated to "beta" when the structure stabilizes and we have a few useful lists in place for human and mouse. It will be updated to "production" when the structure will be expected never to change again without providing some backward compatibility features.
The repository content is expected to always keep evolving due to updates to the relevant gene namespaces (which are a constantly moving target), and by the addition of new species and/or gene lists, or updates to the existing lists.
Feedback and contributions are welcome!
The data is stored in the following tree:
- All data is under the
genes
directory.- A sub-directory exists for each organism (e.g.,
human
,mouse
).- A
namespaces
sub-directory contains a description of the gene namespaces used.- A
sources
sub-directory contains the source files we collected the gene namespaces from. - A
names
sub-directory contains the actual gene names.
- A
- A
lists
sub-directory` contains the actual gene lists.- Each list is a sub-directory.
- A
sources
sub-directory contains the sources of the gene list. - A
names
sub-directory contains the actual gene names.
- A
- Each list is a sub-directory.
- A
- A sub-directory exists for each organism (e.g.,
Additional sub-directories may be added in the future.
There's no API provided here. The idea is that you can fetch the (simple CSV/TSV) data file(s) you need directly through
the github URLs using wget
, curl
or any other method for fetching HTTP data. That said, it is possible to provide an
API for fetching specific data using these URLs (see for example the Gmara
module of
Metacells.jl.
Since this is a github repository, you can always refer to a specific commit of this repository in the URLs to get the same data. This is useful anywhere reproducibility is important (e.g. vignettes and published results).
Each list is a sub-directory under the lists
sub-directory, holding the following:
-
README.md
contains a free text in markdown format that describes the semantics of the list. -
The
names
sub-directory contains, for each namespace, namespace.tsv
with the following columns:-
name
contains the identifier of the gene in the namespace, in alphabetical order. For example,SOX4
is listed in the human transcription factors list in the GeneSymbol namespace. -
source
contains the source of the mapping from thename
to the EnsemblGene namespace. This is useful for debugging. For exampleHGNC.Current#37136[symbol] : SOX4 => HGNC.Current#37136[symbol -> hgnc_id] : HGNC:11200 => Ensembl.HGNC#20684[HGNC ID -> Gene stable ID] : ENSG00000124766
indicates that theHGNC.Current
source file, in line37136
, contains a columnsymbol
with a valueSOX4
(which is why this GeneSymbol exists). This was mapped toHGNC:11200
by the association between thehgnc_id
and thesymbol
columns of the same line. Finally, the HGNC identifier was mapped toENSG00000124766
by association between theHGNC ID
andGene stable ID
columns in line20684
of the source fileEnsembl.HGNC
. -
ensembl_gene
contains the identifier of the gene in the EnsemblGene namespace, which was included in the list. For example,ENSG00000124766
forSOX4
. If several such identifiers are possible, the file will contain multiple lines. -
ensembl_source
contains the source of the inclusion of the EnsemblGene in the list. This is useful for debugging. For example,Toronto#1146[Ensembl ID] : ENSG00000124766 => Ensembl#20660[Gene stable ID] : ENSG00000124766
indicates theToronto
source file for the list, in line1146
, in theEnsembl ID
column, containedENSG00000124766
. This was verified to be an active identifier in theEnsembl
source file line20660
in theGene stable ID
column. If the identifier was retired, this will contain a mapping from the retired identifier to the active one.
-
-
To compute the above, we begin with a set of manually curated "source of truth" files. These are TSV or CSV files under the
sources
sub-directory which have at least one column containing names (of some namespace). In addition we have a singlesources.yaml
file which contains a sequence of mappings with the following keys, as well as a comment describing the source:-
data_file
holds the name of the CSV or TSV source data file. -
has_header
is a Boolean specifying whether the data file has a header line (default:true
). -
columns
holds a mapping whose key is the column name (or 0-based index if there are no headers), and whose value is a name of a namespace.
The computed canonical list names are any EnsemblGene identifiers that are agreed on by the identifiers in all the columns of each list entry. For other namespaces, these are the identifiers that map to any of these EnsemblGenes. This is computed using
scripts/compute_list.py
. -
"The naming of cats is a difficult matter" - T. S. Eliot
We scrape data from several sources for maintaining a mapping between the different gene name spaces. We use the following data model for handling gene namespaces:
-
We use a separate set of namespaces for each organism. That is, EnsemblGene for human and mouse are two distinct unrelated namespaces. We do not address mapping of genes between species.
-
In each namespace, identifiers may be mapped to each other (to represent changes in the namespace over time). These mappings are directional. For EnsemblGene, we consider all the identifiers that do not map to anything to be "different"; identifiers in all other namespaces are mapped to a set of such "different" identifiers (ideally, to just one).
-
It is assumed that once an identifier was added to a namespace, it is never fully removed. It may be that multiple old identifiers are combined into a new one or that an old identifier is split into several new ones, or possibly that the identifier is "retired" (no longer used) - but would be kept to allow processing old data.
While most namespaces follow this rule, some (Ensembl!) don't make it easy to dump the full list of retired identifiers. Whenever we encounter specific retired identifiers for these namespaces we use Web APIs to fetch their data. That is, our list of identifiers in a namespace "should" contain all active identifiers, it may lack specific retired ones (open an issue if you have a set of specific retired identifiers you want us to add to the data).
In cases of a true ambiguity (an old gene being renamed to a new identifier, and another different gene given the old identifier) we (try to) collect the set of possible EnsemblGenes for the identifier, and hope this ambiguity will be resolved by other identifiers of other namespaces given to the same list entry.
-
Stored identifiers are "normalized". In most namespaces, this means removing the
.[0-9]
version suffix from the name. To lookup a name in a list or a namespace, you need to normalize the query gene name accordingly. The UCSC namespace is an exception in that the.[0-9]
suffix seems to be an inherent part of the identifier. By convention, names are in mixed case, different in different namespaces; for example,SOX4
for the human gene andSox4
for the mouse gene. -
We maintain directed mapping between the namespaces, that is, other than aliases that map a namespace to itself, all mappings bring us closer towards the EnsemblGene namespace.
To compute all the above, we begin with a set of "source of truth" files. These are TSV or CSV files under the sources
sub-directory which have at least two column containing names, to establish links between names (within the same
namespace or across namespaces).
We use a sources.yaml
file which contains a sequence of mappings with the following keys, as well as a comment
describing the source:
-
data_file
holds the name of the CSV or TSV source data file. We omit the.csv
or.tsv
suffix when we identify the source in provenance fields. -
links
holds a sequence of directed links between two identifiers. Each is a mapping with the keysfrom
,to
which contains a mapping with the following keys:column
identifies the column containing the identifier.namespace
contains the name of the namespace the identifier belongs to.separator
is an optional character used to separate multiple identifiers within the column (e.g.,|
).
Links are established either between a namespace and itself (from an alias or renamed identifier to a more current one) or between a namespace and EnsemblGene (or a namespace closer to EnsemblGene).
In addition to the above, the sources
sub-directory optionally contains the following:
-
namespace
.Missing.tsv
contains names and sources we have seen (in some list or data set) that do not exist in any of the source files. This is a temporary file which is read and deleted byscripts/complete_namespace.py
. This happens because some namespaces (Ensembl!) do not list all the names they actually know about in their "dump the whole database" data, because "reasons". -
namespace
.Extra.tsv
contain names and sources for missing names that we fetched from web APIs (usingscripts/complete_namespaces.py
). -
namespace
.Ignored.tsv
contains names and sources that we have looked up in the web APIs and couldn't find any data for. These names are not included in the namespace. Ideally, there shouldn't be any such names. In the best case, they are simply typos; in the worst case, these are names that were used once and were lost to history.
To represent the result, in the names
sub-directory we keep the following files:
-
namespace
.tsv
contains the following columns:-
name
contains the name of the gene in the namespace, in alphabetical order. For example,SOX4
. -
source
contains the source of the name. For exampleHGNC.Current#37136[symbol] : SOX4
indicates thatSOX4
was the value of thesymbol
column of theHGNC.Current
source file in line37136
. -
ensembl_gene
contains the name of the EnsemblGene the gene maps to. For example,ENSG00000124766
forSOX4
. -
ensembl_source
contains the source of the mapping of the gene to the EnsemblGene. For example,HGNC.Current#37136[symbol -> hgnc_id] : HGNC:11200 => Ensembl.HGNC#20684[HGNC ID -> Gene stable ID] : ENSG00000124766
indicates thatSOX
was mapped to the HGNC idHGNC:11200
by the association between columnshgnc_id
andsymbol
in line37136
of theHGNC.Current
source file. This was in turn mapped to the EnsemblGene identifierENSG00000124766
by the association between theHGNC ID
andGene stable ID
columns in line20684
of the source fileEnsembl.HGNC
.
-
We follow all the links, including the extra data, and compute for every identifier, in every namespace, the possible
EnsemblGenes it may map to. This computation is done by scripts/compute_namespaces.py
.
Updating this repository is done by adding new species, namespaces (sources) and lists (sources). Everything is rebuilt
by invoking make
at the top-level directory. If any of the added data refers to missing gene names, you will have to
re-run make
again to update the namespaces based on the recomputed Extra
files. To be certain just re-run make
until it says Nothing to be done
.