asctb_ct_label_mapper is a package that enforces a controlled vocabulary for annotations of scRNA-seq datasets. The goal is to enable cross-dataset or cross-experiment comparison of data by aligning annotations to a standard reference point.
Given an annotated scRNA-seq dataset for a specific organ (.h5ad/.rds), you can create a translation file that maps raw labels to the ASCT+B naming convention.
- Create the reference embeddings by fetching the corresponding ASCT+B organ table (latest version):
  - Fetch the ASCT+B dataset from the ASCT+B Master Tables.
  - Parse the data into three wrangled columns: CT-ID, CT-Name, CT-Label.
  - Fetch the Description of each unique CT-ID from Cell Ontology.
  - Use NLP-preprocessing best practices for the text fields.
  - Use a Sentence-Transformer model hosted on Hugging Face to create embeddings of shape c x 768 (c is the number of unique CTs in the ASCT+B Master table).
- For each input raw cell-type annotation or cluster label, create its embedding and compare it against the reference embeddings generated in the first step.
- Identify the best-matching ASCT+B label for the input raw label.
- You can also visualize the agreement of cross-dataset annotations before and after using ASCTB CT Label Mapper.
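
The matching steps above can be sketched in a few lines. This is a minimal illustration, not the package's actual code: the function names (`normalize_label`, `cosine`, `best_match`) are hypothetical, cosine similarity is assumed as the comparison measure, and toy 3-dimensional vectors stand in for the 768-dimensional embeddings a Sentence-Transformer model would produce.

```python
import math
import re


def normalize_label(text: str) -> str:
    """Lowercase, turn separators into spaces, drop other punctuation."""
    text = text.lower()
    text = re.sub(r"[_\-/]+", " ", text)      # separators -> spaces
    text = re.sub(r"[^a-z0-9+ ]", "", text)   # keep letters, digits, '+'
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(query_vec, reference):
    """Return (label, score) of the reference entry most similar to query_vec."""
    return max(
        ((label, cosine(query_vec, vec)) for label, vec in reference.items()),
        key=lambda pair: pair[1],
    )


# Assumption: toy vectors in place of real model embeddings.
reference = {
    "podocyte": [0.9, 0.1, 0.0],
    "fibroblast": [0.1, 0.8, 0.3],
}
cleaned = normalize_label("Podocyte_cells (POD)")
label, score = best_match([0.8, 0.2, 0.1], reference)
print(cleaned, "->", label, round(score, 3))
```

In a real run, the query vector would be the embedding of the cleaned raw label, and `reference` would hold one embedding per unique CT in the ASCT+B table.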
A walkthrough is available on Google Colab here.
An expert then provides feedback to finalize the translation from each query annotation label to its ASCT+B annotation label.
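
One way to picture the expert-review step is a translation file with a blank column for the reviewer's verdict. The sketch below is an assumption about how such a file could look, not the package's actual format: the column names (`raw_label`, `asctb_label`, `score`, `expert_decision`), the `accept` keyword, and both helper functions are hypothetical.

```python
import csv
import io

# Hypothetical column names; the package's actual file layout may differ.
FIELDS = ["raw_label", "asctb_label", "score", "expert_decision"]


def write_translation(rows, handle):
    """Write candidate mappings, leaving expert_decision blank for review."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for raw, asctb, score in rows:
        writer.writerow({"raw_label": raw, "asctb_label": asctb,
                         "score": f"{score:.3f}", "expert_decision": ""})


def finalize(rows):
    """Keep only the mappings an expert marked as 'accept'."""
    return {r["raw_label"]: r["asctb_label"]
            for r in rows
            if r["expert_decision"].strip().lower() == "accept"}


# In-memory buffer used here instead of a file on disk.
buffer = io.StringIO()
write_translation([("POD", "podocyte", 0.984),
                   ("Fib-1", "fibroblast", 0.412)], buffer)

rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
rows[0]["expert_decision"] = "accept"  # stands in for manual expert review
mapping = finalize(rows)
print(mapping)
```

Keeping the similarity score next to each candidate lets the reviewer focus on low-scoring matches first.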