Dataset preparation
This section assumes that you have the following files (names can differ):
- `matrix.mtx`: contains the gene expression matrix
- `features.tsv`: contains the gene identifiers and gene names
- `barcodes.tsv`: contains the barcodes of the cells
This format can be converted to AnnData using the following function:
import anndata as ad
import pandas as pd
import scanpy as sc

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # The matrix is transposed after reading so that cells end up in obs and genes in var.
    adata = sc.read_mtx(mtx_path).transpose()
    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    # Read only the first two columns so that files with additional columns are still parsed correctly.
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])
    # Prefix the barcodes with the sample name so they stay unique across samples.
    adata.obs_names = sample_name + "_" + barcodes['barcodes'] if sample_name is not None else barcodes['barcodes']
    adata.var_names = features['gene_ids']
    adata.var['gene_names'] = features['gene_names'].values
    if sample_name is not None:
        adata.obs['sample'] = sample_name
    return adata
The function can be used like this:
import os
from tqdm import tqdm

data_path = "<dataset dir>"  # placeholder: directory with one sub-directory per sample
sample_names = os.listdir(data_path)
adata_list = []
for sample_name in tqdm(sample_names):
    sample_dir = os.path.join(data_path, sample_name)
    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")
    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)
    # Make sure there are no duplicate gene identifiers in var_names
    single_adata.var_names_make_unique()
    adata_list.append(single_adata)
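Before running the loop above, it can help to check that every sample directory actually contains the three files. The following is a minimal sketch using only the standard library; the helper name `find_incomplete_samples` and the constant `EXPECTED_FILES` are hypothetical, and the file names are the ones assumed in this section (adjust them if yours differ).

```python
import os

# File names assumed from the list in this section; change them if yours differ.
EXPECTED_FILES = ("matrix.mtx", "barcodes.tsv", "features.tsv")


def find_incomplete_samples(data_path: str) -> dict[str, list[str]]:
    """Map each sample directory name to the expected files it is missing."""
    incomplete = {}
    for sample_name in sorted(os.listdir(data_path)):
        sample_dir = os.path.join(data_path, sample_name)
        if not os.path.isdir(sample_dir):
            continue  # skip stray files lying next to the sample directories
        missing = [
            file_name
            for file_name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(sample_dir, file_name))
        ]
        if missing:
            incomplete[sample_name] = missing
    return incomplete
```

An empty result means every sample directory is complete; otherwise the returned dictionary lists what to fix before loading.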
You might need to concatenate the individual AnnData objects afterwards:
adata = ad.concat(adata_list, join="outer")
This section will be added soon.
To make the AnnData file usable with SIMBA, the following metadata fields need to be included:
Field | Description | Axis | Default |
---|---|---|---|
batch | Batch identifier, for integration | obs | required |
cell_type | Cell-type annotation | obs | Unknown |
condition | The condition of the tissue sample | obs | Unknown |
sex | The sex of the patient (female or male) | obs | Unknown |
patient | The patient identifier | obs | required |
tissue | The tissue type | obs | required |
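As a quick pre-check, the table above can be turned into a small lookup that reports which required fields are missing and which optional fields would fall back to their default. This is only a sketch: the names `METADATA_FIELDS`, `missing_required_fields` and `defaults_to_fill` are hypothetical and not part of SIMBA, and it is an assumption that you want to fill the defaults yourself.

```python
from collections.abc import Iterable

# Field name -> default value, transcribed from the table above.
# None marks a field listed as "required", i.e. one without a default.
METADATA_FIELDS = {
    "batch": None,
    "cell_type": "Unknown",
    "condition": "Unknown",
    "sex": "Unknown",
    "patient": None,
    "tissue": None,
}


def missing_required_fields(columns: Iterable[str]) -> list[str]:
    """Return the required fields that are absent from the given column names."""
    present = set(columns)
    return [
        field
        for field, default in METADATA_FIELDS.items()
        if default is None and field not in present
    ]


def defaults_to_fill(columns: Iterable[str]) -> dict[str, str]:
    """Return the optional fields that are absent, mapped to their default value."""
    present = set(columns)
    return {
        field: default
        for field, default in METADATA_FIELDS.items()
        if default is not None and field not in present
    }
```

With an AnnData object, you would pass `adata.obs.columns` to both functions and, for each entry returned by `defaults_to_fill`, assign `adata.obs[field] = default`.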
Batch annotation plays an important role because it limits the corrections performed on the datasets. Within a dataset, a batch is defined as a group of samples that:
- Come from the same study
- Are prepared in the same way
- Are generated with the same sequencing protocol
Given this information, a dataset containing, for example, samples from healthy individuals, samples from tumor tissue, and TIL-enriched tumor samples needs to be divided into 3 batches.
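The batch definition can be sketched as grouping samples by the combination of study, preparation, and sequencing protocol. The `SampleInfo` record, its attribute names, the `assign_batches` helper, and the example values are all hypothetical; they only illustrate the three-batch example from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleInfo:
    # Hypothetical record; these attribute names are not part of SIMBA.
    name: str
    study: str
    preparation: str
    protocol: str


def assign_batches(samples: list[SampleInfo]) -> dict[str, str]:
    """Give samples that share study, preparation and protocol the same batch label."""
    batch_of_key: dict[tuple[str, str, str], str] = {}
    batches: dict[str, str] = {}
    for sample in samples:
        key = (sample.study, sample.preparation, sample.protocol)
        if key not in batch_of_key:
            # Labels are numbered in order of first appearance.
            batch_of_key[key] = f"batch_{len(batch_of_key) + 1}"
        batches[sample.name] = batch_of_key[key]
    return batches


# Made-up samples: two healthy, one tumor, one TIL-enriched tumor sample.
samples = [
    SampleInfo("s1", "study_a", "healthy", "protocol_x"),
    SampleInfo("s2", "study_a", "healthy", "protocol_x"),
    SampleInfo("s3", "study_a", "tumor", "protocol_x"),
    SampleInfo("s4", "study_a", "til_enriched_tumor", "protocol_x"),
]
batches = assign_batches(samples)
print(sorted(set(batches.values())))  # ['batch_1', 'batch_2', 'batch_3']
```

The resulting mapping from sample name to batch label could then be written to the `batch` column in `obs`.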
Some additional information:
- Patient and batch identifiers will be prepended with the dataset identifier to ensure uniqueness.
- Only alphanumeric characters and underscores are allowed for all metadata fields.
- The first character must be a letter.
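The naming rules above can be checked with a regular expression before submitting a dataset. This is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits; the names `VALID_VALUE`, `is_valid_value` and `invalid_values` are hypothetical helpers, not SIMBA functions.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
# A letter first, then any number of letters, digits or underscores.
VALID_VALUE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_value(value: str) -> bool:
    """Check a single metadata value against the naming rules."""
    return VALID_VALUE.fullmatch(value) is not None


def invalid_values(values) -> list[str]:
    """Return the distinct values that violate the rules, in first-seen order."""
    distinct = dict.fromkeys(str(value) for value in values)
    return [value for value in distinct if not is_valid_value(value)]
```

For example, `invalid_values(adata.obs["cell_type"])` would list annotations such as `T cell` or `CD8+` that need to be renamed, while the default `Unknown` passes the check.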
If anything remains unclear, feel free to open a GitHub issue.