Dataset preparation

Converting various data formats to AnnData (h5ad)

MTX format

This section assumes that you have the following files (the names can differ):

matrix.mtx : containing the gene expression matrix
features.tsv : containing the gene IDs and gene names
barcodes.tsv : containing the barcodes of the cells

This format can be converted to AnnData using the following function:

import anndata as ad
import pandas as pd
import scanpy as sc

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # The matrix is stored as genes x cells; AnnData expects cells x genes
    adata = sc.read_mtx(mtx_path).transpose()

    # Read only the columns that are used; features.tsv may carry a third column (feature type)
    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', usecols=[0], names=['barcodes'])
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])

    # Prefix the barcodes with the sample name so they stay unique after concatenation
    obs_names = barcodes['barcodes'].astype(str)
    if sample_name is not None:
        obs_names = sample_name + "_" + obs_names
    adata.obs_names = obs_names.values
    adata.var_names = features['gene_ids'].astype(str).values
    adata.var['gene_names'] = features['gene_names'].values

    if sample_name is not None:
        adata.obs['sample'] = sample_name

    return adata

The function can be used like this:

import os

dataset_dir = "<dataset dir>"  # replace with the path to your dataset directory
sample_names = os.listdir(dataset_dir)

adata_list = []

for sample_name in sample_names:
    sample_dir = os.path.join(dataset_dir, sample_name)

    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")

    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)

    # Make sure there are no duplicate gene identifiers in var_names
    single_adata.var_names_make_unique()

    adata_list.append(single_adata)

To combine the individual AnnData objects into one, concatenate them afterwards (by default, ad.concat keeps only the genes present in every object):

adata = ad.concat(adata_list)
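
Because the file names can differ between datasets, the hard-coded names in the loop above may not match your data. The following is a minimal sketch of a lookup helper. The function name `find_mtx_files` and the candidate list are assumptions for illustration, based on common Cell Ranger output names (`genes.tsv` in older versions, gzipped files in newer ones); extend the list to match your files.

```python
from pathlib import Path

# Hypothetical candidate names, checked in order; adjust to your dataset.
CANDIDATES = {
    "matrix": ["matrix.mtx", "matrix.mtx.gz"],
    "barcodes": ["barcodes.tsv", "barcodes.tsv.gz"],
    "features": ["features.tsv", "features.tsv.gz", "genes.tsv", "genes.tsv.gz"],
}

def find_mtx_files(sample_dir: str) -> dict[str, str]:
    """Return the first existing candidate path for each of the three files."""
    found = {}
    for key, names in CANDIDATES.items():
        for name in names:
            path = Path(sample_dir) / name
            if path.is_file():
                found[key] = str(path)
                break
        else:
            # No candidate matched for this file type
            raise FileNotFoundError(f"No {key} file found in {sample_dir}")
    return found
```

The result can then be passed on as `load_mtx(paths["matrix"], paths["barcodes"], paths["features"], sample_name)`. pandas infers gzip compression from the file extension; it is worth checking that your scanpy version reads a compressed matrix directly.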

Coming from R

RDS files (*.rds, *.RDS) are supported by the pipeline. The conversion to h5ad goes through the R SingleCellExperiment type, so it must be possible to apply the Seurat as.SingleCellExperiment function to the object. Known working types are:

SingleCellExperiment
Seurat object

Other types might be supported too.

Formatting the metadata

In order to make the dataset usable with SIMBA🦁, the following metadata needs to be included:

| Field | Description | Axis | Default |
| --- | --- | --- | --- |
| `batch` | Batch identifier, for integration | obs | required |
| `cell_type` | Cell-type annotation | obs | `Unknown` |
| `condition` | The condition of the tissue sample | obs | `Unknown` |
| `sex` | The sex of the patient (`female` or `male`) | obs | `Unknown` |
| `patient` | The patient identifier | obs | required |
| `tissue` | The tissue type | obs | required |
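
As a rough illustration of the table above, the sketch below checks that the required fields are present and fills the optional ones with their default. The function name `prepare_obs` is hypothetical and not part of the pipeline; it operates on a pandas DataFrame such as `adata.obs`.

```python
import pandas as pd

REQUIRED = ["batch", "patient", "tissue"]
OPTIONAL_DEFAULTS = {"cell_type": "Unknown", "condition": "Unknown", "sex": "Unknown"}

def prepare_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Check required fields and fill optional fields with their default (hypothetical helper)."""
    missing = [column for column in REQUIRED if column not in obs.columns]
    if missing:
        raise ValueError(f"Missing required obs columns: {missing}")

    out = obs.copy()
    for column, default in OPTIONAL_DEFAULTS.items():
        if column in out.columns:
            # Cast to object first so categorical columns accept the new value
            out[column] = out[column].astype(object).fillna(default)
        else:
            out[column] = default
    return out
```

It could be applied as `adata.obs = prepare_obs(adata.obs)` before handing the dataset to the pipeline.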

Some additional information:

Patient and batch identifiers will be prepended with the dataset identifier to ensure uniqueness.
Only alphanumeric characters and underscores are allowed for all metadata fields.
The first character must be a letter.
All additional metadata columns will be discarded.
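
The naming rules above can be checked before submission. This is a minimal sketch, assuming that the rules apply to the values stored in each metadata field and that "alphanumeric" means ASCII letters and digits. Both function names and the `x` prefix used by `sanitize` are hypothetical choices, not pipeline behavior.

```python
import re

# A letter first, followed by letters, digits, or underscores
VALID_VALUE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def invalid_values(values) -> list[str]:
    """Return the distinct values that break the naming rules, sorted."""
    return sorted({str(v) for v in values if VALID_VALUE.fullmatch(str(v)) is None})

def sanitize(value: str) -> str:
    """Hypothetical clean-up: map disallowed characters to underscores and
    prefix values that do not start with a letter."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", str(value))
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "x" + cleaned
    return cleaned
```

For example, `invalid_values(adata.obs["cell_type"].unique())` lists the annotations that would need renaming.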

Batch annotation

Batch annotation plays an important role since it will limit the corrections performed on the datasets. Given a dataset, a batch is defined as the group of samples that:

Come from the same study
Are prepared in the same way
Are generated with the same sequencing protocol

Given this definition, a dataset containing, for example, samples from healthy individuals, samples from tumor tissue, and TIL-enriched tumor samples needs to be divided into three batches.
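
The definition can be sketched as grouping samples by study, preparation, and sequencing protocol. The example below is illustrative only: the sample records, the field names, and the `batch1`, `batch2`, ... labels are made up, and it assumes the three groups from the example differ in how they were prepared.

```python
def assign_batches(samples: list[dict]) -> dict[str, str]:
    """Give samples with identical study, preparation, and protocol the same batch label."""
    batch_of_key = {}
    assignment = {}
    for sample in samples:
        key = (sample["study"], sample["preparation"], sample["protocol"])
        if key not in batch_of_key:
            # First time this combination is seen: open a new batch
            batch_of_key[key] = f"batch{len(batch_of_key) + 1}"
        assignment[sample["name"]] = batch_of_key[key]
    return assignment

# Hypothetical sample sheet for a single study
samples = [
    {"name": "healthy_1", "study": "study_A", "preparation": "healthy_tissue", "protocol": "protocol_X"},
    {"name": "healthy_2", "study": "study_A", "preparation": "healthy_tissue", "protocol": "protocol_X"},
    {"name": "tumor_1", "study": "study_A", "preparation": "tumor_tissue", "protocol": "protocol_X"},
    {"name": "tumor_til_1", "study": "study_A", "preparation": "til_enriched_tumor", "protocol": "protocol_X"},
]
batches = assign_batches(samples)
```

The resulting mapping can be written to the `batch` column, for instance with `adata.obs["batch"] = adata.obs["sample"].map(batches)`.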

If anything remains unclear, feel free to open a GitHub issue.
