Dataset preparation

Converting various data formats to AnnData (h5ad)

MTX format

This section assumes that you have the following files (the names can differ):

matrix.mtx : containing the gene expression matrix
features.tsv : containing the gene IDs and gene names
barcodes.tsv : containing the barcodes of the cells

This format can be converted to AnnData using the following function:

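import anndata as ad
import pandas as pd
import scanpy as sc

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # The matrix is read as genes x cells, so transpose it to AnnData's cells x genes layout
    adata = sc.read_mtx(mtx_path).transpose()

    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    # Read only the first two columns (gene ID, gene name); some files carry extra columns
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])

    if sample_name is not None:
        # Prefix the barcodes with the sample name so they stay unique across samples
        adata.obs_names = (sample_name + "_" + barcodes['barcodes']).values
        adata.obs['sample'] = sample_name
    else:
        adata.obs_names = barcodes['barcodes'].values

    # Index the genes by ID and keep the gene names as an annotation column
    adata.var_names = features['gene_ids'].values
    adata.var['gene_names'] = features['gene_names'].values

    return adata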

The function can be used like this:

import os
from tqdm import tqdm

data_path = "<dataset dir>"  # directory containing one sub-directory per sample
sample_names = os.listdir(data_path)

adata_list = []

for sample_name in tqdm(sample_names):
    sample_dir = os.path.join(data_path, sample_name)

    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")

    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)

    # Make sure there are no duplicate variable names
    single_adata.var_names_make_unique()

    adata_list.append(single_adata)

You might need to concatenate the per-sample AnnData objects afterwards:

adata = ad.concat(adata_list, join="outer")

Seurat format

This section will be added soon.

Formatting the metadata

To make the AnnData file usable with SIMBA, the following metadata fields need to be included. Fields marked `required` in the Default column have no default value:

Field	Description	Axis	Default
batch	Batch identifier, for integration	obs	required
cell_type	Cell-type annotation	obs	Unknown
condition	The condition of the tissue sample	obs	Unknown
sex	The sex of the patient (`female` or `male`)	obs	Unknown
patient	The patient identifier	obs	required
tissue	The tissue type	obs	required
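
As a rough sketch of how the table could be applied, the helper below checks that the required columns exist and fills the remaining ones with `Unknown`. It works on a plain pandas DataFrame standing in for `adata.obs`. The function name `prepare_obs` and the decision to fill the defaults yourself before handing the file to SIMBA are assumptions, not part of SIMBA:

```python
import pandas as pd

# Field names and defaults taken from the table above
REQUIRED_FIELDS = ("batch", "patient", "tissue")
OPTIONAL_DEFAULTS = {"cell_type": "Unknown", "condition": "Unknown", "sex": "Unknown"}


def prepare_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of obs with all metadata fields present (hypothetical helper)."""
    missing = [field for field in REQUIRED_FIELDS if field not in obs.columns]
    if missing:
        raise ValueError(f"Missing required metadata fields: {missing}")

    obs = obs.copy()
    for field, default in OPTIONAL_DEFAULTS.items():
        if field not in obs.columns:
            # Column is absent: create it with the default value
            obs[field] = default
        else:
            # Column exists: replace missing entries with the default value
            column = obs[field].astype("object")
            obs[field] = column.where(column.notna(), default)
    return obs


example_obs = pd.DataFrame(
    {
        "batch": ["b1", "b1"],
        "patient": ["p1", "p2"],
        "tissue": ["lung", "lung"],
        "sex": ["female", None],
    },
    index=["cell_1", "cell_2"],
)
prepared = prepare_obs(example_obs)
print(prepared)
```

With a real dataset, the result could be assigned back via `adata.obs = prepare_obs(adata.obs)`.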

The batch annotation plays an important role because it limits the corrections performed on the datasets. Within a dataset, a batch is defined as a group of samples that:

Come from the same study
Are prepared in the same way
Are generated with the same sequencing protocol

Given this definition, a dataset containing, for example, samples from healthy individuals, samples from tumor tissue, and TIL-enriched tumor samples needs to be divided into three batches.
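
The batch definition can be sketched as a grouping over the three criteria. The sample sheet below is made up for illustration, and the column names `study`, `preparation` and `protocol` as well as the `batch_<n>` naming are assumptions:

```python
import pandas as pd

# Hypothetical sample sheet: one study, one protocol, three kinds of preparation
samples = pd.DataFrame(
    {
        "sample": ["s1", "s2", "s3", "s4"],
        "study": ["study_a"] * 4,
        "preparation": ["healthy", "tumor", "tumor_til_enriched", "tumor"],
        "protocol": ["protocol_x"] * 4,
    }
)

# Samples that share study, preparation and protocol fall into the same batch
group_number = samples.groupby(["study", "preparation", "protocol"], sort=False).ngroup()
samples["batch"] = "batch_" + (group_number + 1).astype(str)

print(samples[["sample", "batch"]])
```

Here `s2` and `s4` end up in the same batch, giving three batches in total. The per-sample label can then be mapped onto the cells, for example through the `sample` column in `adata.obs`.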

Some additional information:

Patient and batch identifiers will be prepended with the dataset identifier to ensure uniqueness.
Only alphanumeric characters and underscores are allowed for all metadata fields.
The first character must be a letter.
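
The two character rules can be checked before submitting a dataset. This is a minimal sketch that assumes the rules apply to the metadata values and that "letter" means an ASCII letter; the function name `is_valid_metadata_value` is hypothetical:

```python
import re

# First character: a letter; remaining characters: letters, digits or underscores
_VALID_VALUE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_metadata_value(value: str) -> bool:
    """Return True if value satisfies the naming rules listed above."""
    return _VALID_VALUE.fullmatch(value) is not None


for candidate in ["patient_01", "Unknown", "1patient", "T-cell", ""]:
    print(repr(candidate), is_valid_metadata_value(candidate))
```

To check a whole column, the function can be applied to its unique values, for example `[v for v in adata.obs["cell_type"].unique() if not is_valid_metadata_value(str(v))]`.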

If anything remains unclear, feel free to open a GitHub issue.
