Dataset preparation

Nico Trummer edited this page Jan 26, 2024 · 33 revisions

Converting various data formats to AnnData (h5ad)

MTX format

This section assumes that you have the following files (names can differ):

  • matrix.mtx : containing the gene expression matrix
  • features.tsv : containing the gene IDs and gene names
  • barcodes.tsv : containing the barcodes of the cells
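
Before converting, it can help to confirm that the three files belong together. The following is a minimal sketch, not part of the pipeline: it assumes uncompressed files, a sparse (coordinate) Matrix Market file, and a matrix stored with genes as rows and cells as columns. The helper names `read_mtx_shape`, `count_lines` and `check_mtx_files` are hypothetical.

```python
def read_mtx_shape(mtx_path: str) -> tuple[int, int, int]:
    """Return (rows, columns, stored entries) from a Matrix Market size line."""
    with open(mtx_path) as handle:
        for line in handle:
            # Lines starting with '%' are the banner or comments; skip them
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_entries
    raise ValueError(f"No size line found in {mtx_path}")


def count_lines(path: str) -> int:
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_mtx_files(mtx_path: str, barcodes_path: str, features_path: str) -> None:
    """Raise ValueError if the matrix shape does not match the TSV files."""
    n_rows, n_cols, _ = read_mtx_shape(mtx_path)
    n_features = count_lines(features_path)
    n_barcodes = count_lines(barcodes_path)
    # Assumption: rows are genes and columns are cells
    if (n_rows, n_cols) != (n_features, n_barcodes):
        raise ValueError(
            f"Matrix is {n_rows} x {n_cols}, but found "
            f"{n_features} features and {n_barcodes} barcodes"
        )
```

If the check fails with the two counts swapped, the matrix is probably stored as cells x genes and does not need to be transposed.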

This format can be converted to AnnData using the following function:

import anndata as ad
import pandas as pd
import scanpy as sc

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # The matrix is read as genes x cells, so transpose it to cells x genes
    adata = sc.read_mtx(mtx_path).transpose()

    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    # Only read the first two columns; some features files carry additional ones
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])

    if sample_name is not None:
        # Prefix the barcodes with the sample name so they stay unique across samples
        adata.obs_names = (sample_name + "_" + barcodes['barcodes']).values
        adata.obs['sample'] = sample_name
    else:
        adata.obs_names = barcodes['barcodes'].values

    adata.var_names = features['gene_ids'].values
    adata.var['gene_names'] = features['gene_names'].values

    return adata

The function can be used like this:

import os

dataset_dir = "<dataset dir>"  # directory containing one subdirectory per sample
sample_names = os.listdir(dataset_dir)

adata_list = []

for sample_name in sample_names:
    sample_dir = os.path.join(dataset_dir, sample_name)

    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")

    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)

    # Make sure the variable names (gene IDs) are unique
    single_adata.var_names_make_unique()

    adata_list.append(single_adata)

You might need to concatenate the individual AnnData objects afterwards:

adata = ad.concat(adata_list)

Coming from R

RDS files (*.rds, *.RDS) are supported by the pipeline. The conversion to h5ad goes through the R SingleCellExperiment type, so the Seurat as.SingleCellExperiment function must be applicable to the object. Known working types are:

  • SingleCellExperiment
  • Seurat object

Other types might be supported too.

Formatting the metadata

In order to make the dataset usable with SIMBA🦁, the following metadata needs to be included:

| Field     | Description                             | Axis | Default  |
|-----------|-----------------------------------------|------|----------|
| batch     | Batch identifier, for integration       | obs  | required |
| cell_type | Cell-type annotation                    | obs  | Unknown  |
| condition | The condition of the tissue sample      | obs  | Unknown  |
| sex       | The sex of the patient (female or male) | obs  | Unknown  |
| patient   | The patient identifier                  | obs  | required |
| tissue    | The tissue type                         | obs  | required |

Some additional information:

  • Patient and batch identifiers will be prepended with the dataset identifier to ensure uniqueness.
  • Only alphanumeric characters and underscores are allowed for all metadata fields.
  • The first character must be a letter.
  • All additional metadata columns will be discarded.
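
The rules above can be checked before handing a dataset to the pipeline. The following is a minimal sketch, not the pipeline's own implementation: it assumes the character rules apply to the values stored in each field, and it works on a plain mapping from column name to a list of values. `FIELD_DEFAULTS`, `VALID_VALUE` and `format_metadata` are hypothetical names.

```python
import re

# Fields and defaults as listed in the table above; None marks a required field
FIELD_DEFAULTS = {
    "batch": None,
    "cell_type": "Unknown",
    "condition": "Unknown",
    "sex": "Unknown",
    "patient": None,
    "tissue": None,
}

# A letter first, then only letters, digits and underscores
VALID_VALUE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def format_metadata(columns: dict[str, list], n_cells: int) -> dict[str, list[str]]:
    """Return only the expected fields, filling defaults and validating values."""
    formatted = {}
    for field, default in FIELD_DEFAULTS.items():
        if field in columns:
            values = [str(v) for v in columns[field]]
        elif default is None:
            raise KeyError(f"Required metadata field is missing: {field}")
        else:
            # Optional field that is absent: fill with the default value
            values = [default] * n_cells
        invalid = sorted({v for v in values if not VALID_VALUE.fullmatch(v)})
        if invalid:
            raise ValueError(f"Invalid values in '{field}': {invalid}")
        formatted[field] = values
    # Columns not listed in FIELD_DEFAULTS are left out of the result
    return formatted
```

With an AnnData object, the input mapping could be built from `adata.obs`, for example `{name: adata.obs[name].astype(str).tolist() for name in adata.obs.columns}`.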

Batch annotation

Batch annotation plays an important role since it will limit the corrections performed on the datasets. Given a dataset, a batch is defined as the group of samples that:

  • Come from the same study
  • Are prepared in the same way
  • Are generated with the same sequencing protocol

Given these criteria, a dataset containing, for example, samples from healthy individuals, samples from tumor tissue and TIL-enriched tumor samples will need to be divided into three batches.
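
As an illustration, the three criteria can be treated as a grouping key. This is only a sketch under assumptions: the sample names, the study, preparation and protocol labels, and the `assign_batches` helper are all hypothetical, and treating healthy, tumor and TIL-enriched samples as three different preparations is one reading of the example above.

```python
def assign_batches(samples: dict[str, tuple[str, str, str]]) -> dict[str, str]:
    """Map each sample name to a batch label.

    `samples` maps a sample name to (study, preparation, protocol).
    Samples that share all three values receive the same batch label.
    """
    batch_of_key: dict[tuple[str, str, str], str] = {}
    batches: dict[str, str] = {}
    for sample, key in samples.items():
        if key not in batch_of_key:
            # First time this combination is seen: open a new batch
            batch_of_key[key] = f"batch_{len(batch_of_key) + 1}"
        batches[sample] = batch_of_key[key]
    return batches


# Hypothetical samples from one study, all sequenced with the same protocol
samples = {
    "healthy_1": ("study_A", "healthy", "protocol_X"),
    "healthy_2": ("study_A", "healthy", "protocol_X"),
    "tumor_1": ("study_A", "tumor", "protocol_X"),
    "til_1": ("study_A", "tumor_TIL_enriched", "protocol_X"),
}
batches = assign_batches(samples)
```

Here the two healthy samples share one batch label, while the tumor and TIL-enriched samples each get their own, giving three batches in total. The resulting labels could then be stored in the `batch` field of `obs`.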