
Dataset preparation

Nico Trummer edited this page Jan 26, 2024 · 33 revisions

Converting various data formats to AnnData (h5ad)

MTX format

This section assumes that you have the following files (the names can differ):

  • matrix.mtx : containing the gene expression matrix
  • features.tsv : containing the gene IDs and gene names
  • barcodes.tsv : containing the barcodes of the cells

This format can be converted to AnnData using the following function:

import anndata as ad
import pandas as pd
import scanpy as sc

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # The matrix file is assumed to be genes x cells; AnnData expects cells x genes
    adata = sc.read_mtx(mtx_path).transpose()

    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    # Read only the first two columns (gene ID, gene name). Without usecols, a file
    # with extra columns would make pandas use the first column as the index.
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])

    # Prefix the barcodes with the sample name so cell names stay unique across samples
    if sample_name is not None:
        adata.obs_names = sample_name + "_" + barcodes['barcodes']
        adata.obs['sample'] = sample_name
    else:
        adata.obs_names = barcodes['barcodes']

    adata.var_names = features['gene_ids']
    adata.var['gene_names'] = features['gene_names'].values

    return adata
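
The transpose in load_mtx is easy to overlook. The sketch below illustrates why it is needed, assuming the matrix file stores genes as rows and cells as columns (the layout that load_mtx implies). It uses only scipy on a made-up 3 x 2 toy matrix; the file is written to a temporary directory purely for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

# Toy count matrix in the assumed on-disk orientation:
# 3 genes (rows) x 2 cells (columns). The values are made up.
genes_by_cells = sparse.coo_matrix(np.array([[1, 0], [0, 2], [3, 4]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    mtx_path = os.path.join(tmp_dir, "matrix.mtx")
    io.mmwrite(mtx_path, genes_by_cells)  # write Matrix Market format
    loaded = io.mmread(mtx_path)          # read it back as a sparse matrix

# AnnData stores observations (cells) as rows and variables (genes) as columns,
# so the matrix has to be transposed before barcodes and features are attached.
cells_by_genes = loaded.T

print(loaded.shape)          # (3, 2): genes x cells
print(cells_by_genes.shape)  # (2, 3): cells x genes
```

If the number of barcodes does not match the number of rows after transposing, the files most likely do not belong together or the matrix uses a different orientation.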

The load_mtx function can be used like this:

import os

from tqdm import tqdm

# Directory that contains one subdirectory per sample
data_path = "<dataset dir>"
sample_names = os.listdir(data_path)

adata_list = []

for sample_name in tqdm(sample_names):
    sample_dir = os.path.join(data_path, sample_name)

    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")

    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)

    # Make sure there are no duplicate variable names
    single_adata.var_names_make_unique()

    adata_list.append(single_adata)

You may need to concatenate the per-sample AnnData objects afterwards:

adata = ad.concat(adata_list, join="outer")

Seurat format

Seurat objects are supported by the pipeline. Make sure that the object contains the required metadata columns and save it as an .rds/.RDS file. The dataset is converted to an AnnData file at the beginning of the pipeline.

Formatting the metadata

To make the AnnData file usable with SIMBA🦁, its metadata needs to contain the fields listed below. Fields marked as required in the Default column must be provided; the others fall back to the listed default.

| Field     | Description                             | Axis | Default  |
|-----------|-----------------------------------------|------|----------|
| batch     | Batch identifier, for integration       | obs  | required |
| cell_type | Cell-type annotation                    | obs  | Unknown  |
| condition | The condition of the tissue sample      | obs  | Unknown  |
| sex       | The sex of the patient (female or male) | obs  | Unknown  |
| patient   | The patient identifier                  | obs  | required |
| tissue    | The tissue type                         | obs  | required |

Some additional information:

  • Patient and batch identifiers will be prepended with the dataset identifier to ensure uniqueness.
  • Only alphanumeric characters and underscores are allowed for all metadata fields.
  • The first character must be a letter.
  • All additional metadata columns will be discarded.
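
The rules above can be checked before handing a dataset to the pipeline. The following is a minimal sketch, not part of the pipeline itself: the check_metadata helper is hypothetical, the character rule is assumed to apply to the values of each field, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

import pandas as pd

# Field lists taken from the metadata table; the split mirrors its Default column.
REQUIRED_FIELDS = ["batch", "patient", "tissue"]
DEFAULTED_FIELDS = ["cell_type", "condition", "sex"]

# Assumption: a letter first, then only ASCII letters, digits and underscores.
VALID_VALUE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_metadata(obs: pd.DataFrame) -> list[str]:
    """Return a list of problems found in obs; an empty list means all checks passed."""
    problems = []

    for field in REQUIRED_FIELDS:
        if field not in obs.columns:
            problems.append(f"missing required field: {field}")

    for field in REQUIRED_FIELDS + DEFAULTED_FIELDS:
        if field not in obs.columns:
            continue
        values = obs[field].astype(str).unique()
        bad = sorted(v for v in values if VALID_VALUE.fullmatch(v) is None)
        if bad:
            problems.append(f"invalid values in {field}: {bad}")

    return problems


# Made-up example: tissue is missing and one patient identifier starts with a digit.
obs = pd.DataFrame(
    {
        "batch": ["b1", "b1", "b2"],
        "patient": ["P01", "P01", "2nd-patient"],
        "sex": ["female", "female", "male"],
    }
)
problems = check_metadata(obs)
print(problems)
```

For an AnnData object, the same helper would be called as check_metadata(adata.obs).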

Batch annotation

Batch annotation plays an important role since it will limit the corrections performed on the datasets. Given a dataset, a batch is defined as the group of samples that:

  • Come from the same study
  • Are prepared in the same way
  • Are generated with the same sequencing protocol

Given these criteria, a dataset containing e.g. samples from healthy individuals, samples from tumor tissue and TIL-enriched tumor samples will need to be divided into three batches.
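
If the three criteria are available as columns in obs, the batch label can be derived from them instead of being assigned by hand. The sketch below assumes hypothetical columns named study, preparation and protocol and made-up values that mirror the example above; real datasets will use different names.

```python
import pandas as pd

# Made-up per-cell metadata: one study, one protocol, three kinds of preparation.
obs = pd.DataFrame(
    {
        "study": ["studyA"] * 6,
        "preparation": [
            "healthy_tissue", "healthy_tissue",
            "tumor_tissue", "tumor_tissue",
            "til_enriched_tumor", "til_enriched_tumor",
        ],
        "protocol": ["protocolX"] * 6,
    },
    index=[f"cell{i}" for i in range(6)],
)

# Cells that agree on all three criteria end up in the same batch.
batch_keys = ["study", "preparation", "protocol"]
group_ids = obs.groupby(batch_keys, sort=False).ngroup()

# Prefix with a letter so the labels contain only letters, digits and underscores.
obs["batch"] = "batch_" + group_ids.astype(str)

print(obs["batch"].value_counts())
```

With sort=False the groups are numbered in order of first appearance, so the labels are stable as long as the row order does not change.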