Dataset preparation

Nico Trummer edited this page Jan 24, 2024 · 33 revisions

Converting various data formats to AnnData (h5ad)

MTX format

This section assumes that you have the following files (the names may differ):

  • matrix.mtx : containing the gene expression matrix
  • features.tsv : containing the gene IDs and gene names
  • barcodes.tsv : containing the barcodes of the cells
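
Before converting, it can help to check that the three files belong together. The sketch below uses only the standard library and rests on assumptions that are not stated on this page: the files are uncompressed, the TSV files have no header row, and the matrix is stored as features x barcodes (rows x columns). The helper names read_mtx_shape, count_lines and check_mtx_triplet are hypothetical.

```python
def read_mtx_shape(mtx_path):
    """Return (n_rows, n_cols, n_entries) from the size line of a Matrix Market file."""
    with open(mtx_path) as handle:
        for line in handle:
            # Lines starting with '%' are the banner or comments; skip them and blank lines.
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, n_entries = (int(x) for x in line.split())
            return n_rows, n_cols, n_entries
    raise ValueError(f"No size line found in {mtx_path}")


def count_lines(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_mtx_triplet(mtx_path, barcodes_path, features_path):
    """Return True if the matrix shape matches the feature and barcode counts.

    Assumption: rows correspond to features and columns to barcodes.
    """
    n_rows, n_cols, _ = read_mtx_shape(mtx_path)
    return n_rows == count_lines(features_path) and n_cols == count_lines(barcodes_path)
```

If the check fails, the matrix may be stored in the other orientation, or the files may come from different samples.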

This format can be converted to AnnData using the following function:

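import anndata as ad
import pandas as pd
import scanpy as sc

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    """Load a matrix/barcodes/features triplet into a cells x genes AnnData object."""
    # The transpose assumes that the matrix is stored as genes x cells
    adata = sc.read_mtx(mtx_path).transpose()

    # Both files are read without a header row; features.tsv is expected to hold gene IDs and gene names
    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    features = pd.read_csv(features_path, header=None, sep='\t', names=['gene_ids', 'gene_names'])

    if sample_name is not None:
        # Prefix the barcodes with the sample name so that cell names stay unique across samples
        adata.obs_names = sample_name + "_" + barcodes['barcodes']
        adata.obs['sample'] = sample_name
    else:
        adata.obs_names = barcodes['barcodes']

    # Use the gene IDs as variable names and keep the gene names as an annotation column
    adata.var_names = features['gene_ids']
    adata.var['gene_names'] = features['gene_names'].values

    return adata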

The function can be used like this:

import os

from tqdm import tqdm

# Replace the placeholder with the path to your dataset directory (one subdirectory per sample)
data_path = "<dataset dir>"
sample_names = os.listdir(data_path)

adata_list = []

for sample_name in tqdm(sample_names):
    sample_dir = os.path.join(data_path, sample_name)

    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")

    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)

    # Make sure there are no duplicate gene identifiers in var_names
    single_adata.var_names_make_unique()

    adata_list.append(single_adata)

You might need to concatenate the individual AnnData objects afterwards:

adata = ad.concat(adata_list, join="outer")

Generating Metadata

Metadata must be annotated following the guidelines provided in the SIMBA ReadMe: Prepare data.

You can find the guidelines for more refined metadata annotation and for batch annotation here.