Dataset preparation

serareif edited this page Aug 1, 2024 · 33 revisions

Converting various data formats to AnnData (h5ad)

MTX format

This section assumes that you have a directory structure like this:

<dataset dir>
├── sample1
│   ├── matrix.mtx
│   ├── barcodes.tsv
│   └── features.tsv
├── sample2
│   ├── matrix.mtx
│   ├── barcodes.tsv
│   └── features.tsv
└── ...

If your directory structure differs (e.g. different file names), please adjust the code accordingly.

This format can be converted to AnnData using the following function:

import anndata as ad
import pandas as pd
import scanpy as sc
import os

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # The matrix is expected as genes x cells, so transpose to AnnData's cells x genes layout
    adata = sc.read_mtx(mtx_path).transpose()

    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    # Read only the first two columns; some features.tsv files carry a third column (e.g. a feature type)
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])

    if sample_name is not None:
        # Prefix barcodes with the sample name so they stay unique across samples
        adata.obs_names = (sample_name + "_" + barcodes['barcodes']).values
        adata.obs['sample'] = sample_name
    else:
        adata.obs_names = barcodes['barcodes'].values

    adata.var_names = features['gene_ids'].values
    adata.var['gene_names'] = features['gene_names'].values

    return adata

If you have multiple samples, you can iterate over them like this:

dataset_dir = "<dataset dir>"
sample_names = os.listdir(dataset_dir)

adata_list = []

for sample_name in sample_names:
    if sample_name.startswith("."):
        # Ignore hidden entries, such as the macOS ".DS_Store" file
        continue

    sample_dir = os.path.join(dataset_dir, sample_name)

    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")

    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)

    # Make sure there are no duplicate variable names (var_names)
    single_adata.var_names_make_unique()

    adata_list.append(single_adata)

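# Note: ad.concat defaults to join="inner", so only genes present in every sample are kept.
# Columns of .var (such as gene_names) are dropped unless merge is set, e.g. merge="same".
adata = ad.concat(adata_list)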

The AnnData object can then be saved to disk like this:

adata.write_h5ad("<path>")

Coming from R

RDS files (*.rds, *.RDS) are supported by the pipeline. The conversion to h5ad is done via the R SingleCellExperiment type, so it is essential that the Seurat as.SingleCellExperiment function can be applied to the object. Known working types are:

  • SingleCellExperiment
  • Seurat object

Other types might be supported too.

Performing quality control

We recommend the following guides:

Keep in mind that doublet detection and ambient RNA removal are handled by scRAFIKI. You may normalize the expression data for intermediate steps of your own quality control, but the input to scRAFIKI needs to be raw counts.

Formatting the AnnData object

Counts

adata.X should contain raw counts. Storing it as a scipy sparse matrix is recommended.
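A quick sanity check can catch a matrix that has already been normalized. The following is a minimal sketch, not part of scRAFIKI: the helper names `looks_like_raw_counts` and `to_sparse_csr` are hypothetical, and the check assumes that raw counts are non-negative whole numbers.

```python
import numpy as np
from scipy import sparse


def looks_like_raw_counts(X) -> bool:
    """Return True if every non-zero value is a non-negative whole number."""
    # Converting to CSR works for dense and sparse input; only non-zero
    # entries are stored, and zeros are valid counts anyway.
    values = sparse.csr_matrix(X).data
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


def to_sparse_csr(X):
    """Convert a dense or sparse matrix to scipy CSR format."""
    return sparse.csr_matrix(X)


# Toy matrices standing in for adata.X
raw = np.array([[0, 3, 1], [2, 0, 0]])
normalized = np.log1p(raw)

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

On a real object this could be applied as `looks_like_raw_counts(adata.X)`, followed by `adata.X = to_sparse_csr(adata.X)` if the matrix is dense.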

Metadata

In order to make the dataset usable with scRAFIKI, the following metadata needs to be included:

| Field | Description | Axis | Default |
|---|---|---|---|
| batch | Batch identifier, for integration | obs | required |
| cell_type | Cell-type annotation | obs | Unknown |
| condition | The condition of the tissue sample | obs | Unknown |
| sex | The sex of the patient (female or male) | obs | Unknown |
| patient | The patient identifier | obs | required |
| tissue | The tissue type | obs | required |
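The fields listed above can be checked before handing the dataset over. This is a minimal sketch operating on a pandas DataFrame such as `adata.obs`; the function name `prepare_obs` is hypothetical, and it assumes that "Unknown" in the Default column means the field may be omitted or left empty.

```python
import pandas as pd

# Field names as listed in the metadata table.
REQUIRED_COLUMNS = ["batch", "patient", "tissue"]
OPTIONAL_COLUMNS = ["cell_type", "condition", "sex"]


def prepare_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Check the required columns and fill optional ones with "Unknown"."""
    missing = [col for col in REQUIRED_COLUMNS if col not in obs.columns]
    if missing:
        raise ValueError(f"Missing required obs columns: {missing}")

    obs = obs.copy()
    for col in OPTIONAL_COLUMNS:
        if col not in obs.columns:
            obs[col] = "Unknown"
        else:
            # Cast to object first so categorical columns accept the new value
            obs[col] = obs[col].astype(object).fillna("Unknown")
    return obs


# Toy table standing in for adata.obs
obs = pd.DataFrame(
    {
        "batch": ["b1", "b1"],
        "patient": ["p1", "p2"],
        "tissue": ["lung", "lung"],
        "sex": ["female", None],
    },
    index=["cell1", "cell2"],
)
prepared = prepare_obs(obs)
print(prepared)
```

With a real dataset the result could be assigned back with `adata.obs = prepare_obs(adata.obs)`.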

Some additional information:

  • Patient and batch identifiers will be prepended with the dataset identifier by scRAFIKI to ensure uniqueness.
  • Only alphanumeric characters and underscores are allowed for all metadata fields.
  • The first character must be a letter.
  • All additional metadata columns will be discarded by scRAFIKI.
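The naming rules above can be checked with a regular expression. This is a minimal sketch, assuming that the rules apply to the values stored in each metadata column and that "alphanumeric" means ASCII letters and digits; the names `invalid_values` and `sanitize` are hypothetical, and the "x" prefix used for repair is an arbitrary choice.

```python
import re

# One letter followed by letters, digits, or underscores.
VALID_VALUE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def invalid_values(values) -> list[str]:
    """Return the distinct values that break the naming rules, sorted."""
    return sorted({str(v) for v in values if not VALID_VALUE.fullmatch(str(v))})


def sanitize(value) -> str:
    """Replace disallowed characters with underscores; prefix a letter if needed."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", str(value))
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "x" + cleaned
    return cleaned


print(invalid_values(["lung", "T cell", "10x", "CD8_T"]))
print(sanitize("T cell"))
```

For a real dataset, `invalid_values(adata.obs["cell_type"])` would list the offending labels. Note that sanitizing can map two distinct labels to the same string, so the result is worth reviewing before overwriting a column.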

Batch annotation

Batch annotation plays an important role since it will limit the corrections performed on the datasets. Given a dataset, a batch is defined as the group of samples that:

  • Come from the same study
  • Are prepared in the same way
  • Are generated with the same sequencing protocol

Given these criteria, a dataset containing, for example, samples from healthy individuals, samples from tumor tissue, and TIL-enriched tumor samples will need to be divided into three batches.
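The grouping rule can be sketched as follows: samples that share the same study, preparation, and sequencing protocol receive the same batch label. The `Sample` record, its attribute names, the example values, and the `batch<N>` label scheme are all illustrative assumptions, not something scRAFIKI prescribes.

```python
from collections import namedtuple

# Hypothetical per-sample description; attribute names are illustrative.
Sample = namedtuple("Sample", ["name", "study", "preparation", "protocol"])


def assign_batches(samples) -> dict:
    """Map each sample name to a batch label.

    Samples with identical (study, preparation, protocol) share a label.
    """
    batch_ids = {}
    assignment = {}
    for s in samples:
        key = (s.study, s.preparation, s.protocol)
        if key not in batch_ids:
            # Labels start with a letter to satisfy the metadata naming rules
            batch_ids[key] = f"batch{len(batch_ids) + 1}"
        assignment[s.name] = batch_ids[key]
    return assignment


samples = [
    Sample("s1", "studyA", "healthy_tissue", "protocolX"),
    Sample("s2", "studyA", "tumor_tissue", "protocolX"),
    Sample("s3", "studyA", "TIL_enriched_tumor", "protocolX"),
    Sample("s4", "studyA", "tumor_tissue", "protocolX"),
]
batches = assign_batches(samples)
print(batches)
```

If the per-cell sample name is stored in `adata.obs["sample"]`, as done by `load_mtx` above, the mapping could be applied with `adata.obs["batch"] = adata.obs["sample"].map(batches)`.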

Other fields

The following fields will be ignored by scRAFIKI:

  • adata.raw
  • adata.uns
  • adata.obsm
  • adata.varm
  • adata.obsp
  • adata.layers

Additionally, only the index of adata.var will be used. If possible, use gene symbols. If no gene symbols are easily available, make sure to set the no_symbols parameter to true for the specific dataset.
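To decide whether no_symbols is needed, it can help to check what the index of adata.var actually contains. The following is a heuristic sketch, not part of scRAFIKI: it assumes that non-symbol identifiers are Ensembl gene IDs (an "ENS" prefix, an optional species code, "G", and 11 digits, optionally followed by a version suffix), and the function name `fraction_ensembl_ids` is hypothetical.

```python
import re

# Heuristic pattern for Ensembl gene IDs such as "ENSG00000141510"
# or "ENSMUSG00000059552", with an optional ".<version>" suffix.
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")


def fraction_ensembl_ids(var_names) -> float:
    """Return the fraction of names that look like Ensembl gene IDs."""
    names = list(var_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_GENE_ID.fullmatch(str(name)))
    return hits / len(names)


print(fraction_ensembl_ids(["TP53", "BRCA1", "ENSG00000141510", "CD8A"]))
```

A value close to 1 for `fraction_ensembl_ids(adata.var_names)` suggests that the index holds identifiers rather than symbols. Note that the `load_mtx` function above uses the gene_ids column as the index; if the gene_names column holds symbols, it could be used as the index instead, otherwise set no_symbols for that dataset.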