Dataset preparation

Converting various data formats to AnnData (h5ad)

MTX format

This section assumes a directory structure like this:

<dataset dir>
├── sample1
│   ├── matrix.mtx
│   ├── barcodes.tsv
│   └── features.tsv
├── sample2
│   ├── matrix.mtx
│   ├── barcodes.tsv
│   └── features.tsv
└── ...

If your directory structure differs (e.g. different file names), please adjust the code accordingly.

This format can be converted to AnnData using the following function:

import anndata as ad
import pandas as pd
import scanpy as sc
import os

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # 10x-style MTX files store genes x cells, so transpose to cells x genes
    adata = sc.read_mtx(mtx_path).transpose()

    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    # Read only the first two columns, since some features.tsv files carry a third (feature type) column
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])

    if sample_name is not None:
        # Prefix the barcodes with the sample name so they stay unique across samples
        adata.obs_names = (sample_name + "_" + barcodes['barcodes']).values
        adata.obs['sample'] = sample_name
    else:
        adata.obs_names = barcodes['barcodes'].values

    adata.var_names = features['gene_ids'].values
    adata.var['gene_names'] = features['gene_names'].values

    return adata

If you have multiple samples, you can iterate over them like this:

dataset_dir = "<dataset dir>"  # Replace with the path to your dataset directory
sample_names = sorted(os.listdir(dataset_dir))

adata_list = []

for sample_name in sample_names:
    if sample_name.startswith("."):
        # Ignore hidden entries, such as the macOS ".DS_Store" file
        continue

    sample_dir = os.path.join(dataset_dir, sample_name)

    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")

    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)

    # Make sure there are no duplicate entries in var_names
    single_adata.var_names_make_unique()

    adata_list.append(single_adata)

adata = ad.concat(adata_list)

The AnnData object can then be saved to disk like this:

adata.write_h5ad("<path>")

Coming from R

RDS files (*.rds, *.RDS) are supported by the pipeline. The conversion to h5ad is done via the R SingleCellExperiment type, so it is essential that the Seurat as.SingleCellExperiment function can be applied to the object. Known working types are:

SingleCellExperiment
Seurat object

Other types might be supported too.

Performing quality control

We recommend the following guides:

Keep in mind that doublet detection and ambient RNA removal are handled by scRAFIKI. You can normalize the expression data for intermediate steps of your own quality control, but the input to scRAFIKI must be raw counts.

Formatting the anndata object

Counts

adata.X should contain raw counts. Storing it as a scipy sparse matrix is recommended.
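As a quick sanity check for the requirement above, the following sketch converts a matrix to a scipy CSR matrix and rejects values that cannot be raw counts. The helper name `as_raw_count_matrix` is hypothetical and not part of scRAFIKI. The check only tests for non-negative integer values, so it can flag normalized data but cannot prove that the counts are truly raw.

```python
import numpy as np
from scipy import sparse


def as_raw_count_matrix(X) -> sparse.csr_matrix:
    """Return X as a CSR matrix, raising if it cannot contain raw counts.

    Hypothetical helper: accepts a dense array or any scipy sparse matrix.
    """
    X = sparse.csr_matrix(X)
    values = X.data  # explicitly stored entries only
    if values.size > 0:
        if values.min() < 0:
            raise ValueError("negative values found; expected raw counts")
        # Normalized or log-transformed data usually has fractional values
        if np.any(values != np.round(values)):
            raise ValueError("non-integer values found; expected raw counts")
    return X
```

With an AnnData object this could be applied as `adata.X = as_raw_count_matrix(adata.X)` before writing the h5ad file.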

Metadata

In order to make the dataset usable with scRAFIKI, the following metadata needs to be included:

Field	Description	Axis	Default
`batch`	Batch identifier, for integration	obs	required
`cell_type`	Cell-type annotation	obs	`Unknown`
`condition`	The condition of the tissue sample	obs	`Unknown`
`sex`	The sex of the patient (`female` or `male`)	obs	`Unknown`
`patient`	The patient identifier	obs	required
`tissue`	The tissue type	obs	required
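The table can be turned into a small check on `adata.obs`. The sketch below is an illustration, not scRAFIKI code: the function name `prepare_obs` is hypothetical, and it assumes that you want to fill the optional columns explicitly rather than rely on scRAFIKI to apply the defaults.

```python
import pandas as pd

# Column names and defaults taken from the table above
REQUIRED = ("batch", "patient", "tissue")
DEFAULTS = {"cell_type": "Unknown", "condition": "Unknown", "sex": "Unknown"}


def prepare_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Check required columns and fill optional ones (hypothetical helper)."""
    missing = [col for col in REQUIRED if col not in obs.columns]
    if missing:
        raise KeyError(f"missing required obs columns: {missing}")

    out = obs.copy()
    for col, default in DEFAULTS.items():
        if col not in out.columns:
            out[col] = default
        else:
            # Cast to object first so categorical columns accept the default
            values = out[col].astype(object)
            out[col] = values.where(values.notna(), default)
    return out
```

A typical call would be `adata.obs = prepare_obs(adata.obs)`.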

Some additional information:

Patient and batch identifiers will be prepended with the dataset identifier by scRAFIKI to ensure uniqueness.
Only alphanumeric characters and underscores are allowed for all metadata fields.
The first character must be a letter.
All additional metadata columns will be discarded by scRAFIKI.
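The character rules above can be expressed as a regular expression. The sketch below reads them as applying to the values stored in the metadata columns, which is an assumption. The names `is_valid_label` and `sanitize_label` are hypothetical, and the replacement strategy (underscores for invalid characters, a letter prefix when the first character is not a letter) is only one possible choice.

```python
import re

# A letter first, then letters, digits, or underscores
VALID_LABEL = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_label(value: str) -> bool:
    """Return True if value satisfies the character rules."""
    return VALID_LABEL.fullmatch(value) is not None


def sanitize_label(value: str, prefix: str = "x") -> str:
    """Rewrite value so that it satisfies the rules (assumed strategy)."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", str(value))
    if not cleaned or not cleaned[0].isalpha():
        # Labels must start with a letter, so prepend one
        cleaned = prefix + cleaned
    return cleaned
```

For example, `adata.obs["cell_type"].map(sanitize_label)` would rewrite a whole column. Check the result afterwards, because two different labels can collapse into the same sanitized label.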

Batch annotation

Batch annotation plays an important role since it will limit the corrections performed on the datasets. Given a dataset, a batch is defined as the group of samples that:

Come from the same study
Are prepared in the same way
Are generated with the same sequencing protocol

Given these criteria, a dataset containing, for example, samples from healthy individuals, samples from tumor tissue, and TIL-enriched tumor samples needs to be divided into three batches.
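If the three criteria are available as columns in `adata.obs`, batch labels can be derived by grouping on them. This is a minimal sketch: the column names `study`, `preparation`, and `protocol` and the function `assign_batches` are hypothetical, so substitute whatever columns describe these properties in your dataset.

```python
import pandas as pd


def assign_batches(obs: pd.DataFrame, keys: list[str]) -> pd.Series:
    """Give each unique combination of the key columns its own batch label.

    The key columns should not contain missing values.
    """
    group_ids = obs.groupby(keys, sort=True).ngroup()
    # Prefix with letters so the label starts with a letter
    return "batch_" + group_ids.astype(str)


# Toy example mirroring the three sample types described above
obs = pd.DataFrame({
    "study": ["s1", "s1", "s1", "s1"],
    "preparation": ["healthy", "tumor", "til_enriched", "tumor"],
    "protocol": ["p1", "p1", "p1", "p1"],
})
obs["batch"] = assign_batches(obs, ["study", "preparation", "protocol"])
```

In this toy example the two `tumor` rows share one batch, and the dataset ends up with three batches in total.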

Other fields

The following fields will be ignored by scRAFIKI:

adata.raw
adata.uns
adata.obsm
adata.varm
adata.obsp
adata.layers

Additionally, only the index of adata.var will be used. If possible, use gene symbols. If no gene symbols are easily available, make sure to set the no_symbols parameter to true for the specific dataset.
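If the gene symbols are stored in a column of `adata.var` (for example `gene_names`, as produced by the `load_mtx` function above), they can be moved into the index. The helper `symbols_index` below is a hypothetical sketch that uses pandas only. It makes repeated symbols unique by appending a numeric suffix, which is an assumed convention.

```python
import pandas as pd


def symbols_index(var: pd.DataFrame, symbol_col: str = "gene_names") -> pd.Index:
    """Build a unique index of gene symbols from a column of var."""
    symbols = var[symbol_col].astype(str)
    # 0 for the first occurrence of a symbol, 1 for the second, and so on
    dup_rank = symbols.groupby(symbols).cumcount()
    unique = symbols.where(dup_rank == 0, symbols + "-" + dup_rank.astype(str))
    return pd.Index(unique.to_numpy())
```

It could be used as `adata.var_names = symbols_index(adata.var)`. The suffixing does not guard against a suffixed name colliding with a symbol that already exists, so calling `adata.var_names_make_unique()` afterwards is a reasonable safeguard.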

If anything remains unclear, feel free to open a GitHub issue.
