Dataset preparation
This section assumes a directory structure like this:
<dataset dir>
├── sample1
│ ├── matrix.mtx
│ ├── barcodes.tsv
│ └── features.tsv
├── sample2
│ ├── matrix.mtx
│ ├── barcodes.tsv
│ └── features.tsv
└── ...
If your directory structure differs (e.g. different file names), please adjust the code accordingly.
This format can be converted to AnnData using the following function:
import anndata as ad
import pandas as pd
import scanpy as sc
import os

def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # The matrix is stored as genes x cells; transpose it to the cells x genes layout of AnnData
    adata = sc.read_mtx(mtx_path).transpose()
    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    # Read only the first two columns (gene ID, gene name); some files carry additional columns
    features = pd.read_csv(features_path, header=None, sep='\t', usecols=[0, 1], names=['gene_ids', 'gene_names'])
    # Prefix the barcodes with the sample name so that cell names stay unique across samples
    obs_names = sample_name + "_" + barcodes['barcodes'] if sample_name is not None else barcodes['barcodes']
    adata.obs_names = obs_names.values
    adata.var_names = features['gene_ids'].values
    adata.var['gene_names'] = features['gene_names'].values
    if sample_name is not None:
        adata.obs['sample'] = sample_name
    return adata
If you have multiple samples, you can iterate over them like this:
dataset_dir = "<dataset dir>"
sample_names = os.listdir(dataset_dir)
adata_list = []
for sample_name in sample_names:
    if sample_name.startswith("."):
        # Ignore hidden entries, such as the macOS ".DS_Store" file
        continue
    sample_dir = os.path.join(dataset_dir, sample_name)
    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")
    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)
    # Make sure there are no duplicate gene identifiers
    single_adata.var_names_make_unique()
    adata_list.append(single_adata)
# Concatenate all samples; by default, only genes present in every sample are kept
adata = ad.concat(adata_list)
The anndata object can then be saved to disk like this:
adata.write_h5ad("<path>")
RDS files (*.rds, *.RDS) are supported by the pipeline. The conversion to h5ad is done via the R SingleCellExperiment type, so it is essential that the Seurat as.SingleCellExperiment function can be applied to the object. Known working types are:
- SingleCellExperiment
- Seurat object
Other types might be supported too.
We recommend the following guides:
Keep in mind that doublet detection and ambient RNA removal are handled by scRAFIKI. You may normalize the expression data for your own intermediate analyses, but the input to scRAFIKI must be raw counts.
adata.X should contain raw counts. It is also recommended to store it in a scipy sparse matrix format.
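A quick sanity check can catch accidentally normalized input before it reaches the pipeline. The following is a minimal sketch, not part of scRAFIKI: the helper names `looks_like_raw_counts` and `to_sparse_counts` are hypothetical, and the check is only a heuristic (non-negative, integer-valued entries), so it cannot prove that a matrix really holds raw counts.

```python
import numpy as np
from scipy import sparse


def looks_like_raw_counts(X) -> bool:
    """Heuristic (assumption): raw counts are non-negative and integer-valued."""
    # For sparse input only the stored values need to be inspected
    values = X.tocsr().data if sparse.issparse(X) else np.asarray(X).ravel()
    if values.size == 0:
        return True
    return bool(values.min() >= 0 and np.all(values == np.round(values)))


def to_sparse_counts(X) -> sparse.csr_matrix:
    """Return X as a CSR matrix, refusing input that fails the heuristic."""
    if not looks_like_raw_counts(X):
        raise ValueError("Matrix contains negative or non-integer values")
    return sparse.csr_matrix(X)


# Toy matrices standing in for adata.X
raw = np.array([[0, 3, 1], [2, 0, 0]])
normalized = np.log1p(raw)

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

On an AnnData object this could be applied as `adata.X = to_sparse_counts(adata.X)`.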
In order to make the dataset usable with scRAFIKI, the following metadata needs to be included:
Field | Description | Axis | Default |
---|---|---|---|
batch | Batch identifier, for integration | obs | required |
cell_type | Cell-type annotation | obs | Unknown |
condition | The condition of the tissue sample | obs | Unknown |
sex | The sex of the patient (female or male) | obs | Unknown |
patient | The patient identifier | obs | required |
tissue | The tissue type | obs | required |
Some additional information:
- Patient and batch identifiers will be prepended with the dataset identifier by scRAFIKI to ensure uniqueness.
- Only alphanumeric characters and underscores are allowed for all metadata fields.
- The first character must be a letter.
- All additional metadata columns will be discarded by scRAFIKI.
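The metadata requirements above can be checked on adata.obs before handing a dataset to the pipeline. This is a minimal sketch under stated assumptions: `prepare_obs` is a hypothetical helper, not a scRAFIKI function, and it assumes that the character rule applies to the values stored in each field and that missing optional values may be filled with "Unknown".

```python
import re

import pandas as pd

REQUIRED = ["batch", "patient", "tissue"]
OPTIONAL = ["cell_type", "condition", "sex"]  # filled with "Unknown" when absent
# Assumed reading of the naming rule: a letter, then letters, digits or underscores
VALID_VALUE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def prepare_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Return a validated copy of obs restricted to the documented fields."""
    missing = [col for col in REQUIRED if col not in obs.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    out = obs.copy()
    for col in OPTIONAL:
        if col not in out.columns:
            out[col] = "Unknown"
    # Keep only the documented fields; other columns would be discarded anyway
    out = out[REQUIRED + OPTIONAL].astype(object)

    if out[REQUIRED].isna().any().any():
        raise ValueError("Required columns must not contain missing values")
    out = out.where(out.notna(), "Unknown").astype(str)

    for col in out.columns:
        bad = sorted({v for v in out[col] if not VALID_VALUE.fullmatch(v)})
        if bad:
            raise ValueError(f"Invalid values in '{col}': {bad}")
    bad_sex = sorted(set(out["sex"]) - {"female", "male", "Unknown"})
    if bad_sex:
        raise ValueError(f"Unexpected values in 'sex': {bad_sex}")
    return out
```

With an AnnData object, `adata.obs = prepare_obs(adata.obs)` would apply the check and drop the extra columns in one step. Note that purely numeric identifiers such as `1` fail the first-character rule and need a letter prefix.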
Batch annotation plays an important role since it will limit the corrections performed on the datasets. Given a dataset, a batch is defined as the group of samples that:
- Come from the same study
- Are prepared in the same way
- Are generated with the same sequencing protocol
Given this definition, a dataset containing, for example, samples from healthy individuals, samples from tumor tissue, and TIL-enriched tumor samples needs to be divided into three batches.
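One way to derive such a batch label is to combine the three criteria into a single string per sample. The sketch below assumes that the per-sample annotation already contains the hypothetical columns `study`, `preparation` and `protocol`; these names and the toy values are illustrative only and do not come from scRAFIKI.

```python
import pandas as pd


def assign_batch(obs: pd.DataFrame, keys: list[str]) -> pd.Series:
    """Join the given columns row-wise into one batch identifier."""
    return obs[keys].astype(str).agg("_".join, axis=1)


# Toy annotation: one study, one protocol, three kinds of sample preparation
obs = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "study": ["studyA"] * 4,
    "preparation": ["healthy_tissue", "tumor_tissue", "tumor_tissue", "til_enriched_tumor"],
    "protocol": ["10x_3p"] * 4,
})
obs["batch"] = assign_batch(obs, ["study", "preparation", "protocol"])
print(obs[["sample", "batch"]])
```

Samples s2 and s3 share all three criteria and therefore end up in the same batch, which gives three batches in total for this toy example.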
The following fields will be ignored by scRAFIKI:
- adata.raw
- adata.uns
- adata.obsm
- adata.varm
- adata.obsp
- adata.layers
Additionally, only the index of adata.var will be used. If possible, use gene symbols. If gene symbols are not easily available, make sure to set the no_symbols parameter to true for the specific dataset.
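If gene symbols are stored in a column of adata.var, for example the gene_names column created by load_mtx above, they can be moved into the index. The following is a minimal sketch: `symbol_index` is a hypothetical helper, the fallback to the existing identifier for genes without a symbol is an assumption, and the simple `-1`, `-2` suffixes do not guard against collisions with names that already end in such a suffix.

```python
import pandas as pd


def symbol_index(var: pd.DataFrame, symbol_col: str = "gene_names") -> pd.Index:
    """Build a unique index of gene symbols, falling back to the current index."""
    symbols = var[symbol_col].astype(object).to_numpy()
    ids = var.index.astype(str).to_numpy()
    # Use the symbol where one exists, otherwise keep the existing identifier
    names = pd.Series([s if isinstance(s, str) and s else i for s, i in zip(symbols, ids)])
    # Number repeated names: the first occurrence stays unchanged, later ones get -1, -2, ...
    dup_rank = names.groupby(names).cumcount()
    names = names + dup_rank.map(lambda n: f"-{n}" if n else "")
    return pd.Index(names)


# Toy table standing in for adata.var
var = pd.DataFrame(
    {"gene_names": ["TP53", None, "ACTB", "TP53"]},
    index=["ENSG1", "ENSG2", "ENSG3", "ENSG4"],
)
print(list(symbol_index(var)))
```

On an AnnData object the result could be assigned with `adata.var_names = symbol_index(adata.var)`. When the names are already symbols and only duplicates are a concern, `adata.var_names_make_unique()` covers the deduplication step.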
If anything remains unclear, feel free to open a GitHub issue.