Dataset preparation
Nico Trummer edited this page Jan 24, 2024 · 33 revisions
This section assumes that you have the following files (names can differ):
- `matrix.mtx`: containing the gene expression matrix
- `features.tsv`: containing the gene IDs and gene names
- `barcodes.tsv`: containing the barcodes of the cells
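To make the expected layout concrete, here is a minimal sketch that writes a tiny toy dataset in this format and reads the matrix back. All gene IDs, gene names, barcodes and counts are made up for illustration. The sketch assumes the matrix is stored as genes × cells (rows × columns) and that both TSV files have no header row.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import io, sparse

# Toy counts: 3 genes (rows) x 2 cells (columns); values are made up.
counts = sparse.csr_matrix(np.array([[1, 0], [0, 2], [3, 4]]))
features = pd.DataFrame(
    {"gene_ids": ["ENSG01", "ENSG02", "ENSG03"], "gene_names": ["GeneA", "GeneB", "GeneC"]}
)
barcodes = pd.DataFrame({"barcodes": ["AAAC-1", "AAAG-1"]})

# Write the three files into a temporary sample directory.
sample_dir = tempfile.mkdtemp()
io.mmwrite(os.path.join(sample_dir, "matrix.mtx"), counts)
features.to_csv(os.path.join(sample_dir, "features.tsv"), sep="\t", header=False, index=False)
barcodes.to_csv(os.path.join(sample_dir, "barcodes.tsv"), sep="\t", header=False, index=False)

# Read the matrix back: one row per gene, one column per cell.
matrix = io.mmread(os.path.join(sample_dir, "matrix.mtx"))
n_genes, n_cells = matrix.shape
print(n_genes, n_cells)
```

The number of matrix rows should equal the number of lines in `features.tsv`, and the number of columns should equal the number of lines in `barcodes.tsv`.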
This format can be converted to AnnData using the following function:
```python
import anndata as ad
import pandas as pd
import scanpy as sc


def load_mtx(mtx_path: str, barcodes_path: str, features_path: str, sample_name: str | None = None) -> ad.AnnData:
    # Transpose after reading so that rows are cells and columns are genes
    adata = sc.read_mtx(mtx_path).transpose()

    barcodes = pd.read_csv(barcodes_path, header=None, sep='\t', names=['barcodes'])
    features = pd.read_csv(features_path, header=None, sep='\t', names=['gene_ids', 'gene_names'])

    # Prefix the barcodes with the sample name so they stay unique across samples
    if sample_name is not None:
        adata.obs_names = (sample_name + "_" + barcodes['barcodes']).values
        adata.obs['sample'] = sample_name
    else:
        adata.obs_names = barcodes['barcodes'].values

    # Use the gene IDs as variable names and keep the gene names as an annotation
    adata.var_names = features['gene_ids'].values
    adata.var['gene_names'] = features['gene_names'].values
    return adata
```
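If the matrix and the annotation files do not line up, assigning the names will fail. A small check run beforehand can give a clearer error message. The helper below is a hypothetical sketch (`check_dimensions` is not part of scanpy or anndata) and assumes the matrix has already been transposed to cells × genes.

```python
import pandas as pd
from scipy import sparse


def check_dimensions(matrix, barcodes: pd.DataFrame, features: pd.DataFrame) -> None:
    """Raise a ValueError if a cells x genes matrix does not match the annotation tables."""
    n_cells, n_genes = matrix.shape
    if n_cells != len(barcodes):
        raise ValueError(f"matrix has {n_cells} cells, but the barcodes table has {len(barcodes)} rows")
    if n_genes != len(features):
        raise ValueError(f"matrix has {n_genes} genes, but the features table has {len(features)} rows")


# Toy example with made-up values: 2 cells x 3 genes.
matrix = sparse.csr_matrix((2, 3))
barcodes = pd.DataFrame({"barcodes": ["AAAC-1", "AAAG-1"]})
features = pd.DataFrame(
    {"gene_ids": ["ENSG01", "ENSG02", "ENSG03"], "gene_names": ["GeneA", "GeneB", "GeneC"]}
)
check_dimensions(matrix, barcodes, features)  # passes silently
```

A mismatch is often caused by a header row in one of the TSV files or by a matrix that was not transposed.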
The function can be used like this:
```python
import os

import anndata as ad
from tqdm import tqdm

data_path = "<dataset dir>"  # directory containing one subdirectory per sample
sample_names = os.listdir(data_path)

adata_list = []
# To test on a subset first, iterate over a slice such as sample_names[:2]
for sample_name in tqdm(sample_names):
    sample_dir = os.path.join(data_path, sample_name)
    matrix_path = os.path.join(sample_dir, "matrix.mtx")
    barcodes_path = os.path.join(sample_dir, "barcodes.tsv")
    features_path = os.path.join(sample_dir, "features.tsv")
    single_adata: ad.AnnData = load_mtx(matrix_path, barcodes_path, features_path, sample_name)
    # Make sure there are no duplicate variable names
    single_adata.var_names_make_unique()
    adata_list.append(single_adata)
```
You might need to concatenate the individual AnnData objects afterwards:
```python
adata = ad.concat(adata_list, join="outer")
```
Metadata needs to be annotated following the guidelines we provided in the SIMBA ReadMe: Prepare data.
You can find the guidelines for more refined metadata annotation and the batch annotation here.
If anything remains unclear, feel free to open a GitHub issue.