Data input, cleaning and pre-processing

This is the first step of any network analysis. We show here how to load typical expression data, pre-process them into a format suitable for network analysis, and clean the data by removing obvious outlier samples and genes.

Data Input
- AnnData format
- Separate matrices
Data cleaning and pre-processing

Input data format

We store raw expression data along with gene and sample information in AnnData format in the geneExpr variable. Gene expression data, gene metadata, and sample metadata can either be passed to PyWGCNA all together in an AnnData object, or separately as a series of matrices.

AnnData format

If you already have your expression data in AnnData format, you can define your PyWGCNA object by passing that AnnData variable directly. Keep in mind that AnnData.X should be the expression matrix, AnnData.var should contain information about each gene, and AnnData.obs should contain information about each sample. You can read more about the format in the AnnData documentation.

Separate matrices for gene expression, sample metadata, and gene metadata

The user can pass individual file paths for gene expression, sample metadata, and gene metadata, in the formats specified below.

Gene expression

The expression table should be formatted such that the rows correspond to samples and the columns correspond to genes. The first column should contain the sample id or sample name. The remaining columns should be headed by gene ids or gene names, all of which must be unique.

sample_id	ENSMUSG00000000003	ENSMUSG00000000028	ENSMUSG00000000031	ENSMUSG00000000037
sample_11615	12.04	11.56	16.06	13.18
sample_11616	1.35	1.63	1.28	1
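
As a minimal sketch of the layout described above, the following plain-Python snippet reads a tab-separated samples-by-genes table and checks that the gene ids are unique. It does not use PyWGCNA; `read_expression` and the `RAW` excerpt are hypothetical names introduced for illustration, and the tab delimiter is an assumption taken from the example table.

```python
import csv
import io

# Hypothetical excerpt in the same layout as the table above (tab-separated).
RAW = (
    "sample_id\tENSMUSG00000000003\tENSMUSG00000000028\n"
    "sample_11615\t12.04\t11.56\n"
    "sample_11616\t1.35\t1.63\n"
)


def read_expression(handle, delimiter="\t"):
    """Return (sample_ids, gene_ids, matrix) from a samples-by-genes table."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    gene_ids = header[1:]  # the first column holds the sample id
    if len(set(gene_ids)) != len(gene_ids):
        raise ValueError("gene ids in the header must be unique")

    sample_ids, matrix = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) != len(header):
            raise ValueError(f"row {row[0]!r} does not match the header length")
        sample_ids.append(row[0])
        matrix.append([float(value) for value in row[1:]])
    return sample_ids, gene_ids, matrix


sample_ids, gene_ids, matrix = read_expression(io.StringIO(RAW))
```

Here `matrix[i][j]` is the expression of gene `gene_ids[j]` in sample `sample_ids[i]`, which mirrors the rows-are-samples convention required for the expression table.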

Gene metadata

The gene metadata is a table which contains additional information about each gene, such as gene biotype or gene length. Each row should represent a gene and each column should represent a gene feature, where the first column contains the same gene identifier that was used in the gene expression matrix. The rows should be in the same order as the columns of the gene expression matrix, or the user can specify order=False.

gene_id	gene_name	gene_type
ENSMUSG00000000003	Pbsn	protein_coding
ENSMUSG00000000028	Cdc45	protein_coding
ENSMUSG00000000031	H19	lncRNA
ENSMUSG00000000037	Scml2	protein_coding

Sample metadata

The sample metadata is a table which contains additional information about each sample, such as timepoint or genotype. Each row should represent a sample and each column should represent a metadata feature, where the first column contains the same sample identifier that was used in the gene expression matrix. The rows should be in the same order as the rows of the gene expression matrix, or the user can specify order=False.

Sample_id	Age	Tissue	Sex	Genotype
sample_11615	4mon	Cortex	Female	5xFADHEMI
sample_11616	4mon	Cortex	Female	5xFADWT
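
The ordering requirement for the gene and sample metadata tables can be checked before loading. The sketch below is plain Python and independent of how PyWGCNA handles order=False internally; `align_metadata` is a hypothetical helper, and the metadata rows are assumed to have been read into dictionaries already.

```python
def align_metadata(ids_in_matrix, metadata_rows, id_key):
    """Return metadata_rows reordered to follow ids_in_matrix.

    ids_in_matrix: sample ids (matrix rows) or gene ids (matrix columns).
    metadata_rows: list of dicts, one per sample or gene.
    id_key: name of the identifier column in the metadata table.
    """
    by_id = {row[id_key]: row for row in metadata_rows}
    if len(by_id) != len(metadata_rows):
        raise ValueError("metadata contains duplicate identifiers")
    missing = [i for i in ids_in_matrix if i not in by_id]
    if missing:
        raise KeyError(f"no metadata found for: {missing}")
    return [by_id[i] for i in ids_in_matrix]


# Sample ids in the order they appear in the expression matrix.
sample_ids = ["sample_11615", "sample_11616"]

# Metadata rows deliberately listed in a different order.
sample_meta = [
    {"Sample_id": "sample_11616", "Genotype": "5xFADWT"},
    {"Sample_id": "sample_11615", "Genotype": "5xFADHEMI"},
]

aligned = align_metadata(sample_ids, sample_meta, "Sample_id")
```

The same helper applies to gene metadata by passing the gene ids from the expression header and `"gene_id"` as the key.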

Other parameters

These are other parameters that can be specified.

name: Name of the WGCNA object, used when visualizing data (default: WGCNA)
save: Whether to save the results of important steps (to set it to True, you need write access to the output directory)
outputPath: Where to save your data; if not given, results are stored in the same directory as the code
TPMcutoff: TPM cutoff for removing genes (default: 1)
networkType: Type of network to generate ({unsigned, signed, signed hybrid}, default: signed hybrid)
adjacencyType: Type of adjacency matrix to use ({unsigned, signed, signed hybrid}, default: signed hybrid)
TOMType: Type of topological overlap matrix (TOM) to use ({unsigned, signed}, default: signed)

For in-depth documentation on these parameters, see the PyWGCNA API documentation.

Data cleaning and preprocessing

PyWGCNA can clean the input data according to the following criteria:

Remove genes whose expression never exceeds the TPMcutoff value (default: 1) in any sample.
Use the goodSamplesGenes() function to find genes and samples with too many missing values.
Cluster the samples (using hierarchical clustering from scipy) to see if there are any obvious outliers. The user can set the height at which the sample tree is cut by specifying the cut value. By default, no samples are removed by hierarchical clustering.
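
The first two criteria above can be sketched in plain Python as follows. This is an illustration of the idea, not PyWGCNA's implementation: `filter_by_tpm`, `flag_missing`, and the `max_missing` threshold of 0.5 are hypothetical names and values chosen for the example, and the hierarchical clustering step is not shown.

```python
import math


def filter_by_tpm(gene_ids, matrix, tpm_cutoff=1.0):
    """Keep genes whose expression exceeds tpm_cutoff in at least one sample."""
    keep = [
        j for j in range(len(gene_ids))
        if any(row[j] > tpm_cutoff for row in matrix)  # NaN > x is False
    ]
    kept_ids = [gene_ids[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_ids, kept_matrix


def missing_fraction(values):
    """Fraction of entries that are NaN."""
    return sum(math.isnan(v) for v in values) / len(values)


def flag_missing(sample_ids, gene_ids, matrix, max_missing=0.5):
    """Return samples and genes whose missing fraction exceeds max_missing."""
    bad_samples = [
        s for s, row in zip(sample_ids, matrix)
        if missing_fraction(row) > max_missing
    ]
    bad_genes = [
        g for j, g in enumerate(gene_ids)
        if missing_fraction([row[j] for row in matrix]) > max_missing
    ]
    return bad_samples, bad_genes


# Toy samples-by-genes matrix; NaN marks a missing value.
nan = float("nan")
sample_ids = ["s1", "s2", "s3"]
gene_ids = ["geneA", "geneB", "geneC"]
matrix = [
    [12.0, 0.2, nan],
    [1.4, 0.0, nan],
    [3.1, 0.9, 2.0],
]

kept_ids, kept_matrix = filter_by_tpm(gene_ids, matrix, tpm_cutoff=1.0)
bad_samples, bad_genes = flag_missing(sample_ids, gene_ids, matrix)
```

In this toy matrix, geneB never exceeds the cutoff and is dropped by the TPM filter, while geneC is missing in two of three samples and is flagged by the missing-value check.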
