All the previously described approaches provide valuable and orthogonal information to better understand the molecular basis of human complex traits, including different molecular mechanisms through variant-trait, variant-gene, gene-trait, and gene-gene information. All of them have limitations and offer a partial view of the complex molecular interplay underlying pathophysiological processes. In this section, we describe a gene module-based approach aimed at identifying patterns shared across these data sources. The underlying hypothesis of this approach is that biologically meaningful, disease-relevant gene-gene links will be revealed when different yet incomplete data modalities are integrated. In particular, we show examples of how the integration of gene modules with gene-trait associations and gene-drug data highlights core genes, disease-relevant molecular processes, and drugs' mechanisms of action.
The PhenoPLIER framework [@doi:10.1038/s41467-023-41057-4] is a recent computational framework that implements this gene module-based approach. PhenoPLIER can integrate gene modules extracted from any transcriptome data with gene-trait associations from TWAS and drug-induced transcriptional profiles (Figure {@fig:fig4}). The approach integrates different molecular data sources by using a latent space derived from transcriptome data where each LV is a gene module (Figure {@fig:fig4}a). Then, different data modalities are integrated into this common latent space for a joint analysis, which has been shown to detect important genes missed by standard methods, capture relevant trait clusters, and predict drug-disease links more accurately than state-of-the-art drug repurposing approaches. An advantage of PhenoPLIER is interpretability: it uses gene module models from PLIER (Figure {@fig:fig4}a), which provide not only information about which genes belong to a module but also what the top samples are where those genes are expressed. If present, by looking at the metadata of the RNA-seq datasets, it is possible to infer if those samples represent consistent conditions such as tissues, cell types, or other more complex contexts.
In PhenoPLIER, this latent transcriptional space is described by 987 gene modules derived from recount2 [@doi:10.1038/nbt.3838; @doi:10.1016/j.cels.2019.04.003] (Figure {@fig:fig4}a), a large and highly heterogeneous expression compendium. Given the heterogeneity of the dataset, these gene modules can capture genes expressed across different contexts such as tissues, cell types (across differentiation stages), and different disease states or stimuli. These contexts can be extracted from sample metadata. PhenoPLIER incorporates gene-trait associations computed with the PrediXcan family of TWAS methods [@doi:10.1038/ng.3367; @doi:10.1038/s41467-018-03621-1; @doi:10.1371/journal.pgen.1007889] on two different cohorts: the UK Biobank [@doi:10.1038/s41586-018-0579-z; @doi:10.1126/sciadv.aba2083] as discovery, and the Electronic Medical Records and Genomics (eMERGE) network phase III [@doi:10.1038/gim.2013.72; @doi:10.1101/2021.10.21.21265225] as replication. Gene-trait associations are then projected into the latent space (Figure {@fig:fig4}b), where cluster analysis on traits was performed, and expected and stable groupings of traits were detected, such as asthma and allergies, heel bone-densitometry measurements, hematological assays on red blood cells, physical measures, keratometry measurements, assays on white blood cells and platelets, skin and hair color traits, autoimmune disorders, and cardiovascular diseases. The cardiovascular grouping also included other cardiovascular-related traits such as hand-grip strength [@doi:10/749], and environmental/behavioral factors such as physical activity and diet. The PhenoPLIER approach also proposed a methodology based on interpretable classifiers to detect which gene modules are driving the different clusters. For example, gene modules associated with the red blood cells cluster were 1) well-aligned to pathways related to early progenitors of the erythrocytes lineage, 2) predominantly expressed in early differentiation stages of erythropoiesis, and 3) strongly associated with different assays on red blood cells from TWAS. Other gene modules were associated with the keratometry measures grouping and expressed in corneal endothelial cells, and the grouping with autoimmune disorders was driven by gene modules expressed in T cells. These results show that shared patterns exist between prior knowledge (pathways), gene expression, and gene-trait associations from TWAS.
In PhenoPLIER, information about gene-drug links is incorporated from transcriptional responses to small molecule perturbations profiled in LINCS L1000 [@doi:10.1016/j.cell.2017.10.049]. This information was integrated with gene-trait associations to build a gene module-based drug repurposing framework with the purpose of assessing how substituting LVs for single genes predicted known treatment-disease relationships better (Figure {@fig:fig4}c). Based on an established drug repurposing strategy that matches reversed transcriptome patterns between genes and drug-induced perturbations [@doi:10.1126/scitranslmed.3002648; @doi:10.1126/scitranslmed.3001318], the authors adapted an existing drug repurposing framework [@doi:10.1038/nn.4618] that uses single gene information from TWAS to gene modules in PhenoPLIER. Then, these two approaches, the single gene-based and gene module-based, were compared using a manually curated gold standard set of drug-disease medical indications [@doi:10.7554/elife.26726; @doi:10.5281/zenodo.47664] for 322 drugs across 53 diseases to evaluate the prediction performance. The gene module-based approach outperformed the single gene-based one with an area under the curve of 0.63 vs. 0.57, and average precision of 0.86 vs. 0.64. Although the performance difference in this task was not large, the authors noted that the gene module-based approach represents a compressed version of the entire set of single gene-based results, and the higher performance implied that the low-dimensional latent space used (which necessarily misses some information) captured biologically meaningful gene-gene patterns. Additionally, since gene modules represent interpretable features, the authors found that lipid-related gene modules expressed in adipose tissue and liver were among the top modules contributing to the prediction of high cholesterol and Nicotinic acid (Niacin, which can treat lipid disorders), potentially resembling known mechanisms of action of Niacin, such as decreasing the production of low-density lipoproteins (LDL) either by modulating triglyceride synthesis in hepatocytes or by inhibiting adipocyte triglyceride lipolysis [@doi:10.1016/j.amjcard.2008.02.029].
The PhenoPLIER framework is also able to compute an association between a gene module and a trait by integrating single gene TWAS results (Figure {@fig:fig4}d). The association is computed using a regression model that tests whether genes that strongly belong to a module (using a column in matrix Z in Figure {@fig:fig4}a) are also strongly associated with the trait (using a row in the yellow TWAS matrix in Figure {@fig:fig4}b). For validation, the study conducted a CRISPR-Cas9 screen in the HepG2 (liver) cell line to identify genes associated with lipid regulation and found a high-confidence lipid-increasing gene set that included genes DGAT2 and ACACA, which fit the definition of core genes. The gene module identified as LV246 (Figure {@fig:fig4}e) was the module most strongly enriched with this lipid-increasing gene set, containing DGAT2 and ACACA among the top 15 genes most strongly connected with the module. LV246 was found to be 1) aligned with lipid metabolism pathways, 2) expressed in adipose tissue, liver, and astrocytes, among others, and 3) significantly associated with lipid-related traits in the discovery cohort (UK Biobank; Figure {@fig:fig4}e) and replication cohort (eMERGE), such as high cholesterol, triglycerides, LDL cholesterol, and cholesterol-lowering medications in the UK Biobank, and high cholesterol (hyperlipidemia, phecode 272.11) in eMERGE. Although a direct link between lipids and DGAT2 and ACACA was found by the CRISPR screening, the authors noted that these genes lack strong TWAS associations with any of these lipid-related traits (Figure {@fig:fig4}e, bottom). The GWAS Catalog shows that genetic variants in DGAT2 were found to be associated with cholesterol levels [@url:https://www.ebi.ac.uk/gwas/genes/DGAT2], but ACACA lacks any association with these phenotypes [@url:https://www.ebi.ac.uk/gwas/genes/ACACA]. A potential explanation provided by the study suggests that, based on the omnigenic model, DGAT2 and ACACA could represent core genes, and most of the other genes in LV246 might be peripheral genes that potentially trans regulate them. This example suggests that strong GWAS hits could represent either peripheral or core genes, and that some core genes might be missed by GWAS alone.
Additionally, another interesting trait association in module LV246 was with Alzheimer’s disease (AD) and dementia in the UK Biobank, as well as memory loss (phecode 292.3) in eMERGE, which tends to worsen as AD progresses [@doi:10.31887/DCNS.2013.15.4/hjahn]. The association with AD is relevant since this module was also found to be expressed to some degree in astrocytes and has APOE as one of the top genes (Figure {@fig:fig4}e, bottom). APOE, the gene most strongly associated with late-onset AD [@doi:10.1126/scitranslmed.aaz4564], encodes apolipoprotein E (ApoE), which is involved in lipid transport in the brain [@doi:10.1186/s13024-022-00574-4], a mechanism that, when dysregulated, may play an important role in AD pathogenesis. From TWAS alone, it can be seen that APOE is strongly associated and colocalized with the same traits as LV246 (Figure {@fig:fig4}e, bottom). However, a recent systematic survey across clinical trials that evaluate ApoE-targeted drugs for AD found modest to no efficacy in the treatment for this disease [@doi:10.1212/WNL.0000000000204861]. Although there is strong evidence that APOE is a causal gene for AD, it still remains to be determined whether it represents a core gene or plays a more peripheral role. A gene module-based approach might help prioritize core genes that remain elusive when using standard single gene strategies.
Another approach to compute an association between gene modules and traits is MAGMA gene-set analysis [@doi:10.1371/journal.pcbi.1004219]. In fact, the regression model employed by PhenoPLIER is based on MAGMA. The difference is that PhenoPLIER uses gene-trait association from the PrediXcan family of TWAS methods, while MAGMA does not incorporate eQTL data and uses a proximity-based approach in linking variants with genes. Another difference is that MAGMA allows for conditional and interaction gene-set analysis to account for correlated gene modules, and this approach was found to be useful in detecting novel pathways in the context of blood pressure [@doi:10.1038/s41467-018-06022-6]. The use of eQTL data by TWAS, however, might better fit the gene regulatory network assumptions of the omnigenic model than proximity-based approaches like MAGMA.
Approaches based on gene modules have the potential to go beyond prior knowledge and capture patterns that might be unique to different disease datasets. First, although a gene module can be aligned with a pathway, this does not mean that it is restricted to prior knowledge as in standard pathway analyses. Being aligned with pathways means that a module resembles a known mechanism, which helps in interpretability and in distinguishing it from other patterns that might be related to technical noise. However, the module could also capture other genes that are not part of the pathway but potentially involved in the function. Second, gene modules can be extracted from very large, heterogeneous datasets such as recount2, or more specific ones such as the Human Trisome Project, which includes gene expression data from people with Down Syndrome (DS). Approaches such as PLIER and PhenoPLIER have been fundamental in extracting DS-specific gene modules related to obesity, a common co-occurring condition [@Nandi2024].
A disadvantage of the PhenoPLIER approach is that gene modules are generated by an algorithm, and as such, they could represent artifacts or be aligned with technical noise. Interpretable methods such as PLIER and some VAE models that help segregate technical noise from relevant biology are key to solving this problem. Another limitation is the potential difficulty in determining the contexts in which a gene module is expressed. In the example of LV256 and lipid-related traits, the module was also found to be expressed, to varying degrees, in tumor samples, skin, skeletal muscle, and other contexts that might be hard to interpret. The interpretability advantage is also limited by the quality of the RNA-seq metadata. Additionally, compared with GWAS and TWAS, where the unit is an objective molecular entity (a SNP or gene in a particular chromosome and position), a gene module not only might or might not exist, but the algorithm could fail to capture all gene modules relevant for a particular study, which could bias any downstream analysis. Finally, a gene module, equivalent to a gene co-expression network, represents a set of correlated genes' expression in some conditions, but it does not provide information about gene-gene links (it's not directed) and could include many false positives [@doi:10.1038/s41576-023-00618-5]. Future approaches should also incorporate other data modalities to refine a gene module, such as transcription factor binding data or chromatin accessibility.
Although there are several limitations when we only correlate gene pairs across RNA-seq samples [@doi:10.1038/s41576-023-00618-5], integrating gene networks with other data modalities has the potential to identify disease-relevant molecular mechanisms [@doi:10.1371/journal.pgen.1008519]. Here, we describe an approach that integrates gene modules with other data sources, such as gene-trait and gene-drug information. The underlying hypothesis of this approach is that biologically meaningful gene-gene links will be captured when different data modalities that at least partially capture these links are integrated. In our examples, these data modalities are gene modules from transcriptome data, single gene-trait associations, and single gene-drug links via drug-induced transcriptional profiles. We also show that if the unsupervised approach used to extract a gene module is also interpretable (i.e., provides information about the specific contexts, such as cell types or tissues, that explain why those genes were grouped together), the data integration approach also sheds light on potential context-specific transcriptional mechanisms.