Commit

Update DGE-ML-for-biomarker-discovery.md
Omabekee authored Sep 27, 2024
1 parent 15f77fd commit e1e5ae1
Showing 1 changed file with 11 additions and 12 deletions.
23 changes: 11 additions & 12 deletions Stage-3/report/DGE-ML-for-biomarker-discovery.md
@@ -6,7 +6,7 @@

## 1. Introduction to Lymphoid Leukemias

-Lymphoid leukemias (LL) are blood cancers resulting from the abnormal growth of lymphoid cells- B, T, or NK cells. They are categorised into acute lymphoblastic leukaemia (ALL), common in children, and chronic lymphocytic leukaemia (CLL), which affects adults (Futami & Corey, 2010).
+Lymphoid leukemias (LL) are blood cancers resulting from the abnormal growth of lymphoid cells: B, T, or NK cells. They are categorised into acute lymphoblastic leukaemia (ALL), common in children, and chronic lymphocytic leukaemia (CLL), more common in adults, especially the elderly (Chennamadhavuni et al., 2023).

<figure>
<img src="figures/fig1.png" alt="Figure 1: AI-generated image showing the transition from primary to recurrent lymphoid leukemia, the progression of healthy cells to cancerous ones and their recurrence after treatment" width="600">
@@ -18,21 +18,21 @@ Lymphoid leukemias (LL) are blood cancers resulting from the abnormal growth of

To identify the key biomarkers associated with primary and recurrent LL samples using differential gene expression analysis and machine learning.

-## 2. Description of dataset and Data preprocessing steps
+## 2. Description of Dataset and Data Preprocessing Steps

The data was downloaded from The Cancer Genome Atlas (TCGA) database via the GDC data portal. We selected 25 primary and 25 recurrent samples for analysis.

-### 2.1 Handling missing values
+### 2.1 Handling Missing Values

The LL dataset was preprocessed by checking for missing or blank values using the `is.na` function, to reduce redundancies and understand what is unexpressed/undetected in our data.

### 2.2 Normalisation and Filtering

Normalisation and filtering were performed using the `TCGAanalyze_Normalization` and `TCGAanalyze_Filtering` functions from the TCGAbiolinks R package and the `betweenLaneNormalization` function from the EDASeq R package, to adjust for gene length and sequencing depth.

-## 3. Methodology for biomarker discovery
+## 3. Methodology for Biomarker Discovery

-### 3.1 Differential gene expression analysis (DGE)
+### 3.1 Differential Gene Expression Analysis (DGE)

The analysis was conducted using the `TCGAanalyze_DEA` function from the TCGAbiolinks R package. Comparison was made between "primary" and "recurrent" samples, filtering results by an adjusted p-value < 0.05 and log2 fold change > 1.
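
The thresholding step can be illustrated with a minimal Python sketch. This is not the `TCGAanalyze_DEA` call used in the report: the gene names and values are invented, and applying the fold-change cutoff to the absolute value (so that downregulated genes are also retained) is an assumption on our part.

```python
# Hypothetical DEA results: (gene, log2 fold change, adjusted p-value).
# All names and values are invented for illustration only.
results = [
    ("GENE_A", 2.4, 0.001),
    ("GENE_B", -1.8, 0.020),
    ("GENE_C", 0.6, 0.004),   # significant, but the change is too small
    ("GENE_D", 3.1, 0.300),   # large change, but not significant
]

FDR_CUT = 0.05    # adjusted p-value threshold stated in the report
LOGFC_CUT = 1.0   # log2 fold change threshold stated in the report


def is_significant(log_fc, padj):
    # Assumption: the cutoff is applied to |log2FC| so both directions pass.
    return padj < FDR_CUT and abs(log_fc) > LOGFC_CUT


significant = [row for row in results if is_significant(row[1], row[2])]
upregulated = [gene for gene, log_fc, _ in significant if log_fc > 0]
downregulated = [gene for gene, log_fc, _ in significant if log_fc < 0]

print("upregulated:", upregulated)
print("downregulated:", downregulated)
```

Splitting the significant genes by the sign of the fold change is what yields separate upregulated and downregulated lists for downstream enrichment.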

@@ -43,11 +43,11 @@ The analysis was conducted using the `TCGAanalyze_DEA` function from the TCGAbio



-### 3.2 Functional enrichment analysis
+### 3.2 Functional Enrichment Analysis

Functional enrichment was performed on 104 upregulated and 1,305 downregulated genes using the `enrichGO` function in R.

-### 3.3 Pathway visualisation
+### 3.3 Pathway Visualisation

The steps involved filtering pathways based on p-value and q-value, calculating the gene ratio and rich factor for each pathway, and visualising the top 20 enriched pathways using `ggplot`.
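
As a rough illustration of these steps, the Python sketch below computes both ratios for a few made-up pathways. The definitions used here are the common ones and are assumed rather than taken from the report: gene ratio is the pathway's DE gene count divided by all annotated DE genes, and rich factor is the same count divided by all genes annotated to the pathway. The 0.05 cutoffs and the ordering by q-value are likewise assumptions.

```python
# Hypothetical enrichment rows; every number is invented for illustration.
#   count      = DE genes annotated to the pathway
#   de_total   = all DE genes with any annotation
#   background = all genes annotated to the pathway
pathways = [
    {"name": "pathway_1", "count": 12, "de_total": 100, "background": 60,
     "pvalue": 0.001, "qvalue": 0.01},
    {"name": "pathway_2", "count": 30, "de_total": 100, "background": 300,
     "pvalue": 0.010, "qvalue": 0.03},
    {"name": "pathway_3", "count": 5, "de_total": 100, "background": 10,
     "pvalue": 0.200, "qvalue": 0.40},
]

P_CUT = 0.05   # assumed cutoff; the report does not state the value
Q_CUT = 0.05   # assumed cutoff; the report does not state the value
TOP_N = 20     # number of pathways plotted in the report

# Keep only pathways that pass both significance filters.
kept = [p for p in pathways if p["pvalue"] < P_CUT and p["qvalue"] < Q_CUT]

# Add the two ratios used on the plot axes and point sizes.
for p in kept:
    p["gene_ratio"] = p["count"] / p["de_total"]
    p["rich_factor"] = p["count"] / p["background"]

# Assumed ordering: most significant first, then take the top N.
top = sorted(kept, key=lambda p: p["qvalue"])[:TOP_N]

for p in top:
    print(p["name"], round(p["gene_ratio"], 3), round(p["rich_factor"], 3))
```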

@@ -64,7 +64,7 @@ The steps involved filtering pathways based on p-value and q-value, calculating
</figure>


-## 4. Methodology for Machine learning analysis
+## 4. Methodology for Machine Learning Analysis

### 4.1 Feature Extraction

@@ -74,7 +74,7 @@ After performing differential gene expression analysis, genes were filtered by s

A random forest classification model was built to classify sample type (primary or recurrent) using the feature-selected training dataset of 364 genes and 20 samples (10 primary and 10 recurrent). The model was configured with 500 trees in the forest (`ntree = 500`) and 27 genes considered at each split (`mtry = 27`). Model testing and validation were performed on an independent set of 10 samples (5 primary and 5 recurrent).

-## 5. Result and Interpretation of model performance
+## 5. Result and Interpretation of Model Performance

The model achieved a prediction accuracy of 80%. Of the 10 test samples, it correctly predicted 4 of 5 primary and 4 of 5 recurrent samples.
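
The arithmetic behind the 80% figure can be checked with a short Python sketch. Only the four counts come from the report; the variable names and the per-class recall calculation are added here for illustration.

```python
# Test-set counts from the report: 4 of 5 primary and 4 of 5 recurrent
# samples were classified correctly.
correct_primary, total_primary = 4, 5
correct_recurrent, total_recurrent = 4, 5

# Confusion matrix: outer key = true class, inner key = predicted class.
confusion = {
    "primary": {
        "primary": correct_primary,
        "recurrent": total_primary - correct_primary,
    },
    "recurrent": {
        "primary": total_recurrent - correct_recurrent,
        "recurrent": correct_recurrent,
    },
}

# Accuracy = correctly classified samples / all test samples.
total = total_primary + total_recurrent
accuracy = (correct_primary + correct_recurrent) / total

# Per-class recall = correct predictions for a class / samples in that class.
recall = {
    label: row[label] / sum(row.values())
    for label, row in confusion.items()
}

print(f"accuracy = {accuracy:.0%}")
print("recall per class:", recall)
```

With only 10 test samples, each misclassification shifts accuracy by 10 percentage points, so the estimate should be read with that granularity in mind.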

@@ -91,13 +91,12 @@ After the analysis, the model achieved a prediction accuracy of 80%. Out of the

## 6. Conclusion and Future Directions for Research

-This project combined machine learning and differential expression analysis to identify key biomarkers in LL. Through DEG and functional enrichment analysis, we were able to identify the molecular changes between the two stages- primary and recurrent. The analysis revealed important genes like AKR1C3, ARHGEF11 and AHNAK which show potential as therapeutic and diagnostic biomarkers for early detection and more effective therapies. Although our random forest classifier achieved 80% prediction accuracy, some primary and recurrent samples were misclassified, indicating areas for improvement.
-Future directions include refining the model for higher accuracy, expanding sample sizes and integrating additional data types to provide a more comprehensive understanding of LL.
+This project combined machine learning and differential expression analysis to identify key biomarkers in LL. Through DEG and functional enrichment analysis, we were able to identify specific molecular changes between the two stages, primary and recurrent. The analysis revealed important genes such as AKR1C3, ARHGEF11 and AHNAK, which show potential as therapeutic and diagnostic biomarkers for early detection and more effective therapies. Although our random forest classifier achieved 80% prediction accuracy, some primary and recurrent samples were misclassified, indicating areas for improvement. Future directions include refining the model for higher accuracy, expanding sample sizes and integrating additional data types to provide a more comprehensive understanding of LL.



## References

-1. Futami, M., & Corey, S. J. (2010). Signaling Targets in Lymphoid Leukemias. In *Handbook of Cell Signaling* (pp. 2831–2835). Elsevier. [https://doi.org/10.1016/B978-0-12-374145-5.00328-4](https://doi.org/10.1016/B978-0-12-374145-5.00328-4)
+1. Chennamadhavuni, A., Lyengar, V., Mukkamalla, S.K.R. & Shimanovsky, A. (2023). Leukemia. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing. Available from: https://www.ncbi.nlm.nih.gov/books/NBK560490/
2. Mounir, M., Lucchetta, M., Silva, T. C., Olsen, C., Bontempi, G., Chen, X., ... & Papaleo, E. (2019). New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx. *PLoS computational biology*, *15*(3), e1006701.
