Update DGE-ML-for-biomarker-discovery.md

Omabekee · Sep 27, 2024 · e1e5ae1
1 parent 15f77fd
commit e1e5ae1
Showing 1 changed file with 11 additions and 12 deletions.
diff --git a/Stage-3/report/DGE-ML-for-biomarker-discovery.md b/Stage-3/report/DGE-ML-for-biomarker-discovery.md
@@ -6,7 +6,7 @@
 
 ## 1. Introduction to Lymphoid Leukemias
 
-Lymphoid leukemias (LL) are blood cancers resulting from the abnormal growth of lymphoid cells- B, T, or NK cells. They are categorised into acute lymphoblastic leukaemia (ALL), common in children, and chronic lymphocytic leukaemia (CLL), which affects adults (Futami & Corey, 2010).
+Lymphoid leukemias (LL) are blood cancers resulting from the abnormal growth of lymphoid cells (B, T, or NK cells). They are categorised into acute lymphoblastic leukaemia (ALL), which is common in children, and chronic lymphocytic leukaemia (CLL), which is more common in adults, especially the elderly (Chennamadhavuni et al., 2023).
 
 <figure>  
   <img src="figures/fig1.png" alt="Figure 1: AI-generated image showing the transition from primary to recurrent lymphoid leukemia, the progression of healthy cells to cancerous ones and their recurrence after treatment" width="600">  
@@ -18,21 +18,21 @@ Lymphoid leukemias (LL) are blood cancers resulting from the abnormal growth of
 
 To identify the key biomarkers associated with primary and recurrent LL samples using differential gene expression analysis and machine learning.
 
-## 2. Description of dataset and Data preprocessing steps
+## 2. Description of Dataset and Data Preprocessing Steps
 
 The data was downloaded from The Cancer Genome Atlas (TCGA) database via the GDC data portal. We selected 25 primary and 25 recurrent samples for analysis.
 
-### 2.1 Handling missing values
+### 2.1 Handling Missing Values
 
 The LL dataset was checked for missing or blank values using the `is.na` function, both to reduce redundancies and to identify genes that were unexpressed or undetected in our data.
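
A minimal base R sketch of this kind of check is shown here. The matrix `expr` and its values are made up for illustration and are not the TCGA data; dropping incomplete genes is an assumed follow-up step, not necessarily what was done in this project.

```r
# Toy expression matrix (genes x samples); values are invented for illustration.
expr <- matrix(
  c(10, NA,  3,
     0,  5,  7,
     8,  2, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3"))
)

# Total number of missing cells in the matrix.
total_missing <- sum(is.na(expr))

# Number of missing values for each gene (row).
missing_per_gene <- rowSums(is.na(expr))

# Assumed follow-up: keep only genes with no missing values.
complete_genes <- expr[missing_per_gene == 0, , drop = FALSE]
```

Note that a count of zero is not a missing value, so `geneB` is retained even though it is unexpressed in `s1`.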
 
 ### 2.2 Normalisation and Filtering
 
 Normalisation and filtering were performed using the `TCGAanalyze_Normalization`, `TCGAanalyze_Filtering` and `betweenLaneNormalization` functions from the TCGAbiolinks and EDASeq R packages to adjust for gene length and sequencing depth.
 
-## 3. Methodology for biomarker discovery
+## 3. Methodology for Biomarker Discovery
 
-### 3.1 Differential gene expression analysis (DGE)
+### 3.1 Differential Gene Expression Analysis (DGE)
 
 The analysis was conducted using the `TCGAanalyze_DEA` function from the TCGAbiolinks R package. "Primary" samples were compared with "recurrent" samples, and results were filtered by an adjusted p-value < 0.05 and an absolute log2 fold change > 1.
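
The thresholding step can be sketched in base R as follows. The data frame `dea`, its values, and the column names `logFC` and `FDR` are assumptions made for illustration. The sketch also assumes the fold-change cutoff applies to the absolute value, since both upregulated and downregulated genes are reported.

```r
# Toy results table; values and column names are illustrative assumptions.
dea <- data.frame(
  logFC = c(2.3, -1.8, 0.4, -3.1, 1.5),
  FDR   = c(0.001, 0.020, 0.010, 0.200, 0.049),
  row.names = c("g1", "g2", "g3", "g4", "g5")
)

# Keep genes that pass both the adjusted p-value and fold-change cutoffs.
sig <- dea[dea$FDR < 0.05 & abs(dea$logFC) > 1, ]

# Split significant genes by direction of change.
up_genes   <- rownames(sig)[sig$logFC > 0]
down_genes <- rownames(sig)[sig$logFC < 0]
```

Here `g3` fails the fold-change cutoff and `g4` fails the adjusted p-value cutoff, so only three genes remain.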
 
@@ -43,11 +43,11 @@ The analysis was conducted using the `TCGAanalyze_DEA` function from the TCGAbio
 
 
 
-### 3.2 Functional enrichment analysis
+### 3.2 Functional Enrichment Analysis
 
 Functional enrichment was performed on 104 upregulated and 1,305 downregulated genes using the `enrichGO` function in R.
 
-### 3.3 Pathway visualisation
+### 3.3 Pathway Visualisation
 
 The steps involved filtering pathways based on p-value and q-value, calculating the gene ratio and rich factor for each pathway, and visualising the top 20 enriched pathways using `ggplot`. 
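
As a rough sketch, the two ratios can be derived from an enrichment table like this. The table `res` is invented, and its `GeneRatio`, `BgRatio` and `Count` columns are assumed to follow the "k/n" string layout commonly seen in enrichment output. The definitions used (gene ratio = genes in the term / genes in the input list; rich factor = genes in the term / all background genes annotated to that term) are one common convention and may differ from the one applied in this report.

```r
# Toy enrichment table; all values are invented for illustration.
res <- data.frame(
  Description = c("pathway A", "pathway B"),
  GeneRatio   = c("12/300", "5/300"),
  BgRatio     = c("120/18000", "25/18000"),
  Count       = c(12, 5),
  stringsAsFactors = FALSE
)

# Convert a "k/n" string into the numeric value k / n.
ratio_to_numeric <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

# Extract the numerator k from a "k/n" string.
numerator <- function(x) as.numeric(sub("/.*", "", x))

res$gene_ratio  <- ratio_to_numeric(res$GeneRatio)
res$rich_factor <- res$Count / numerator(res$BgRatio)

# Order pathways by rich factor and keep at most the top 20 for plotting.
top_pathways <- head(res[order(-res$rich_factor), ], 20)
```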
 
@@ -64,7 +64,7 @@ The steps involved filtering pathways based on p-value and q-value, calculating
 </figure>
 
 
-## 4. Methodology for Machine learning analysis
+## 4. Methodology for Machine Learning Analysis
 
 ### 4.1 Feature Extraction
 
@@ -74,7 +74,7 @@ After performing differential gene expression analysis, genes were filtered by s
 
 A random forest classification model was built to classify sample type (primary or recurrent) using the feature-selected training dataset consisting of 364 genes and 20 samples (10 primary and 10 recurrent). The model was configured with 500 trees in the forest (`ntree = 500`) and 27 genes considered at each split (`mtry = 27`). Model testing and validation were performed on an independent set of 10 samples (5 primary and 5 recurrent).
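
The configuration described above might look roughly like this with the `randomForest` R package. This is a sketch on simulated random data with the same dimensions, not the project's actual code or data. All object names (`train_x`, `train_y`, `test_x`, `test_y`) and the seed are hypothetical, and using `randomForest` rather than another implementation is itself an assumption.

```r
library(randomForest)

set.seed(42)  # arbitrary seed so the simulated example is reproducible

n_genes <- 364

# Simulated training set: 20 samples x 364 genes (random noise, not real data).
train_x <- matrix(rnorm(20 * n_genes), nrow = 20,
                  dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
train_y <- factor(rep(c("primary", "recurrent"), each = 10))

# Simulated independent test set: 10 samples x 364 genes.
test_x <- matrix(rnorm(10 * n_genes), nrow = 10,
                 dimnames = list(NULL, colnames(train_x)))
test_y <- factor(rep(c("primary", "recurrent"), each = 5))

# 500 trees, 27 candidate genes tried at each split.
rf_model <- randomForest(x = train_x, y = train_y, ntree = 500, mtry = 27)

# Predict the held-out samples and tabulate predictions against true labels.
pred <- predict(rf_model, newdata = test_x)
conf_mat <- table(predicted = pred, actual = test_y)
```

Because the simulated features are pure noise, the resulting confusion matrix says nothing about the real model's performance; the sketch only illustrates the shape of the workflow.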
 
-## 5. Result and Interpretation of model performance
+## 5. Result and Interpretation of Model Performance
 
 After the analysis, the model achieved a prediction accuracy of 80%. Of the 10 samples used for testing, the model correctly predicted 4 of 5 primary and 4 of 5 recurrent cancer samples.
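
The 80% figure follows directly from these counts, as the short base R check here shows. The confusion matrix is reconstructed from the numbers stated in the text (4 of 5 correct in each class); its layout and the object names are illustrative.

```r
# Confusion matrix rebuilt from the reported counts: rows are predictions,
# columns are true labels.
conf_mat <- matrix(
  c(4, 1,
    1, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("primary", "recurrent"),
                  actual    = c("primary", "recurrent"))
)

# Overall accuracy: correctly classified samples / all test samples.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class recall: correct predictions for a class / true members of that class.
per_class_recall <- diag(conf_mat) / colSums(conf_mat)
```

With only 10 test samples, each misclassification shifts accuracy by 10 percentage points, which is one reason a larger test set would give a more stable estimate.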
 
@@ -91,13 +91,12 @@ After the analysis, the model achieved a prediction accuracy of 80%. Out of the
 
 ## 6. Conclusion and Future Directions for Research
 
-This project combined machine learning and differential expression analysis to identify key biomarkers in LL. Through DEG and functional enrichment analysis, we were able to identify the molecular changes between the two stages- primary and recurrent. The analysis revealed important genes like AKR1C3, ARHGEF11 and AHNAK which show potential as therapeutic and diagnostic biomarkers for early detection and more effective therapies. Although our random forest classifier achieved 80% prediction accuracy, some primary and recurrent samples were misclassified, indicating areas for improvement.  
-Future directions include refining the model for higher accuracy, expanding sample sizes and integrating additional data types to provide a more comprehensive understanding of LL.
+This project combined machine learning and differential expression analysis to identify key biomarkers in LL. Through DGE and functional enrichment analysis, we identified specific molecular changes between the two stages, primary and recurrent. The analysis revealed important genes such as AKR1C3, ARHGEF11 and AHNAK, which show potential as therapeutic and diagnostic biomarkers for early detection and more effective therapies. Although our random forest classifier achieved 80% prediction accuracy, some primary and recurrent samples were misclassified, indicating areas for improvement. Future directions include refining the model for higher accuracy, expanding sample sizes and integrating additional data types to provide a more comprehensive understanding of LL.
 
 
 
 ## References
 
-1. Futami, M., & Corey, S. J. (2010). Signaling Targets in Lymphoid Leukemias. In *Handbook of Cell Signaling* (pp. 2831–2835). Elsevier. [https://doi.org/10.1016/B978-0-12-374145-5.00328-4](https://doi.org/10.1016/B978-0-12-374145-5.00328-4)  
+1. Chennamadhavuni, A., Lyengar, V., Mukkamalla, S.K.R. & Shimanovsky, A. (2023). Leukemia. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing. Available from: https://www.ncbi.nlm.nih.gov/books/NBK560490/ 
 2. Mounir, M., Lucchetta, M., Silva, T. C., Olsen, C., Bontempi, G., Chen, X., ... & Papaleo, E. (2019). New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx. *PLoS computational biology*, *15*(3), e1006701.