diff --git a/Stage-4/report/lgg-analysis-report.md b/Stage-4/report/lgg-analysis-report.md
index 9208b4d..9f40520 100644
--- a/Stage-4/report/lgg-analysis-report.md
+++ b/Stage-4/report/lgg-analysis-report.md
@@ -6,11 +6,26 @@
 ---
 
+## Table of Contents
+1. [Introduction](#1-introduction-to-low-grade-glioma)
+2. [Dataset and Data Preprocessing](#2-description-of-dataset-and-data-preprocessing-steps)
+3. [Methodology for Biomarker Discovery](#3-methodology-for-biomarker-discovery)
+4. [Methodology for Machine Learning Analysis](#4-methodology-for-machine-learning)
+5. [Result and Interpretation](#5-result-and-interpretation-of-model-performance)
+6. [Conclusion and Future Directions for Research](#6-conclusion-and-future-directions-for-research)
+7. [References](#references)
+
+---
 
 ## 1. Introduction to Low-Grade Glioma
 
 Low-Grade Gliomas (LGGs) are slow-growing brain tumors classified as Grade II gliomas by the World Health Organization (Ravanpay *et al*., 2018). Despite their slower growth, they can infiltrate brain tissue and progress to more aggressive forms. A key biomarker in LGG is the IDH mutation, which is associated with a better prognosis, while IDH-wildtype tumors tend to behave more aggressively (Solomou *et al*., 2023).
+
+Figure 1: Analysis Workflow
+
 Figure 2: Volcano plot showing the significant genes between mutant and wild type LGG samples
+
 
 ### 1.1 Project Aim:
@@ -37,7 +52,7 @@ The analysis used the `TCGAanalyze_DEA` function from the TCGAbiolinks R package
 Figure 2: Volcano plot showing the significant genes between mutant and wild type LGG samples
-Figure 2: Volcano plot showing the significant genes between mutant and wild type LGG samples
+
+Figure 2: Volcano Plot Showing Significant Genes
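The `TCGAanalyze_DEA` call behind this volcano plot could be sketched roughly as below. This is an illustration only: the expression matrix `dataFiltLGG`, the two barcode vectors, and the 0.05 FDR cutoff are assumptions, not values taken from the report.

```r
# Illustrative sketch -- dataFiltLGG, mutant_barcodes and wildtype_barcodes
# are hypothetical names; the 0.05 FDR cutoff is an assumed threshold.
library(TCGAbiolinks)

dea <- TCGAanalyze_DEA(
  mat1      = dataFiltLGG[, mutant_barcodes],    # IDH-mutant samples
  mat2      = dataFiltLGG[, wildtype_barcodes],  # IDH-wildtype samples
  Cond1type = "Mutant",
  Cond2type = "WT",
  method    = "glmLRT"                           # edgeR likelihood-ratio test
)

# LogFC > 1 filter as described in the report; the FDR cutoff is assumed
sig.genes <- dea[abs(dea$logFC) > 1 & dea$FDR < 0.05, ]
```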
@@ -75,20 +90,17 @@ After performing DGE, genes were filtered by selecting those with a LogFC > 1 an
 A random forest classification model was built to classify mutation status (mutant or wildtype) using the feature-selected training dataset of 123 genes and 360 samples, with 100 genes considered at each split (`mtry = 100`). Model testing was performed on an independent set of 153 samples.
 
 ### 4.2 KNN for Predicting IDH Status
-
-### 4.2.1 Feature extraction
-
 We built a K-Nearest Neighbors (KNN) model to predict IDH status (mutant or wildtype) based on gene expression across samples. After thorough data preprocessing, cleaning, and filtering, we applied the `topPreds` function to identify the top 1000 predictors with the highest standard deviation, which were then used to train the model.
 Figure 5: Boxplot after log transformation
-Figure 5: Boxplot after log transformation
+
+Figure 5: Boxplot After Log Transformation
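The random forest configuration described in section 4.1 (123 genes, 360 training samples, `mtry = 100`, 153 test samples) might be set up as follows; `train_data`, `test_data`, and the `idh_status` response column are hypothetical names, and the seed is arbitrary.

```r
# Illustrative sketch -- train_data/test_data and idh_status are hypothetical.
library(randomForest)

set.seed(42)                              # arbitrary seed for reproducibility
rf.lgg <- randomForest(
  idh_status ~ .,                         # mutant vs wildtype label
  data       = train_data,                # 360 samples x 123 selected genes
  mtry       = 100,                       # genes considered at each split
  importance = TRUE                       # record Gini importance for later
)

# Evaluate on the 153 held-out samples
rf.pred <- predict(rf.lgg, newdata = test_data)
table(Predicted = rf.pred, Actual = test_data$idh_status)
```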
-### 4.2.2 Model Training and Testing
+### 4.2.1 Model Training and Testing
 
 The data was split 70:30, giving 375 samples for training and 159 for testing. We used cross-validation (via `ctrl.lgg`) to ensure thorough sampling and reduce bias during training. The optimal value of k was determined using `knn.lgg$bestTune`, which selected `k = 1`.
@@ -98,19 +110,19 @@ We built a K-Nearest Neighbors (KNN) model to predict IDH status (mutant or wild
Figure 6: Confusion matrices (CM) summarising the performance of the model on the test data to give insight into the precision of the classification approach -
Figure 6: Confusion matrices (CM) summarising the performance of the model on the test data to give insight into the precision of the classification approach
+
Figure 6: Confusion Matrices Summarising the Performance of the Model
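The KNN training step could be sketched with caret as follows. Only `ctrl.lgg`, `knn.lgg`, the 70:30 split, and `bestTune` come from the report; `exprs` and `labels` are hypothetical names, and the 10-fold cross-validation and `k` search grid are assumptions.

```r
# Illustrative sketch -- exprs/labels are hypothetical; the fold count and
# the k grid are assumed, not taken from the report.
library(caret)

set.seed(42)
idx <- createDataPartition(labels, p = 0.7, list = FALSE)  # 70:30 split

ctrl.lgg <- trainControl(method = "cv", number = 10)  # assumed 10-fold CV
knn.lgg  <- train(
  x         = exprs[idx, ],
  y         = labels[idx],
  method    = "knn",
  trControl = ctrl.lgg,
  tuneGrid  = data.frame(k = 1:10)        # assumed search grid
)

knn.lgg$bestTune                          # report found k = 1

# Confusion matrix on the held-out 30%
confusionMatrix(predict(knn.lgg, exprs[-idx, ]), labels[-idx])
```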
 Figure 7: Summary of performance metrics for random forest model
-Figure 7: Summary of performance metrics for random forest model
+
+Figure 7: Summary of Performance Metrics for Random Forest Model
 Figure 8: Top 20 genes by gini importance that helped the model performance
-Figure 8: Top 20 genes by gini importance that helped the model performance
+
+Figure 8: Top 20 Genes by Gini Importance
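The Gini-importance ranking shown in Figure 8 could be extracted from a fitted `randomForest` model like so; the object name `rf.lgg` is hypothetical.

```r
# Illustrative sketch -- rf.lgg is a hypothetical fitted randomForest object
# trained with importance = TRUE.
library(randomForest)

gini  <- importance(rf.lgg, type = 2)             # mean decrease in Gini
top20 <- head(sort(gini[, 1], decreasing = TRUE), 20)

varImpPlot(rf.lgg, type = 2, n.var = 20,          # plot top 20 genes
           main = "Top 20 genes by Gini importance")
```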
@@ -122,7 +134,7 @@ The model achieved a prediction accuracy of 91.5%. The model accurately predicte
 Figure 9: Confusion matrices (CM) summarising the performance of the model on the test data
-Figure 9: Confusion matrices (CM) summarising the performance of the model on the test data
+
+Figure 9: Confusion Matrices Summarising Model Performance