Machine Learning Project - mRNA Expression Data
Test different classification methods (discuss) → Select the best method
Gene expression data (base ML dataset, raw RNA-seq dataset) of five different cancer types.
Preprocessing (Unsupervised learning):
- Principal Component Analysis (PCA)
- t-SNE
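A minimal sketch of how the two steps could be chained, assuming scikit-learn is used. The data here is a synthetic stand-in for the expression matrix (samples x genes), and the number of components, perplexity, and random seed are illustrative assumptions, not values chosen for this project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (150 samples x 200 "genes").
# Replace with the real data; the five centers only mimic five cancer types.
X, y = make_blobs(n_samples=150, n_features=200, centers=5, random_state=0)

# Standardize each feature so high-variance genes do not dominate the PCA.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to a moderate number of principal components first (20 is an assumption).
pca = PCA(n_components=20, random_state=0)
X_pca = pca.fit_transform(X_scaled)

# Embed the PCA-reduced data in 2D with t-SNE for visualization.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_pca)

print(X_pca.shape, X_tsne.shape)
```

Running PCA before t-SNE is a common way to reduce noise and runtime; the t-SNE coordinates are meant for plots only, not as classifier input.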
Use different classification methods (Supervised learning):
- K-Nearest Neighbors
- Linear Models
- Naive Bayes Classifiers
- Decision Trees
- Kernelized Support Vector Machines (?)
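One possible way to compare the methods listed above on equal terms is stratified cross-validation, sketched here with scikit-learn. The dataset is synthetic, and the specific estimators (for example `LogisticRegression` as the linear model, `GaussianNB` as the naive Bayes variant) and their settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class data standing in for the expression matrix.
X, y = make_classification(
    n_samples=250, n_features=100, n_informative=20,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# One candidate per method family; hyperparameters are untuned defaults.
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "linear": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
}

# The same folds for every model, so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    results[name] = cross_val_score(pipeline, X, y, cv=cv).mean()

best = max(results, key=results.get)
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} mean accuracy = {score:.3f}")
print("best on this synthetic data:", best)
```

Accuracy is only one criterion; with imbalanced classes a metric such as balanced accuracy (`scoring="balanced_accuracy"`) may be a better basis for selecting the best method.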
Keep track of what we still have to do. Please update this list with new to-dos.
- Update README.
- Investigate preprocessing that is applied to the data.
- Write about preprocessing steps in report.
- Keep track of references in the report.
- Reorganize repository (give logical filenames, restructure folders, etc.).
- Rewrite PCA scripts structure.
- Calculate the number of PCs needed (PCA script).
- Review PCA script (especially investigate the explained variance values).
- Download data of different cancer types from Synapse and merge with annotations (also from Synapse).
- Try different hyperparameters in the ML algorithms (KNN, SVM, etc.) and use cross-validation.
- PCA: try applying it within cancer types.
- Find important features → DEGs (differentially expressed genes).
- KEGG analysis (pathways).
- Check for class imbalances (bar plot).
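A rough sketch of how three of the items above could be approached with scikit-learn: counting samples per class, choosing the number of PCs from the cumulative explained variance, and a cross-validated hyperparameter search. The data is synthetic, and the 90% variance threshold and the parameter grid are assumptions rather than project decisions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately imbalanced five-class data (stand-in for the real set).
X, y = make_classification(
    n_samples=250, n_features=100, n_informative=20, n_classes=5,
    n_clusters_per_class=1, weights=[0.4, 0.25, 0.15, 0.1, 0.1],
    random_state=0,
)

# 1) Class imbalance: samples per class (these counts could feed a bar plot).
labels, counts = np.unique(y, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))
print("samples per class:", class_counts)

# 2) Number of PCs: smallest count whose cumulative explained variance
#    reaches the threshold (0.90 is an assumed value).
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_pcs = int(np.searchsorted(cumulative, 0.90) + 1)
print("PCs needed for 90% variance:", n_pcs)

# 3) Hyperparameter search with cross-validation. Scaling and PCA sit inside
#    the pipeline so they are refit on each training fold (no leakage).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {
    "pca__n_components": [10, 20],
    "knn__n_neighbors": [3, 5, 11],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="balanced_accuracy")
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best balanced accuracy:", round(search.best_score_, 3))
```

The same `GridSearchCV` pattern should carry over to an SVM by swapping the final pipeline step and the grid (for example `C` and `gamma`).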