initial commit

aliafzal · Feb 23, 2020 · cca6942 · cca6942
1 parent 7c784ff
commit cca6942
Show file tree

Hide file tree

Showing 31 changed files with 51,678 additions and 2 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/README.md b/README.md
@@ -1,2 +1,32 @@
-# Isentification-of-mislabeling-data-using-ML
-Overcome mislabeling errors in genomics training sets by utilizing machine learning on AWS EC2 and Apache Spark.
+ML-mislabel-identification
+
+# Detecting Mislabeled Genomics Data using Machine Learning Models
+
+## Contents
+#### Data and Analysis
+1. The genomics raw data in csv format and its analysis with a readme file are in the [src folder](https://github.com/hpazooki/ML-mislabel-identification/tree/master/src). The libraries used are: pandas & numpy for handling the data, scikit-learn & scipy for analysis, and matplotlib & seaborn for visualizations
+2. Prints of the code ported to Scala for AWS EMR Spark cluster and its results are in the [spark folder](https://github.com/hpazooki/ML-mislabel-identification/tree/master/spark). The rationale behind it is explained in more detail in the final report.
+
+#### Results
+The written research paper is in the [docs folder](https://github.com/zpazooki/ML-mislabel-identification/tree/master/docs). The quantitative results are included in the code notebooks. 
+
+## Summary
+For the final research project in the EE542: Cloud Computing course, my team members and I chose to participate in a data science challege proposed by PrecisionFDA, a cloud-based DNA sequencing data platform developed by the Food and Drug Administration. 
+
+The challenge's objective was to explore various methods of detecting human error in data collection, which is currently a critical roadblock that can occur in quatitative clinical research. The provided datasets were extremely limited by design and the training set was identical in size to the test dataset provided. The challenge was to use a small training set of correctly labeled samples to train a model that is able to detect future errors in a similiar context. For a more detailed explanation, see the challenge proposal blurb in Nature magazine [(/docs/mislabeling_correction_challenge.pdf)](https://github.com/hpazooki/ML-mislabel-identification/blob/master/docs/mislabeling_correction_challenge.pdf). The summary figure is attached below
+![alt text](https://github.com/zpazooki/ML-mislabel-identification/blob/master/img/1.png) 
+
+
+The provided data for the 80 training samples were of two types: genomics data and their correposnding metadata about gender and tumor msi type. The Proteomics and RNA-seq data were extremely wide each consisting of hundreds of columns for each of the 80 samples, 20,000 total attributes for each, which is characteristic of genomics data in general. 
+
+In summary, our approach was to use the provided information in the correctly labeled samples to map each of the two metadata to the proteomics and rna data, and then use that information to identify samples with mislabeled metadata. For example, if according to our model the proteomics data indicates that the patient is male but she is labeled as a female, the sample would be flagged as mislabeled. Our method is laid out in more detail in our research paper [(/docs/report.pdf)](https://github.com/hpazooki/ML-mislabel-identification/blob/master/docs/report.pdf), with the technical summary in its appendix.
+
+The abstract from our paper is attached below. I was responsible for sections II, III, and IV in the report.
+
+<br>
+
+
+
+</br>
+          
+![alt text](https://github.com/zpazooki/ML-mislabel-identification/blob/master/img/2.png)
diff --git a/docs/mislabeling_correction_challenge.pdf b/docs/mislabeling_correction_challenge.pdf
diff --git a/docs/report.pdf b/docs/report.pdf
diff --git a/img/1.png b/img/1.png
diff --git a/img/2.png b/img/2.png
diff --git a/spark/Proteomics - Zeppelin.pdf b/spark/Proteomics - Zeppelin.pdf
diff --git a/spark/README.md b/spark/README.md
@@ -0,0 +1,3 @@
+# README
+
+Above are pdf prints of Zepplin Notebooks created on AWS EMR for cluster computing. The notebooks are almost identical to jupyter notebooks but are more stable for working on Spark, especially when using Scala. 
diff --git a/spark/Random Forest Filtering - Zeppelin.pdf b/spark/Random Forest Filtering - Zeppelin.pdf
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,3 @@
		# README

		Above are pdf prints of Zepplin Notebooks created on AWS EMR for cluster computing. The notebooks are almost identical to jupyter notebooks but are more stable for working on Spark, especially when using Scala.