Commit

initial commit
Ali Afzal authored and Ali Afzal committed Feb 23, 2020
1 parent 7c784ff commit cca6942
Showing 31 changed files with 51,678 additions and 2 deletions.
Binary file added .DS_Store
Binary file not shown.
34 changes: 32 additions & 2 deletions README.md
@@ -1,2 +1,32 @@
# Isentification-of-mislabeling-data-using-ML
Overcome mislabeling errors in genomics training sets by utilizing machine learning on AWS EC2 and Apache Spark.
ML-mislabel-identification

# Detecting Mislabeled Genomics Data using Machine Learning Models

## Contents
#### Data and Analysis
1. The raw genomics data in CSV format and its analysis, with a readme file, are in the [src folder](https://github.com/hpazooki/ML-mislabel-identification/tree/master/src). The libraries used are pandas and numpy for handling the data, scikit-learn and scipy for analysis, and matplotlib and seaborn for visualizations.
2. PDF printouts of the code ported to Scala for an AWS EMR Spark cluster, along with its results, are in the [spark folder](https://github.com/hpazooki/ML-mislabel-identification/tree/master/spark). The rationale behind the port is explained in more detail in the final report.

#### Results
The written research paper is in the [docs folder](https://github.com/zpazooki/ML-mislabel-identification/tree/master/docs). The quantitative results are included in the code notebooks.

## Summary
For the final research project in the EE542: Cloud Computing course, my team members and I chose to participate in a data science challenge proposed by PrecisionFDA, a cloud-based DNA sequencing data platform developed by the Food and Drug Administration.

The challenge's objective was to explore various methods of detecting human error in data collection, which is currently a critical roadblock in quantitative clinical research. The provided datasets were extremely limited by design, and the training set was identical in size to the test set. The challenge was to use a small training set of correctly labeled samples to train a model that can detect future errors in a similar context. For a more detailed explanation, see the challenge proposal blurb in Nature magazine [(/docs/mislabeling_correction_challenge.pdf)](https://github.com/hpazooki/ML-mislabel-identification/blob/master/docs/mislabeling_correction_challenge.pdf). The summary figure is attached below.
![alt text](https://github.com/zpazooki/ML-mislabel-identification/blob/master/img/1.png)


The provided data for the 80 training samples were of two types: genomics data and their corresponding metadata about gender and tumor MSI type. The proteomics and RNA-seq data were extremely wide, each consisting of hundreds of columns for each of the 80 samples (20,000 total attributes each), which is characteristic of genomics data in general.

In summary, our approach was to use the information in the correctly labeled samples to map each of the two metadata fields to the proteomics and RNA data, and then use that mapping to identify samples with mislabeled metadata. For example, if our model indicates from the proteomics data that the patient is male but the sample is labeled female, the sample is flagged as mislabeled. Our method is laid out in more detail in our research paper [(/docs/report.pdf)](https://github.com/hpazooki/ML-mislabel-identification/blob/master/docs/report.pdf), with the technical summary in its appendix.
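
The flagging idea described above can be sketched in a few lines of Python with scikit-learn. This is a minimal illustration and not the code from the notebooks in this repository: the logistic regression classifier, the function names `flag_mislabeled` and `make_synthetic`, and the synthetic data standing in for the proteomics matrix are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def flag_mislabeled(X_train, y_train, X_test, recorded_labels):
    """Return indices of test samples whose recorded label disagrees with the model.

    The model is fit only on the correctly labeled training samples.
    (Hypothetical helper; the classifier choice is an assumption.)
    """
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    return np.flatnonzero(predicted != np.asarray(recorded_labels))


def make_synthetic(n_samples, n_features=50, n_informative=10, shift=4.0, seed=0):
    """Build a toy feature matrix in which a binary label shifts a few columns.

    Stand-in for real omics data; every number here is illustrative.
    """
    rng = np.random.default_rng(seed)
    y = np.arange(n_samples) % 2                # alternating 0/1 labels
    X = rng.normal(size=(n_samples, n_features))
    X[:, :n_informative] += shift * y[:, None]  # label-dependent signal
    return X, y


if __name__ == "__main__":
    X_train, y_train = make_synthetic(80, seed=0)
    X_test, y_true = make_synthetic(80, seed=1)

    # Simulate clerical errors by swapping the recorded label of three samples.
    recorded = y_true.copy()
    swapped = [3, 17, 42]
    recorded[swapped] = 1 - recorded[swapped]

    print("flagged sample indices:", flag_mislabeled(X_train, y_train, X_test, recorded))
```

In this sketch a sample is flagged whenever the predicted label differs from the recorded one, so the quality of the flags depends entirely on how well the model separates the classes; on real data a probability threshold or agreement between the proteomics and RNA models would likely be needed.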

The abstract from our paper is attached below. I was responsible for sections II, III, and IV in the report.

<br>



</br>
![alt text](https://github.com/zpazooki/ML-mislabel-identification/blob/master/img/2.png)
Binary file added docs/mislabeling_correction_challenge.pdf
Binary file not shown.
Binary file added docs/report.pdf
Binary file not shown.
Binary file added img/1.png
Binary file added img/2.png
Binary file added spark/Proteomics - Zeppelin.pdf
Binary file not shown.
3 changes: 3 additions & 0 deletions spark/README.md
@@ -0,0 +1,3 @@
# README

Above are PDF printouts of Zeppelin notebooks created on AWS EMR for cluster computing. The notebooks are almost identical to Jupyter notebooks but are more stable for working on Spark, especially when using Scala.
Binary file added spark/Random Forest Filtering - Zeppelin.pdf
Binary file not shown.