Design Doc
Overview (1 pt)
Our input is an abundance table in which the rows are colony forming unit (CFU) counts for each type of bacteria and the columns are patient IDs, which encode disease state (we may get different data down the road). There are a total of 124 types of bacteria, 682 patients, and 4 disease states plus healthy controls. Each patient ID contains one or more characters that denote disease/symptom status: C is healthy controls (n = 150), E is urinary tract infection (n = 157), OAB is overactive bladder symptoms (n = 75), UUI is urge urinary incontinence (n = 150), and SUI is stress urinary incontinence (n = 150). The total number of CFUs per patient is also recorded.
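As a minimal sketch of how the status code could be read out of a patient ID: the code below assumes each ID starts with its letter code followed by digits (for example "OAB003"). That layout, the example IDs, and the helper name `group_code` are assumptions for illustration; the real ID format may differ.

```python
import re
from collections import Counter

# Group codes and meanings as listed in the overview.
GROUPS = {
    "C": "healthy control",
    "E": "urinary tract infection",
    "OAB": "overactive bladder symptoms",
    "UUI": "urge urinary incontinence",
    "SUI": "stress urinary incontinence",
}

# Assumption: the status code is the run of letters at the start of the ID.
ID_PATTERN = re.compile(r"^([A-Za-z]+)")


def group_code(patient_id: str) -> str:
    """Return the status code encoded in the leading letters of a patient ID."""
    match = ID_PATTERN.match(patient_id.strip())
    if match is None:
        raise ValueError(f"no letter prefix in patient ID {patient_id!r}")
    code = match.group(1).upper()
    if code not in GROUPS:
        raise ValueError(f"unknown status code {code!r} in {patient_id!r}")
    return code


# Hypothetical IDs, only to show the counting step.
example_ids = ["C001", "E017", "OAB003", "UUI120", "SUI045", "C002"]
counts = Counter(group_code(pid) for pid in example_ids)
print(counts)
```

Counting the codes this way gives a quick check that the group sizes in the real data match the numbers quoted above.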
Our goal is to run this data through machine learning (ML) models that predict symptom status from abundance data. First, we must clean the data (removing any patients who did not have CFUs), then divide the disease groups into training, testing, and validation folds, and write the models. Once the models are written, we will train, test, and validate them using those folds. The ML models we plan to implement are Logistic Regression, KNN, Elastic Net, SVR, and Random Forest, using the Python scikit-learn library. Each team member will complete the code for at least one model; who does the fourth model will be determined later. After the models are built, team members will train, test, and validate them together. Validation may be performed on a separate dataset, which is unknown at this point. For training, testing, and validation, only one disease state will be compared to the controls at a time, i.e. control vs. UTI, but not control vs. UTI vs. overactive bladder symptoms.
We plan to measure model performance using area under the curve (AUC) and accuracy scores. Other metrics may be explored.
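The cleaning step and the one-disease-versus-control comparison could look roughly like the sketch below. The toy table, the taxon names, the patient IDs, and the helper names `drop_empty_patients` and `control_vs_disease` are all made up for illustration, and the code assumes the status code is the letter prefix of each column name.

```python
import re

import pandas as pd

# Toy abundance table (hypothetical values): rows are taxa, columns are patient IDs.
abundance = pd.DataFrame(
    {
        "C001": [10, 0, 5],
        "C002": [0, 0, 0],  # no CFUs at all, so this patient should be dropped
        "E001": [0, 200, 3],
        "E002": [1, 50, 0],
        "OAB001": [7, 0, 0],
    },
    index=["taxon_a", "taxon_b", "taxon_c"],
)


def drop_empty_patients(table: pd.DataFrame) -> pd.DataFrame:
    """Remove patient columns whose total CFU count is zero."""
    totals = table.sum(axis=0)
    return table.loc[:, (totals > 0).values]


def control_vs_disease(table: pd.DataFrame, disease_code: str, control_code: str = "C"):
    """Return (X, y) with patients as rows; y is 1 for the disease group, 0 for controls."""
    # Assumption: the status code is the leading run of letters in the column name.
    prefixes = []
    for col in table.columns:
        m = re.match(r"[A-Za-z]+", str(col))
        prefixes.append(m.group(0) if m else "")
    codes = pd.Series(prefixes, index=table.columns)
    keep = codes.isin([control_code, disease_code])
    X = table.loc[:, keep.values].T  # transpose so each row is one patient
    y = (codes[keep] == disease_code).astype(int).rename("label")
    return X, y


cleaned = drop_empty_patients(abundance)
X, y = control_vs_disease(cleaned, "E")
print(X)
print(y)
```

Transposing to one row per patient matches the layout scikit-learn estimators expect (samples as rows, features as columns).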
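For reference, both scores are available in `sklearn.metrics`. The sketch below uses made-up labels and scores, and assumes "area under the curve" means the area under the ROC curve; the 0.5 decision threshold is also an assumption.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up example: 0 = control, 1 = disease.
y_true = [0, 0, 1, 1]
# Predicted probability (or decision score) for the disease class.
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is computed from the continuous scores, not from hard predictions.
auc = roc_auc_score(y_true, y_score)

# Accuracy needs hard predictions; threshold the scores at 0.5 (assumed cutoff).
y_pred = [int(s >= 0.5) for s in y_score]
acc = accuracy_score(y_true, y_pred)

print(f"AUC = {auc:.2f}, accuracy = {acc:.2f}")  # AUC = 0.75, accuracy = 0.75
```

Because each comparison is one disease group against controls, a binary AUC like this applies directly to every pairwise model.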
Context (1 pt)
Infections often depend not on a single bacterial taxon but on a community of bacteria. Infections caused by a community of bacteria are known as polymicrobial infections and are traditionally treated with broad-spectrum antibiotics. This course of treatment eradicates not only the infection but also the patient's whole microbial community. This is less than ideal, as many bacteria are symbionts within the body. Symbiotic bacteria perform a variety of tasks integral to the health of the patient: some assist in digestion, while others have been implicated in fighting infections themselves. These bacteria should not be caught in the onslaught of treatment, as removing them is detrimental to the patient's health. Our goal is to develop an algorithm that takes patient abundance data and accurately correlates symptom status with bacterial taxa, so that a more targeted treatment may be used to fight the infection.
Goals & Non-Goals (1 pt)
Goals
- Our main goal is to figure out which groups of organisms correspond to a symptom.
- Use the scikit-learn library to accomplish this
- Implement Logistic Regression, KNN, Elastic Net, SVR, and Random Forest machine learning models
Non-Goals
- Finding or using models other than the ones specified; it is not our job to build a better model than requested at the current moment.
Proposed Solution (2 pts)
We will run multiple ML models (Logistic Regression, KNN, Elastic Net, SVR, Random Forest) to identify which combinations of microbial abundances lead to a particular symptom. The data will be split into training, testing, and validation sets using k-fold cross-validation, and we will identify the optimal hyperparameters using grid search. All of these analyses will be done using the Python libraries scikit-learn and pandas.
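A minimal sketch of grid search combined with stratified k-fold cross-validation, shown for KNN only. The synthetic data from `make_classification` stands in for the real abundance table, and the parameter grid, the scaling step, the number of folds, and the `roc_auc` scoring choice are assumptions rather than settled decisions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a control-vs-disease table (patients x taxa).
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
pipeline = Pipeline(
    [("scale", StandardScaler()), ("knn", KNeighborsClassifier())]
)

# Assumed grid; the real search space is still to be decided.
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}

# Stratified folds keep the control/disease ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern should carry over to the other models by swapping the final pipeline step and its parameter grid.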
Milestones (2 pts)
Write ya comments / docs as you go plz.
Week 1
- Work through the scikit-learn tutorial (only group members who have not taken machine learning; link to tutorial)
- Get the data and determine the best way of building the model with the data at hand
Week 2
- Grid search – Karolina and Sareh
- Write a parser to separate the data by symptom status (should be easy)
- Separate data into folds for training, testing, and validation (devise some way of splitting the symptom statuses roughly equally among folds)
- Clean data (must do: some patients do not have CFUs and will be useless) – JJ, done
- Read paper from Putonti – All
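One possible way to spread the symptom statuses roughly equally across the training, testing, and validation sets is a stratified split. The sketch below uses the control and UTI group sizes from the overview (before cleaning), made-up patient IDs, and an assumed 60/20/20 split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Group sizes from the overview: 150 controls (C) and 157 UTI patients (E).
labels = ["C"] * 150 + ["E"] * 157
# Hypothetical IDs, only so there is something to split.
patient_ids = [f"{lab}{i:03d}" for i, lab in enumerate(labels)]

# First cut: 60% training, 40% held out, keeping the C/E ratio via stratify.
train_ids, rest_ids, train_y, rest_y = train_test_split(
    patient_ids, labels, test_size=0.4, stratify=labels, random_state=0
)

# Second cut: split the held-out 40% in half for testing and validation.
test_ids, val_ids, test_y, val_y = train_test_split(
    rest_ids, rest_y, test_size=0.5, stratify=rest_y, random_state=0
)

for name, split_labels in [("train", train_y), ("test", test_y), ("validation", val_y)]:
    print(name, dict(Counter(split_labels)))
```

If validation ends up using a separate dataset, the second cut can be dropped and the held-out portion used for testing alone.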
Week 3-4
- Build Algo (pick ya favorite model(s))
- KNN, Elastic Net (TBD), SVR (Sareh), and Random Forest (Karolina) machine learning models
Week 5
- Train Algo – All
Week 6
- Test Algo – All
Week 7
- Clean new data (if received) – JJ
- Validate Algo – All