More and more experimental data are becoming available, and they have fueled groundbreaking discoveries. Beyond the original intent of the experiments, these data can be used to discover even more. This is where machine learning (ML) comes in. We can use computers to learn from data and generate models that predict a biological phenomenon of interest, e.g., whether knocking out a gene will be lethal, or which genetic variants meaningfully predict a phenotype of interest. To help you learn more about ML, we created this workshop as an introduction to the following topics:
- What is machine learning and why is it useful?
- How does machine learning work?
- What are some example machine learning applications in biology?
- How can we feed data into machine learning tools to make discoveries?
- What are the best practices when doing machine learning?
- Where can you go to learn more?
The workshop will include presentations, discussions, and hands-on sections for those who can complete the pre-workshop components without issue.
- Our lab website
- Workshop Jupyter notebooks
- An example dataset for running an ML project, based on this paper.
The workshop example is provided as a Jupyter notebook, a document created with the Jupyter Lab or Jupyter Notebook applications. A notebook can contain both code in popular languages such as Python and R, and text in the form of paragraphs, equations, figures, links, etc.
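To make the idea of "code plus text in one document" concrete, here is a minimal sketch of what a notebook file holds. It assumes the version 4 notebook format, in which an `.ipynb` file is JSON with a list of cells. The cell contents and the helper name `count_cells` are made up for illustration and are not taken from the workshop notebooks.

```python
import json

# A minimal notebook in the version 4 layout: one text cell and one code cell.
# The cell contents are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Notes\n", "Text, equations, and links go in cells like this."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello from a code cell')"],
        },
    ],
}


def count_cells(nb):
    """Count the cells of each type in a notebook dictionary."""
    counts = {}
    for cell in nb["cells"]:
        counts[cell["cell_type"]] = counts.get(cell["cell_type"], 0) + 1
    return counts


# Serialize to the JSON text an .ipynb file would hold, then read it back.
text = json.dumps(notebook, indent=1)
print(count_cells(json.loads(text)))
```

Jupyter Lab reads and writes this structure for you, so you normally never edit it by hand.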
To follow what we have shown in the workshop, you need the following:
- Git: lets you "clone" this repository to your computer so you can experiment with it.
- Jupyter Lab: the application for viewing, editing, and executing the code in the notebook.
- Scikit-Learn and others: the software packages needed to run the code in Jupyter Lab.
Please make sure you:
- Have a look at the pre-workshop notebook through GitHub and take some notes on the questions asked.
- Watch this video, and this video on getting Jupyter notebook to run.
You can download the notebooks and data from this repository, preferably by setting up the following:
- `git`, a version control software (i.e., a tool to keep track of updates to code) widely used by people writing software in any language.
- GitHub, a code hosting platform that uses `git` for version control and collaboration (i.e., many people can work on the same code).
If you don't have Git and/or a GitHub account, do the following:
- Create a GitHub Account
- Download and install GitHub Desktop
- Or you can use Git if you are familiar with version control and the command-line interface. Note that the following info is for using GitHub Desktop.
- Clone the ML_workshop repository by following these instructions and the following screenshot.
- Note: You can specify where the repository goes on your computer. We suggest leaving the default and remembering where it is, since we need it later.
- Navigate to the location where the cloned repository is and confirm that it is there.
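If you prefer to confirm the clone from Python rather than a file browser, a minimal sketch follows. The path in `repo_dir` is a hypothetical default, so replace it with wherever your clone actually lives. The helper `list_notebooks` is illustrative and assumes the notebooks sit at the top level of the repository.

```python
from pathlib import Path


def list_notebooks(repo_dir):
    """Return sorted notebook file names in repo_dir, or None if the folder is missing."""
    repo = Path(repo_dir).expanduser()
    if not repo.is_dir():
        return None
    return sorted(p.name for p in repo.glob("*.ipynb"))


# Hypothetical location: replace with the folder GitHub Desktop cloned into.
repo_dir = Path.home() / "Documents" / "GitHub" / "ML_workshop"
found = list_notebooks(repo_dir)
if found is None:
    print(f"No folder at {repo_dir}; check the clone location.")
else:
    print(f"{len(found)} notebook(s) found in {repo_dir}")
```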
Anaconda is a free and open-source distribution of the programming languages Python and R, and a widely used platform for computational and data science applications.
- Download the Python 3.X version of Anaconda.
- Install Anaconda using the instructions.
- Open your terminal on Mac or PC
- Note: For PC, you need to open the terminal by "Running as Administrator". If you are not familiar with this, see this post for more info.
- Issue the following command to make sure the Anaconda installation is complete:
conda list
The above command allows you to see what software packages have been installed.
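The same check can be scripted. The sketch below looks for `conda` on your PATH and, if it is found, reads the package list in JSON form. The helper name `installed_packages` is illustrative, and the sketch assumes that `conda list --json` returns a list of records with `name` and `version` fields.

```python
import json
import shutil
import subprocess


def installed_packages():
    """Return {name: version} from `conda list --json`, or None if conda is unavailable."""
    conda = shutil.which("conda")
    if conda is None:
        return None  # conda is not on the PATH
    result = subprocess.run([conda, "list", "--json"], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # conda was found but the command failed
    return {pkg["name"]: pkg["version"] for pkg in json.loads(result.stdout)}


packages = installed_packages()
if packages is None:
    print("conda was not found or did not run; check the Anaconda installation.")
else:
    print(f"conda reports {len(packages)} installed packages.")
```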
Conda is a package and environment management system. It handles installing software packages on your computer. It also creates and manages virtual environments, where each environment holds a specific set of software for a general category of tasks.
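Because each environment has its own Python and its own packages, it helps to know which one you are running in. The sketch below reports that from inside Python. It assumes conda sets the `CONDA_DEFAULT_ENV` variable when an environment is activated; the helper name `current_environment` is illustrative.

```python
import os
import sys


def current_environment():
    """Describe the environment this Python interpreter is running in."""
    return {
        # Name of the active conda environment; None if no environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Folder that the running interpreter belongs to.
        "prefix": sys.prefix,
    }


info = current_environment()
print(info)
```

After activating an environment, `conda_env` should show that environment's name and `prefix` should point inside its folder.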
- Create an `ml_workshop` environment and activate it:
conda create -n ml_workshop python
conda activate ml_workshop
- Note: When prompted with `Proceed`, type `y`.
- Install software packages and their dependencies:
conda install jupyterlab ipykernel ipywidgets matplotlib pandas scikit-learn seaborn shap tqdm
pip install imbalanced-learn
- Run Jupyter Lab
- If you use Linux or macOS:
jupyter lab
- If you use a PC and your GitHub folder is on the C:/ drive, do:
jupyter lab --notebook-dir=C:/
- If you use a PC and your GitHub folder is on the D:/ drive, do:
jupyter lab --notebook-dir=D:/
- In the Jupyter Lab window that opens, on the left panel, navigate to ML_workshop, the directory where the cloned GitHub repository is stored.
- Open `ML_workshop-part_a-preparation.ipynb`.
- Run each code element by pressing `SHIFT + ENTER`.
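If a notebook cell fails with an import error, a quick way to see what is missing is to check each package from a new cell. This is a minimal sketch, not part of the workshop notebooks. The list uses import names, which differ from the install names in two cases: scikit-learn is imported as `sklearn` and imbalanced-learn as `imblearn`. The helper name `missing_modules` is illustrative.

```python
from importlib.util import find_spec

# Import names for the packages installed in the steps above.
REQUIRED = [
    "jupyterlab", "ipykernel", "ipywidgets", "matplotlib", "pandas",
    "sklearn", "seaborn", "shap", "tqdm", "imblearn",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing: " + ", ".join(missing))
else:
    print("All packages found.")
```

Anything listed as missing can be installed with the same `conda install` or `pip install` commands used earlier, run inside the activated environment.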