Start Date: May 11, 2020
Email: [email protected]
- Paul Morley - [email protected]
- Noelle Noyes - [email protected]
- Enrique Doster - [email protected]
Slack group: https://meg-research.slack.com
- Join us here for general discussion and help with course content
- Slack invite link
- This link expires every 30 days, so let us know if it doesn't work for you.
- This Dropbox folder contains all of the videos from our Zoom course sessions and recordings from a previous MEG bioinformatics workshop.
https://mybinder.org/v2/gh/EnriqueDoster/MEG_intro_stats_course/master?urlpath=rstudio
These lessons are designed to introduce researchers to the R programming language for statistical analysis of metagenomic sequencing data. While we are primarily developing these training resources for the Microbial Ecology Group (MEG), we would love to get your input on improvements to any component so that we can one day provide this as a useful public resource. As the lessons are meant to be an informal collection of resources and tutorials, we have liberally used parts and pieces of other online lessons and tailored them for our purposes. We attempt to give credit when possible by linking the original source, and we are happy to hear recommendations for other resources to include.
The focus of Lesson 1 is to help students install R on their computer, install the necessary R packages, and start playing around with R's functionality. In Lesson 2, students will learn how to calculate and plot summary statistics, including alpha-diversity indices to summarize the microbiome and resistome. In Lesson 3, we'll dive into count normalization using cumulative sum scaling (CSS), ordination with non-metric multidimensional scaling (NMDS), differential abundance testing with a zero-inflated Gaussian (ZIG) model, and advanced plotting using ggplot2.
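As a taste of the alpha-diversity summaries mentioned for Lesson 2, here is a minimal base-R sketch that computes richness and the Shannon index for each sample. The count matrix and its feature and sample names are made up for illustration, and the lessons themselves may use a package function instead of this hand-written version.

```r
# Hypothetical count matrix: rows are features (taxa or AMR genes),
# columns are samples. The values are invented for illustration only.
counts <- matrix(
  c(10, 0, 5,
    10, 3, 5,
     0, 7, 5,
     0, 0, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("feature", 1:4), paste0("sample", 1:3))
)

# Richness: number of features observed (count > 0) in each sample.
richness <- colSums(counts > 0)

# Shannon index: H = -sum(p * log(p)), where p is the relative abundance
# of each observed feature within a sample (natural log).
shannon <- apply(counts, 2, function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
})

print(richness)
print(round(shannon, 3))
```

In this toy data, sample3 has four equally abundant features, so its Shannon index equals log(4), the maximum possible for a richness of 4.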
This is the first time we are going through these lessons with a group of students, so please participate in the Slack group and ask any questions you have!
- We'll organize a group to all take the same lesson together and we'll have a virtual meeting once per week for 30 minutes to go over each of the steps in that lesson. The majority of the work will be self-directed and on your own time, but we encourage you to work in groups and participate in asking questions in the slack group. If you don't have any questions and find this all extremely easy, please help others with their questions and help us improve our lessons.
- There is 1 "deliverable" per step. Some steps require something be sent to "[email protected]" or will have a link to a corresponding set of questions.
We wholeheartedly encourage students to independently troubleshoot the majority of problems they might encounter by:
- googling it (or using another search engine)
- getting help from other students in our Slack channel #2020-stats-tutorial-group
- searching bioinformatics forums such as stackoverflow.com, biostars.org, and seqanswers.com
In addition, we will have "techie time" every Monday at 12pm-2pm MST on Slack to help address any ongoing issues.
Upon completion of these lessons, students will:
- have their computer set up with the R and RStudio software
- know how to read-in count matrices from bioinformatic analysis of sequence data
- be able to explore and summarize bioinformatic results using
- diversity indices and box plots
- ordination with non-metric multidimensional scaling (NMDS)
- heatmaps
- be familiar with common statistical techniques such as:
- Wilcoxon test
- Generalized linear models
- Analysis of similarities (ANOSIM)
- Differential abundance testing using a zero-inflated Gaussian (ZIG) model
MEG resources
- MEG bioinformatic term glossary
- AMR ++ pipeline overview
- Bioinformatic AMR and 16S pipeline overview
- Bioinformatics statistics overview
R programming
- RStudio cheatsheets
- This website has tons of helpful cheatsheets for various R packages and analysis methods. It also includes cheatsheets translated into other languages.
- YaRrr! The Pirate’s Guide to R
- This is a free online book that goes over many useful topics in a quirky, but fun way! Follow along with our simplified R scripts in Lesson 1 and reference this book if you have any other questions.
- R programming Coursera course
- This free Coursera course goes in-depth into the functionality of R. It combines videos with example R scripts for you to follow along with. We recommend this course after you have been playing around with R a bit and want to learn more about the details of how R works.
- Introduction to R workshop
- We haven't personally tried this workshop, but they have a combination of videos, slides, and R code for various topics.
Data visualization
- dataviz project
- This website is for a private company, but they have a great interface for exploring different figure types
- Visual vocabulary
- Handy outline and explanation for the uses of different plots.
- You can also check out this interactive figure of the same material
- FT Visual Journalism Team
- Awesome site with articles covering various topics and with the emphasis on creating awesome graphics to convey
- Interactive Jupyter notebooks
Command-line
- Explain shell
- cool website that explains bash commands piece by piece
Statistics
- GUide to STatistical Analysis in Microbial Ecology (GUSTA ME)
- LHS 610: Exploratory Data Analysis for Health
- We haven't personally tried this course, but they provide great videos and code examples for learning how to explore data using R.
- #bioinformatics live twitter feed
- R-specific resources
- ggpubr
- Nice package for "publication-ready" figures.
- Harvard's Data Science: R Basics
- Collaborative spreadsheet of resources
- Choose the right test
- Batch effects
- Tackling the widespread and critical impact of batch effects in high-throughput data
- Why Batch Effects Matter in Omics Data, and How to Avoid Them
- Beware the bane of batch effects
- Mitigating the adverse impact of batch effects in sample pattern detection
- Identifying and mitigating batch effects in whole genome sequencing data
We'll start on May 11, 2020 at 12pm MT and have weekly virtual meetings on Zoom. Please check Slack for updates!
Lesson 1 - Getting set up with R
- Step 1 - Download and install R/RStudio
- Start: May 11, 2020
- Requested completion: May 17, 2020
- If you finish with time to spare, move on to Step 2. In the case that everyone finishes Step 2 before our next meeting on May 18, we can move on to Step 3 ahead of schedule.
- Step 2 - Install R packages
- Start: May 18, 2020
- Requested completion: May 31, 2020
- Step 3 - Introduction to R
- Start: May 18, 2020
- Requested completion: May 31, 2020
- Step 4 - Reading-in data to R
- Start: June 1, 2020
- Requested completion: June 7, 2020
Lesson 2 - Data exploration and basic statistics
- Scheduled to begin June 8, 2020. Dates will be updated as we finalize scheduling.
- Step 1 - Calculating summary statistics
- Start: June 9, 2020
- Requested completion: June 14, 2020
- Step 2 - Introduction to plotting
- Start: June 15, 2020
- Requested completion: June 21, 2020
- Step 3 - Basic statistical testing (Wilcoxon test, linear models)
- Start: June 22, 2020
- Requested completion: June 28, 2020
Lesson 3 - Advanced statistical analyses and plotting
- Scheduled to begin July 9, 2020. Dates will be updated as we finalize scheduling.
- Step 1 - Count normalization
- Start: July 9, 2020
- Requested completion: July 15, 2020
- Step 2 - Ordination with non-metric multidimensional scaling (NMDS) and statistical comparisons
- Start: July 16, 2020
- Requested completion: July 22, 2020
- Step 3 - Differential abundance testing with a zero-inflated Gaussian (ZIG) model
- Start: July 23, 2020
- Requested completion: July 29, 2020
- Step 4 - Advanced plotting (heatmaps, volcano plots, ordination)
- Start: July 30, 2020
- Requested completion: August 5, 2020
- Step 5 - Learn to run R GUI code for exploratory figures
- Start: August 6, 2020
- Requested completion: August 12, 2020
Lesson 1 deliverables
- Steps 1 and 2 - Students must send a screenshot of their "sessionInfo()" output to confirm that R and RStudio are installed (Step 1) and that the necessary R packages are installed (Step 2).
- Step 3 - After familiarizing themselves with some basic R functionality, students will submit a short quiz.
- Step 4 - Given a count matrix file, taxonomy file, and sample metadata file, students must read-in all data to R and submit a short quiz.
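For the "sessionInfo()" deliverable, a minimal sketch of what to run in the R console might look like the following. The package vector only contains ggplot2 because it is the one package named in the course overview; extend it with whatever Step 2 asks you to install. Saving the output to a text file is an optional extra, not a course requirement, and the file name is hypothetical.

```r
# Check whether the packages you were asked to install are available.
# Only ggplot2 is listed here (an assumption); add the rest of the Step 2 list.
pkgs <- c("ggplot2")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)  # TRUE for each package that can be loaded

# Print the R version, operating system, and attached/loaded packages.
# This printed output is what the screenshot should capture.
print(sessionInfo())

# Optional: save the same text to a file (hypothetical file name,
# written to a temporary directory so nothing is overwritten).
out_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(sessionInfo()), out_file)
```

Packages only appear in the "sessionInfo()" output once they have been loaded or attached in the current session, for example with "library(ggplot2)".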
Lesson 2 deliverables
- Step 1 - Students will calculate summary statistics for the microbiome/resistome and will submit a quiz.
- Step 2 - Students will learn to create different figures using ggplot2. They will follow instructions to create and submit two custom figures.
- Step 3 - Using the summary statistics from Step 1, students will test for statistical differences between sample groups.
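The kind of comparison described for Step 3 can be sketched with base R's "wilcox.test()". The data frame, group labels, and Shannon values below are hypothetical and chosen only to make the example self-contained; the actual lesson data and variable names will differ.

```r
# Hypothetical per-sample Shannon diversity values for two sample groups.
diversity <- data.frame(
  group   = rep(c("treated", "control"), each = 5),
  shannon = c(1.2, 1.5, 1.1, 1.4, 1.3,
              2.1, 2.4, 1.9, 2.2, 2.0)
)

# Wilcoxon rank-sum (Mann-Whitney) test: does the distribution of
# Shannon diversity differ between the two groups?
result <- wilcox.test(shannon ~ group, data = diversity)
print(result)

# The p-value can be pulled out for reporting.
print(result$p.value)
```

Because the two made-up groups do not overlap at all and there are no tied values, R computes an exact two-sided p-value of 2/252 (about 0.008). Real data will rarely separate this cleanly.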
Lesson 3 deliverables
- TBD
The development of this tutorial was supported in part by USDA NIFA Grant No. 2018-51300-28563 and USDA NIFA Grant No. 2019-67017-29110, University of Minnesota College of Veterinary Medicine, The VERO Program at Texas A&M University and West Texas A&M University, and the State of Minnesota Agricultural Research, Education, Extension and Technology Transfer program.