This repository contains training material developed by the Microbial Ecology Group to introduce researchers to the R programming language for statistical analysis of metagenomic sequencing data.

EnriqueDoster/MEG_intro_stats_course

Introduction to statistical analysis of metagenomic sequencing data

Course syllabus

Start Date: May 11, 2020

Email: [email protected]

Slack group: https://meg-research.slack.com

  • Join us here for general discussion and help with course content
  • Slack invite link
    • This link expires every 30 days, so let us know if it doesn't work for you.

Dropbox link

  • This Dropbox folder contains all of the videos from our Zoom course sessions, plus recordings from a previous MEG bioinformatics workshop.

Binder

RStudio Binder

https://mybinder.org/v2/gh/EnriqueDoster/MEG_intro_stats_course/master?urlpath=rstudio

Course content

Summary

These lessons are designed to introduce researchers to the R programming language for statistical analysis of metagenomic sequencing data. While we are primarily developing these training resources for the Microbial Ecology Group (MEG), we would love to get your input on improvements to any component so that we can one day provide this as a useful public resource. As the lessons are meant to be an informal collection of resources and tutorials, we have liberally used parts and pieces of other online lessons and tailored them for our purposes. We attempt to give credit when possible by linking the original source, and we are happy to hear recommendations for other resources to include.

The focus of Lesson 1 is to help students install R on their computer, install the necessary R packages, and start playing around with R's functionality. In Lesson 2, students will learn how to calculate and plot summary statistics, including alpha-diversity indices to summarize the microbiome and resistome. In Lesson 3, we'll dive into count normalization using cumulative sum scaling (CSS), ordination with non-metric multidimensional scaling (NMDS), differential abundance testing with a zero-inflated Gaussian (ZIG) model, and advanced plotting using ggplot2.
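To give a rough feel for what cumulative sum scaling does, here is a minimal base-R sketch. It is a simplified illustration, not the course's code: the toy counts, the function name css_normalize, the fixed quantile p = 0.5, and the scaling constant of 1000 are all assumptions made for this example. In practice the quantile is usually chosen from the data by a dedicated package rather than fixed by hand.

```r
# Toy count matrix: rows are features, columns are samples (made-up values).
counts <- matrix(
  c(0, 2, 10, 50,
    1, 4, 20, 100,
    0, 0,  5,  30),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Simplified CSS: for each sample, the normalization factor is the sum of
# the non-zero counts at or below that sample's p-th quantile of non-zero
# counts. Each column is then divided by (factor / scale).
css_normalize <- function(counts, p = 0.5, scale = 1000) {
  norm_factors <- apply(counts, 2, function(x) {
    nonzero <- x[x > 0]
    sum(nonzero[nonzero <= quantile(nonzero, p)])
  })
  sweep(counts, 2, norm_factors / scale, "/")
}

normalized <- css_normalize(counts)
print(round(normalized, 1))
```

The idea is that the scaling factor ignores the most abundant features in each sample, so a handful of very high counts does not dominate the normalization the way it would with a simple total-sum approach.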

This will be the first time that we attempt going through these lessons with a group of students, so please participate in the Slack group and ask any questions you have!

  • We'll organize a group to take each lesson together, with a 30-minute virtual meeting once per week to go over the steps in that lesson. Most of the work will be self-directed and on your own time, but we encourage you to work in groups and ask questions in the Slack group. If you don't have any questions and find this all extremely easy, please help others with their questions and help us improve our lessons.
    • There is one "deliverable" per step. Some steps require something to be sent to "[email protected]" or will have a link to a corresponding set of questions.

Techie time

We wholeheartedly encourage students to independently troubleshoot the majority of problems they might encounter by:

  • googling it (or using another search engine)
  • getting help from other students in our Slack channel #2020-stats-tutorial-group
  • searching bioinformatics forums such as stackoverflow.com, biostars.org, and seqanswers.com

In addition, we will have "techie time" every Monday at 12pm-2pm MST on Slack to help address any ongoing issues.

Learning objectives:

Upon completion of these lessons, students will:

  • have their computer set up with the R and RStudio software
  • know how to read-in count matrices from bioinformatic analysis of sequence data
  • be able to explore and summarize bioinformatic results using
    • diversity indices and box plots
    • ordination with non-metric multidimensional scaling (NMDS)
    • heatmaps
  • be familiar with common statistical techniques such as:
    • Wilcoxon test
    • Generalized linear models
    • Analysis of similarities (ANOSIM)
    • Differential abundance testing using a zero-inflated Gaussian (ZIG) model
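As a small taste of two of these objectives, the sketch below computes a Shannon alpha-diversity index per sample and compares two groups with a Wilcoxon rank-sum test, using only base R. The count values, sample names, and group labels are invented for illustration, and the shannon helper is a hand-written assumption rather than the function used in the lessons.

```r
# Made-up counts: 4 features (rows) by 6 samples (columns).
counts <- matrix(
  c(10, 10, 10, 10,
    12,  9, 11,  8,
     8, 10,  9, 13,
    40,  1,  1,  0,
    35,  2,  0,  1,
    50,  3,  1,  1),
  nrow = 4,
  dimnames = list(paste0("taxon", 1:4), paste0("sample", 1:6))
)

# Shannon index: -sum(p * log(p)) over features with non-zero counts,
# where p is each feature's proportion within the sample.
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

# One diversity value per sample (column).
alpha <- apply(counts, 2, shannon)

# Hypothetical grouping: first three samples in group A, last three in B.
group <- factor(c("A", "A", "A", "B", "B", "B"))

# Wilcoxon rank-sum test comparing diversity between the two groups.
result <- wilcox.test(alpha ~ group)

print(round(alpha, 3))
print(result$p.value)
```

With only three samples per group the test has very little power, so treat this purely as a demonstration of the mechanics.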

Resources:

MEG resources

R programming

  • RStudio cheatsheets
    • This website has tons of helpful cheatsheets for various R packages and analysis methods. It also includes cheatsheets translated into other languages.
  • YaRrr! The Pirate’s Guide to R
    • This is a free online book that goes over many useful topics in a quirky, but fun way! Follow along with our simplified R scripts in Lesson 1 and reference this book if you have any other questions.
  • R programming coursera course
    • This free Coursera course goes in-depth on the functionality of R. It combines videos with example R scripts for you to follow along with. We recommend this course after you have been playing around with R a bit and want to learn more about the details of how R works.
  • Introduction to R workshop
    • We haven't personally tried this workshop, but they have a combination of videos, slides, and R code for various topics.

Data visualization

Command-line

  • Explain shell
    • cool website that explains bash commands piece by piece

Statistics

Timeline

We'll start on May 11, 2020 at 12pm MT and have weekly virtual meetings on Zoom. Please check Slack for updates!

Lesson 1 - Getting set up with R

  • Step 1 - Download and install R/RStudio
    • Start: May 11, 2020
    • Requested completion: May 17, 2020
      • If you finish with time to spare, move on to Step 2. In the case that everyone finishes Step 2 before our next meeting on May 18, we can move on to Step 3 ahead of schedule.
  • Step 2 - Install R packages
    • Start: May 18, 2020
    • Requested completion: May 31, 2020
  • Step 3 - Introduction to R
    • Start: May 18, 2020
    • Requested completion: May 31, 2020
  • Step 4 - Reading-in data to R
    • Start: June 1, 2020
    • Requested completion: June 7, 2020

Lesson 2 - Data exploration and basic statistics

  • Scheduled to begin June 8, 2020. Dates will be updated as we finalize scheduling.
  • Step 1 - Calculating summary statistics
    • Start: June 9, 2020
    • Requested completion: June 14, 2020
  • Step 2 - Introduction to plotting
    • Start: June 15, 2020
    • Requested completion: June 21, 2020
  • Step 3 - Basic statistical testing (Wilcoxon, linear models)
    • Start: June 22, 2020
    • Requested completion: June 28, 2020

Lesson 3 - Advanced statistical analyses and plotting

  • Scheduled to begin July 9, 2020. Dates will be updated as we finalize scheduling.
  • Step 1 - Count normalization
    • Start: July 9, 2020
    • Requested completion: July 15, 2020
  • Step 2 - Ordination with non-metric multidimensional scaling (NMDS) and statistical comparisons
    • Start: July 16, 2020
    • Requested completion: July 22, 2020
  • Step 3 - Differential abundance testing with a Zero-inflated Gaussian model
    • Start: July 23, 2020
    • Requested completion: July 29, 2020
  • Step 4 - Advanced plotting (heatmaps, volcano plots, ordination)
    • Start: July 30, 2020
    • Requested completion: August 5, 2020
  • Step 5 - Learn to run R GUI code for exploratory figures
    • Start: August 6, 2020
    • Requested completion: August 12, 2020

Deliverables:

Lesson 1 deliverables

  • Steps 1 and 2 - Students must send a screenshot of their "sessionInfo()" output to confirm that R and RStudio are installed (Step 1) and that the necessary R packages are installed (Step 2).
  • Step 3 - After familiarizing themselves with some basic R functionality, students will submit a short quiz.
  • Step 4 - Given a count matrix file, taxonomy file, and sample metadata file, students must read-in all data to R and submit a short quiz.
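The Lesson 1 deliverables can be sketched roughly as follows. This is only an illustration under stated assumptions: the file layout (comma-separated, features in rows, samples in columns), the column names, and the toy values are hypothetical, and the example writes its own temporary files so that it runs anywhere. The actual course files may be formatted differently.

```r
# Steps 1 and 2: sessionInfo() reports the R version and loaded packages.
info <- sessionInfo()
print(info$R.version$version.string)

# Step 4: create small hypothetical input files in a temporary location.
count_file <- tempfile(fileext = ".csv")
writeLines(c("feature,sample1,sample2",
             "geneA,5,0",
             "geneB,12,7"), count_file)

metadata_file <- tempfile(fileext = ".csv")
writeLines(c("sample,treatment",
             "sample1,control",
             "sample2,treated"), metadata_file)

# Read the count table with feature names as row names.
counts <- read.csv(count_file, row.names = 1, check.names = FALSE)

# Read the sample metadata with sample names as row names.
metadata <- read.csv(metadata_file, row.names = 1)

# Check that every count column has a matching metadata row, in order.
stopifnot(identical(colnames(counts), rownames(metadata)))

# Many analysis functions expect a numeric matrix rather than a data frame.
count_matrix <- as.matrix(counts)
print(count_matrix)
```

Checking that sample names line up between the count matrix and the metadata early on tends to prevent confusing errors in later analysis steps.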

Lesson 2 deliverables

  • Step 1 - Students will calculate summary statistics for the microbiome/resistome and will submit a quiz.
  • Step 2 - Students will learn to create different figures using ggplot2. They will follow instructions to create and submit two custom figures.
  • Step 3 - Using the summary statistics from Step 1, students will test for statistical differences between sample groups.

Lesson 3 deliverables

  • TBD

Funding Information:

The development of this tutorial was supported in part by USDA NIFA Grant No. 2018-51300-28563 and USDA NIFA Grant No. 2019-67017-29110, University of Minnesota College of Veterinary Medicine, The VERO Program at Texas A&M University and West Texas A&M University, and the State of Minnesota Agricultural Research, Education, Extension and Technology Transfer program.
