Proposed Analysis: Batch effects in RNA-seq data #919
Comments
Linking the closed pull request relevant to this analysis as well: #628 |
Hi @natemella! @komalsrathi is familiar with some of these analyses through our work at CHOP, so I would love for her to weigh in here with her experience! |
Thanks, @jharenza for making the introduction. @komalsrathi, I'm excited to collaborate with you. |
Hi @natemella good to meet you virtually. So here's what we have done. Our goal was to combine 3 different RNA-seq datasets for the purpose of differential gene expression analysis:
GTEx and TGEN consist only of poly-A samples but we have stranded, poly-A and a third RNA enrichment method called Here, is the breakdown of the batches we have and as mentioned above, a batch is a combination of study-identifier and library-type separated by
We wanted to evaluate the following methods for batch-correction from the R package
In our case, we could not evaluate Here is my repo corresponding to the batch correction code: https://github.com/komalsrathi/rnaseq-batch-correction/tree/rnaseq-batch-correct/analyses/rnaseq-batch-correct/. There are a couple of scripts and I'll quickly elaborate:
Code snippet for back transforming batch corrected values: Density plot of housekeeping genes: t-SNE clustering of the entire matrix as well as housekeeping genes: I have tried to elaborate as much as possible, would appreciate your input and please feel free to ask any questions if not clear. |
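For readers who cannot open the linked scripts, here is a minimal sketch of what a back-transformation step might look like. It assumes the matrix was transformed with log2(x + 1) before correction, which is a common choice but is not confirmed in this thread. The function name `back_transform`, the `pseudocount` parameter, and the clipping of negative results to zero are illustrative assumptions, not taken from the repository.

```python
def back_transform(corrected, pseudocount=1.0):
    """Invert an ASSUMED log2(x + pseudocount) transform.

    `corrected` is a list of rows (one per gene) holding batch-corrected
    log2 values. Results below zero are clipped to 0.0 because expression
    values cannot be negative (an illustrative choice, not the repo's).
    """
    return [
        [max(2.0 ** value - pseudocount, 0.0) for value in row]
        for row in corrected
    ]
```

Clipping matters because a correction applied on the log scale can push low values below log2(1) = 0, which would otherwise invert to negative expression.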
@komalsrathi I keep getting 404s for those links - is the repository you are linking to private? |
Hi @jaclyn-taroni I am fixing it, didn't realize it is a fork of a private repo so not accessible. |
@jaclyn-taroni @jharenza I have just created a new public repo (updated my comment above). Please let me know if you have any issues. |
thanks @komalsrathi ! |
Thanks to all of you for your comments! As you may know, Nathan (a student in my lab) worked on an analysis of batch adjustment early last year. It looks like @komalsrathi has also done some work on adjusting for batch. @jaclyn-taroni or @jharenza I'm unsure of the current state of the paper and whether our previous analyses were useful or whether there's anything we could add to make it useful? We would also be happy to work with @komalsrathi if there's anything beneficial that we could add to her analyses. Or if not, that's fine, too. Please let me know your thoughts. (Sorry that I'm not as familiar as I should be with the process you are using.) Thanks! |
Hi @srp33, thanks for your comment! I will summarize the state of the analyses within this project. I am less familiar with @komalsrathi's analysis, as #919 (comment) is the first time it has been introduced to me. The short(ish) version: The results of batch correction we've seen so far from your group have not been included in the repository or project because they were not merged into the code base here. The lack of batch correction so far has resulted in a focus mostly on the larger, stranded dataset in current versions of the "overview" type display items ( As a rule, we will not use any analyses or output files in this project that do not make it through the analytical code review process. The longer version, with links to earlier discussion for context: There was an initial pull request to this project, #628, opened in March 2020 that did not get merged. Because that pull request was open for a few months, we closed it in August 2020 when the Docker image got overhauled (altering Docker was one of the outstanding issues) #628 (comment). There were a few outstanding questions on the pull request related to the utility of the batch correction, quoting @cgreene from #628 (comment) here:
There were some results posted in response to @cgreene's comments here: #448 (comment) Follow-up suggestion for how to move forward with getting the code into the repository: #448 (comment) Outcome from a meeting on April 21, 2020: #448 (comment) - those action items are repeated in this issue 👍 I would welcome pull requests for batch correction if they are coming soon! 😄 Our pull request model works best if the code additions in any one pull request are pretty limited in scope so they can be reviewed in a timely manner. (@komalsrathi may be a great candidate for a scientific review!) It is also helpful if they do not hang out for too long without updates & re-review because they can become out of date with what is in the Please let me know if you have questions - happy to chat via another medium such as our Cancer Data Science Slack! |
Just wanted to let you know that @natemella is actively working on this, and we'll have something for you as soon as we can. |
Hi @natemella are you using |
Yes, I'm using TPM. Your explanation makes a lot more sense as to why ComBat_seq hasn't worked well. Thanks for reaching out. |
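ComBat_seq fits a negative binomial model and expects raw integer counts, so normalized values such as TPM are not suitable input. A rough sanity check along these lines can show which kind of matrix is in hand before choosing a method. This is only a sketch: the function names and tolerances are hypothetical, and the TPM check assumes the matrix still contains every gene (a filtered matrix will not sum to one million per sample).

```python
def looks_like_raw_counts(matrix, tol=1e-9):
    """Return True if every value is a non-negative whole number."""
    return all(
        value >= 0 and abs(value - round(value)) <= tol
        for row in matrix
        for value in row
    )


def looks_like_tpm(matrix, rel_tol=1e-3):
    """Return True if every sample (column) sums to roughly one million."""
    n_samples = len(matrix[0])
    column_sums = [sum(row[j] for row in matrix) for j in range(n_samples)]
    return all(abs(total - 1e6) <= rel_tol * 1e6 for total in column_sums)
```

Rows are genes and columns are samples in both helpers.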
We would like to propose a continuation of issue #448.
What are the scientific goals of the analysis?
Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than in microarray data. However, batch effects may bias the conclusions of studies that do not account for them. We wish to evaluate the extent to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.
What methods do you plan to use to accomplish the scientific goals?
We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects. Currently, we are unsure whether batch is known for these samples. If batch is known, we will adjust for it using ComBat. If not, we will use SVA to identify surrogate variables and plot those against variables that may have a confounding effect.
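To make the idea of adjusting for a known batch concrete, here is a deliberately simplified sketch. It is not ComBat: it only removes per-batch mean shifts for a single gene, with no empirical Bayes shrinkage, no variance adjustment, and no protection of biological covariates. The function name `remove_batch_means` and the toy inputs are hypothetical.

```python
from statistics import mean


def remove_batch_means(values, batches):
    """Shift each batch so its mean matches the overall mean (one gene).

    `values` holds one gene's expression across samples and `batches`
    holds the batch label of each sample, in the same order.
    """
    overall_mean = mean(values)
    batch_means = {
        batch: mean([v for v, b in zip(values, batches) if b == batch])
        for batch in set(batches)
    }
    return [v - batch_means[b] + overall_mean for v, b in zip(values, batches)]
```

For example, `remove_batch_means([1.0, 3.0, 5.0, 7.0], ["a", "a", "b", "b"])` gives `[3.0, 5.0, 3.0, 5.0]`. If batch is confounded with a biological variable such as histology, a naive adjustment like this removes the biological difference along with the batch shift, which is why that confounding needs to be checked first.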
What input data are required for this analysis?
RNA-Seq data. We will start with gene-summarized values but may also work with transcript-level data.
How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?
10-20 hours. It will require at least two steps (examining the data using BatchQC and applying ComBat/sva).
Who will complete the analysis (please add a GitHub handle here if relevant)?
@nathanmella and @srp33
What relevant scientific literature relates to this analysis?
https://www.ncbi.nlm.nih.gov/pubmed/20838408
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167063/
https://www.ncbi.nlm.nih.gov/pubmed/22257669
As shown in our previous work, we identified the extent to which batches and histologies are confounded. We plan to finish this analysis and complete the following: