Add sars anvil tutorial #4250

Merged
merged 36 commits into from
Jul 10, 2023
36 commits
853ce2e
Publish changes
nakucher Jul 2, 2023
935a145
add-sars-in-anvil-screenshots
nakucher Jul 2, 2023
6c093f5
add-links-for-images
nakucher Jul 2, 2023
8b8b50c
update-contributors.yaml
nakucher Jul 2, 2023
77db020
update-contributors.yaml
nakucher Jul 2, 2023
b4cd870
add-time-estimation
nakucher Jul 2, 2023
f5ab929
add-billing-info
nakucher Jul 2, 2023
61e378e
add-conclusion
nakucher Jul 2, 2023
0f4c153
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 4, 2023
624f163
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 4, 2023
21dbcf4
add-funder-update-toc
nakucher Jul 4, 2023
f95636f
add-qc-faqs
nakucher Jul 4, 2023
9df64c4
fix-funding-key
nakucher Jul 4, 2023
51db45f
test-<figure>-for-yt-video-lint
nakucher Jul 4, 2023
4b04e8d
fix-video-target
nakucher Jul 4, 2023
ad69e96
add-zenodo-and-minor-fixes
nakucher Jul 4, 2023
06b2927
Merge branch 'galaxyproject:main' into add-sars-anvil-tutorial
nakucher Jul 4, 2023
c4e3476
Merge branch 'main' into add-sars-anvil-tutorial
nakucher Jul 10, 2023
efa1c32
add-tag
nakucher Jul 10, 2023
d2c53b4
fix-image-links
nakucher Jul 10, 2023
8d80c23
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 10, 2023
851b0c2
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 10, 2023
ccc4b22
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 10, 2023
8ac91d1
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 10, 2023
834151f
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 10, 2023
c171a91
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 10, 2023
35a7b5d
Update topics/sequence-analysis/tutorials/sars-with-galaxy-on-anvil/t…
nakucher Jul 10, 2023
bebaf80
Detect both flavours of bad tool links
hexylena Jul 10, 2023
14f6f74
fix gtn link detection
hexylena Jul 10, 2023
96a7ccc
Extract quality score into its own FAQ
hexylena Jul 10, 2023
4f14145
Use new FAQ
hexylena Jul 10, 2023
e8479e1
fix-tool-link
nakucher Jul 10, 2023
cd20bd4
fix qc linting complaints
hexylena Jul 10, 2023
efa9c8a
missed one
hexylena Jul 10, 2023
ac9aade
ensure it does not crash if provided EN
hexylena Jul 10, 2023
1c2d1c4
fix-fastqc-tool-id
nakucher Jul 10, 2023
53 changes: 52 additions & 1 deletion CONTRIBUTORS.yaml
@@ -163,6 +163,10 @@ aurelienmoumbock:
twitter: FMoumbock
joined: 2022-02

avahoffman:
name: Ava Hoffman
joined: 2023-06

avani-k:
name: Avani Khadilkar
email: [email protected]
@@ -291,6 +295,10 @@ CameronFRWatson:
email: [email protected]
orcid: 0000-0002-6942-2469

cansavvy:
name: Candace Savonen
joined: 2023-06

cat-bro:
name: Catherine Bromhead
matrix: 'cat-bro:matrix.org'
@@ -376,6 +384,10 @@ cstritt:
name: Christoph Stritt
joined: 2022-03

cutsort:
name: Frederick Tan
joined: 2023-06

d-salgado:
name: David Salgado
joined: 2022-10
@@ -447,6 +459,10 @@ eancelet:
joined: 2021-01
elixir_node: fr

ehumph:
name: Elizabeth Humphries
joined: 2023-06

ElectronicBlueberry:
name: Laila Los
joined: 2023-04
@@ -818,6 +834,10 @@ jsaintvanne:
name: Julien Saint-Vanne
joined: 2020-01

jtleek:
name: Jeffrey T. Leek
joined: 2023-06

jxtx:
name: James Taylor
joined: 2018-06
@@ -826,6 +846,10 @@ jxtx:

His impacts on the Galaxy community, have been incredible, and his loss is keenly felt.

katherinecox:
name: Katherine Cox
joined: 2023-06

katrinleinweber:
name: Katrin Leinweber
email: [email protected]
@@ -1120,6 +1144,11 @@ nagoue:
orcid: 0000-0003-2750-1473
joined: 2019-07

nakucher:
name: Natalie Kucher
email: [email protected]
joined: 2023-06

natefoo:
name: Nate Coraor
matrix: 'natefoo:matrix.org'
@@ -1334,6 +1363,10 @@ robertmand:
joined: 2021-10
elixir_node: uk

robertmeller:
name: Robert Meller
joined: 2023-06

reginaesinamabotsi:
name: Regina Esinam Abotsi
joined: 2018-06
@@ -1782,6 +1815,24 @@ elixir-converge:
funding_statement: |
ELIXIR CONVERGE is connecting and align ELIXIR Nodes to deliver sustainable FAIR life-science data management services. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement № 871075

nhgri-gdscn:
name: National Human Genome Research Institute Genomic Data Science Community Network
github: false
joined: 2023-06
avatar: https://www.genome.gov/themes/custom/nhgri/assets/global/NHGRI-logo.svg
url: https://www.genome.gov/
funder: true
funding_id: 75N92022P00232
Member

Looking for the funding IDs I find links like
https://reporter.nih.gov/search/15E8C00F4784C1D77598B8961CAA4A01A2FFCEB861BF/projects?shared=true&legacy=1&sl=15E8C00F4784C1D77598B8961CAA4A01A2FFCEB861BF

but that seems to be broken, hmm. Normally we link out to those pages but here I'm not sure it's possible which is unfortunate

Contributor Author

Hmm yes it looks like RePORTER has been down for a little while :(


nhgri-anvil:
name: National Human Genome Research Institute Genomic Data Science Analysis, Visualization, and Informatics Lab-Space
github: false
joined: 2023-06
avatar: https://www.genome.gov/themes/custom/nhgri/assets/global/NHGRI-logo.svg
url: https://www.genome.gov/Funded-Programs-Projects/Computational-Genomics-and-Data-Science-Program/Genomic-Analysis-Visualization-Informatics-Lab-space-AnVIL
funder: true
funding_id: U24HG010263

ai4life:
name: AI4Life
github: false
@@ -1793,4 +1844,4 @@ ai4life:
funding_system: cordis
funding_statement: |
AI4Life has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement number 101057970.

2 changes: 1 addition & 1 deletion _plugins/jekyll-duration.rb
@@ -29,7 +29,7 @@
hour = 'hour'
hours = 'hours'
minutes = 'minutes'
if @context.registers[:page]&.key?('lang')
if @context.registers[:page]&.key?('lang') and @context.registers[:page]['lang'] != 'en'

Check warning on line 32 in _plugins/jekyll-duration.rb

GitHub Actions / lint

[rubocop] reported by reviewdog 🐶 Use `&&` instead of `and`. Raw Output: _plugins/jekyll-duration.rb:32:50: C: Style/AndOr: Use `&&` instead of `and`.
lang = @context.registers[:page]['lang']
hour = @context.registers[:site].data['lang'][lang]['hour']
hours = @context.registers[:site].data['lang'][lang]['hours']
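
The guard changed in this hunk can be sketched in isolation. This is a minimal illustration, not the plugin's code: `page` stands in for `@context.registers[:page]`, `lang_data` stands in for `@context.registers[:site].data['lang']`, and the helper name `duration_labels`, the Spanish labels, and the `fetch` fallback are all assumptions made for the example. It uses `&&`, as the rubocop annotation suggests, and skips the translation lookup for English pages, which appears to be what the commit "ensure it does not crash if provided EN" is after.

```ruby
# Default English labels, matching the literals shown in the diff.
DEFAULT_LABELS = { 'hour' => 'hour', 'hours' => 'hours', 'minutes' => 'minutes' }.freeze

# Hypothetical helper: `page` stands in for @context.registers[:page],
# `lang_data` for @context.registers[:site].data['lang'].
def duration_labels(page, lang_data)
  # Skip the lookup for pages without a language and for explicitly English pages.
  return DEFAULT_LABELS unless page&.key?('lang') && page['lang'] != 'en'

  # Assumption: fall back to English when the language has no translation entry.
  lang_data.fetch(page['lang'], DEFAULT_LABELS)
end

# Illustrative translation data, not taken from the repository.
es = { 'hour' => 'hora', 'hours' => 'horas', 'minutes' => 'minutos' }
puts duration_labels({ 'lang' => 'es' }, { 'es' => es })['hours']
puts duration_labels({ 'lang' => 'en' }, { 'es' => es })['hours']
puts duration_labels(nil, { 'es' => es })['hours']
```

Because `&&` binds more tightly than `and`, it stays safe if the condition is later embedded in a larger expression such as an assignment.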
5 changes: 3 additions & 2 deletions bin/lint.rb
@@ -108,6 +108,7 @@
# Linting functions for the GTN
module GtnLinter
@BAD_TOOL_LINK = /{% tool (\[[^\]]*\])\(https?.*tool_id=([^)]*)\)\s*%}/i
@BAD_TOOL_LINK2 = /{% tool (\[[^\]]*\])\(https:\/\/toolshed.g2([^)]*)\)\s*%}/i

Check warning on line 111 in bin/lint.rb

GitHub Actions / lint

[rubocop] reported by reviewdog 🐶 Use `%r` around regular expression. Raw Output: bin/lint.rb:111:21: C: Style/RegexpLiteral: Use `%r` around regular expression.

def self.find_matching_texts(contents, query)
contents.map.with_index do |text, idx|
@@ -150,7 +151,7 @@
def self.link_gtn_tutorial_external(contents)
find_matching_texts(
contents,
%r{\((https?://(training.galaxyproject.org|galaxyproject.github.io)/training-material/(.*tutorial).html)\)}
%r{\((https?://(training.galaxyproject.org|galaxyproject.github.io)/training-material/[^)]*)\)}
)
.map do |idx, _text, selected|
ReviewDogEmitter.error(
@@ -349,7 +350,7 @@
end

def self.bad_tool_links(contents)
find_matching_texts(contents, @BAD_TOOL_LINK)
find_matching_texts(contents, @BAD_TOOL_LINK) + find_matching_texts(contents, @BAD_TOOL_LINK2)
.map do |idx, _text, selected|
ReviewDogEmitter.error(
path: @path,
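
To see what the patterns in this file's diff catch, here is a small standalone sketch. The three regular expressions are copied from the diff; everything else is an assumption made for illustration: the constants live at top level rather than inside `GtnLinter`, the helper `bad_tool_link?` is hypothetical, and the sample lines (including the `usegalaxy.example` host) are made up.

```ruby
# Patterns copied from the diff; plain constants are used so the sketch
# runs outside the GtnLinter module.
BAD_TOOL_LINK = /{% tool (\[[^\]]*\])\(https?.*tool_id=([^)]*)\)\s*%}/i
BAD_TOOL_LINK2 = /{% tool (\[[^\]]*\])\(https:\/\/toolshed.g2([^)]*)\)\s*%}/i
GTN_EXTERNAL_LINK =
  %r{\((https?://(training.galaxyproject.org|galaxyproject.github.io)/training-material/[^)]*)\)}

# Hypothetical helper: true when a line contains either flavour of bad tool link.
def bad_tool_link?(line)
  [BAD_TOOL_LINK, BAD_TOOL_LINK2].any? { |pattern| line.match?(pattern) }
end

# Illustrative inputs, not taken from the tutorial.
server_link   = '{% tool [FastQC](https://usegalaxy.example/root?tool_id=toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73) %}'
toolshed_link = '{% tool [FastQC](https://toolshed.g2.bx.psu.edu/view/devteam/fastqc) %}'
good_link     = '{% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73) %}'
faq_link      = 'See [the FAQ](https://training.galaxyproject.org/training-material/faqs/index.html)'

puts bad_tool_link?(server_link)           # flavour 1: server URL carrying tool_id=
puts bad_tool_link?(toolshed_link)         # flavour 2: direct toolshed URL
puts bad_tool_link?(good_link)             # bare tool ID, not flagged
puts faq_link.match?(GTN_EXTERNAL_LINK)    # non-tutorial GTN page, caught by the widened pattern

# Operator precedence: a method chained onto the right operand of `+`
# applies to that operand only.
right_only = [1] + [2].map { |n| n * 10 }  # => [1, 20]
both_sides = ([1] + [2]).map { |n| n * 10 } # => [10, 20]
puts right_only.inspect, both_sides.inspect
```

The last two lines may be worth keeping in mind when reading the changed `bad_tool_links`: as written in the diff, `.map` follows the second `find_matching_texts` call, so if both result sets are meant to be turned into `ReviewDogEmitter` errors, wrapping the sum in parentheses would be one way to do it.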
35 changes: 35 additions & 0 deletions topics/sequence-analysis/faqs/quality_score.md
@@ -0,0 +1,35 @@
---
title: Quality Scores
area: format
box_type: details
layout: faq
contributors: [bebatut, nakucher, hexylena]
---

But what does this quality score mean?

The quality score for each sequence is a string of characters, one for each base of the nucleotide sequence, used to characterize the probability of misidentification of each base. The score is encoded using the ASCII character table (with [some historical differences](https://en.wikipedia.org/wiki/FASTQ_format#Encoding)).

To save space, the sequencer records an [ASCII character](http://drive5.com/usearch/manual/quality_score.html) to represent scores 0-42. For example 10 corresponds to "+" and 40 corresponds to "I". FastQC knows how to translate this. This is often called "Phred" scoring.

![Encoding of the quality score with ASCII characters for different Phred encoding. The ascii code sequence is shown at the top with symbols for 33 to 64, upper case letters, more symbols, and then lowercase letters. Sanger maps from 33 to 73 while solexa is shifted, starting at 59 and going to 104. Illumina 1.3 starts at 54 and goes to 104, Illumina 1.5 is shifted three scores to the right but still ends at 104. Illumina 1.8+ goes back to the Sanger except one single score wider. Illumina]({{site.baseurl}}/topics/sequence-analysis/faqs/images/fastq-quality-encoding.png)

So there is an ASCII character associated with each nucleotide, representing its [Phred quality score](https://en.wikipedia.org/wiki/Phred_quality_score), the probability of an incorrect base call:

Phred Quality Score | Probability of incorrect base call | Base call accuracy
--- | --- | ---
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%
60 | 1 in 1,000,000 | 99.9999%


What does 0-42 represent? These numbers, when plugged into a formula, tell us the probability of an error for that base. This is the formula, where Q is our quality score (0-42) and P is the probability of an error:

```
Q = -10 log10(P)
```

Using this formula, we can calculate that a quality score of 40 corresponds to an error probability of only 0.0001 (1 in 10,000)!
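
The decoding and the formula can be tried out with a short script. This is a sketch under one stated assumption: the quality string uses the Phred+33 offset (Sanger / Illumina 1.8+), which is consistent with the FAQ's examples of "+" for 10 and "I" for 40. The helper names `decode_quality` and `error_probability` are made up for the example.

```ruby
# Assumption: Phred+33 (Sanger / Illumina 1.8+) encoding, consistent with
# the FAQ's examples of "+" for 10 and "I" for 40.
PHRED_OFFSET = 33

# Decode a FASTQ quality string into one integer score per base.
def decode_quality(quality_string, offset = PHRED_OFFSET)
  quality_string.each_char.map { |char| char.ord - offset }
end

# Invert Q = -10 log10(P) to get the error probability P = 10^(-Q/10).
def error_probability(score)
  10.0**(-score / 10.0)
end

# Print the score and error probability for each character.
decode_quality('+5?I').each do |q|
  puts format('Q=%<q>d  P=%<p>.5f', q: q, p: error_probability(q))
end
```

For files written with one of the older offsets shown in the encoding figure, a different `offset` argument would be needed.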
29 changes: 7 additions & 22 deletions topics/sequence-analysis/tutorials/quality-control/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,22 +112,7 @@ GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGG

It means that the fragment named `@M00970` corresponds to the DNA sequence `GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA` and this sequence has been sequenced with a quality `GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(`.

But what does this quality score mean?

The quality score for each sequence is a string of characters, one for each base of the nucleic sequence, used to characterize the probability of mis-identification of each base. The score is encoded using the ASCII character table (with [some historical differences](https://en.wikipedia.org/wiki/FASTQ_format#Encoding)):

![Encoding of the quality score with ASCII characters for different Phred encoding. The ascii code sequence is shown at the top with symbols for 33 to 64, upper case letters, more symbols, and then lowercase letters. Sanger maps from 33 to 73 while solexa is shifted, starting at 59 and going to 104. Illumina 1.3 starts at 54 and goes to 104, Illumina 1.5 is shifted three scores to the right but still ends at 104. Illumina 1.8+ goes back to the Sanger except one single score wider. Illumina](../../../sequence-analysis/images/fastq-quality-encoding.png)

So there is an ASCII character associated with each nucleotide, representing its [Phred quality score](https://en.wikipedia.org/wiki/Phred_quality_score), the probability of an incorrect base call:

Phred Quality Score | Probability of incorrect base call | Base call accuracy
--- | --- | ---
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%
60 | 1 in 1,000,000 | 99.9999%
{% snippet topics/sequence-analysis/faqs/quality_score.md %}

> <question-title></question-title>
>
@@ -168,7 +153,7 @@ Rather than looking at quality scores for each individual read, FASTQE looks at

![FASTQE before](../../images/quality-control/fastqe-mean-before.png "FASTQE mean scores")

You can see the score for each emoji [here](https://github.com/fastqe/fastqe#scale). The emojis below, with Phred scores less than 20, are the ones we hope we don't see much.
You can see the score for each [emoji in fastqe's documentation](https://github.com/fastqe/fastqe#scale). The emojis below, with Phred scores less than 20, are the ones we hope we don't see much.

Phred Quality Score | ASCII code | Emoji
--- | --- | ---
@@ -310,7 +295,7 @@ It is normal with all Illumina sequencers for the median quality score to start

When the median quality is below a Phred score of ~20, we should consider trimming away bad quality bases from the sequence. We will explain that process in the Trim and filter section.

#### Adapter Content
### Adapter Content

![Adapter Content](../../images/quality-control/adapter_content-before.png "Adapter Content")

@@ -332,13 +317,13 @@ We can run an trimming tool such as Cutadapt to remove this adapter. We will exp
> <tip-title>Take a shortcut</tip-title>
>
> The following sections go into detail about some of the other plots generated by FastQC. Note that some plots/modules may give warnings but be normal
> for the type of data you're working with, as discussed below and [here](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/).
> for the type of data you're working with, as discussed below and [in the FASTQC FAQ](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/).
> The other plots give us information to more deeply understand the quality of the data, and to see if changes could be made in the lab to get higher-quality data in the future.
> These sections are **optional**, and if you would like to skip these you can:
> - Jump straight to the [next section](#trim-and-filter---short-reads) to learn about trimming paired-end data
{: .tip}

#### Per tile sequence quality
### Per tile sequence quality

This plot enables you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell. The plot shows the deviation from the average quality for each flowcell tile. The hotter colours indicate that reads in the given tile have worse qualities for that position than reads in other tiles. With this sample, you can see that certain tiles show consistently poor quality, especially from ~100bp onwards. A good plot should be blue all over.

@@ -413,7 +398,7 @@ But there are also other situations in which an unusually-shaped distribution ma
> {: .solution }
{: .question}

#### Sequence length distribution
### Sequence length distribution

This plot shows the distribution of fragment sizes in the file which was analysed. In many cases this will produce a simple plot showing a peak only at one size, but for variable length FASTQ files this will show the relative amounts of each different size of sequence fragment. Our plot shows variable length as we trimmed the data. The biggest peak is at 296bp but there is a second large peak at ~100bp. So even though our sequences range up to 296bp in length, a lot of the good-quality sequences are shorter. This corresponds with the drop we saw in the sequence quality at ~100bp and the red stripes starting at this position in the per tile sequence quality plot.

@@ -570,7 +555,7 @@ The quality drops in the middle of these sequences. This could cause bias in dow

To accomplish this task we will use [Cutadapt](https://cutadapt.readthedocs.io/en/stable/guide.html) {% cite marcel2011cutadapt %}, a tool that enhances sequence quality by automating adapter trimming as well as quality control. We will:

- Trim low-quality bases from the ends. Quality trimming is done before any adapter trimming. We will set the quality threshold as 20, a commonly used threshold, see more [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores).
- Trim low-quality bases from the ends. Quality trimming is done before any adapter trimming. We will set the quality threshold as 20, a commonly used threshold, see more [in GATK's Phred Score FAQ](https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores).
- Trim adapter with Cutadapt. For that we need to supply the sequence of the adapter. In this sample, Nextera is the adapter that was detected. We can find the sequence of the Nextera adapter on the [Illumina website here](https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html) `CTGTCTCTTATACACATCT`. We will trim that sequence from the 3' end of the reads.
- Filter out sequences with length < 20 after trimming
