Commit 31e21a4

Merge pull request #4292 from wm75/exome-seq-modernize

Update exome-seq tutorial/WFs to latest tool versions

bgruening authored Jul 19, 2023
2 parents 7daef0c + c1a2672
Showing 16 changed files with 2,035 additions and 2,126 deletions.
94 changes: 52 additions & 42 deletions topics/variant-analysis/tutorials/exome-seq/tutorial.md
@@ -285,16 +285,19 @@ data for either analysis.
> https://zenodo.org/record/3243160/files/hg19_chr8.fa.gz
> ```
>
> Make sure you specify the datatype as `fasta` in the import dialog.
> In the upload dialog, make sure you specify:
>
> - **Type**: `fasta`
> - **Genome**: `Human Feb. 2009 (GRCh37/hg19) (hg19)`
>
> Alternatively, load the dataset from a shared data library.
>
> 10. Rename the reference genome
>
> The reference genome you have imported above came as a compressed
> file, but got unpacked by Galaxy to plain `fasta` format according to
> your datatype selection. You may now wish to remove the `.gz` suffix
> from the dataset name.
> your datatype selection. At a minimum, you may now wish to remove the
> `.gz` suffix from the dataset name to avoid confusion.
>
{: .hands_on}
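
> <comment-title>Command-line equivalent (sketch)</comment-title>
> For reference, fetching and unpacking the same file outside Galaxy would look
> roughly like this (Galaxy performs the decompression for you, based on your
> datatype selection):
>
> ```
> wget https://zenodo.org/record/3243160/files/hg19_chr8.fa.gz
> gunzip hg19_chr8.fa.gz   # leaves the plain-fasta hg19_chr8.fa
> ```
{: .comment}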

@@ -324,7 +327,7 @@ of NGS data quality control, you may want to have a look at the dedicated
tutorial on [Quality control]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}).

> <hands-on-title>Quality control of the input datasets</hands-on-title>
> 1. Run **FastQC** {% icon tool %} on each of your six fastq datasets
> 1. Run {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.74+galaxy0) %} on each of your six fastq datasets
> - {% icon param-files %} *"Short read data from your current history"*: all 6 FASTQ datasets selected with **Multiple datasets**
>
> {% snippet faqs/galaxy/tools_select_multiple_datasets.md %}
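>
> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Outside Galaxy, this step and the MultiQC aggregation below could be run
> > as (hypothetical file names):
> >
> > ```
> > fastqc father_R1.fq.gz father_R2.fq.gz   # repeat for the mother and child datasets
> > multiqc .                                # step 2: aggregate all FastQC reports
> > ```
> {: .comment}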
@@ -333,7 +336,7 @@ tutorial on [Quality control]({% link topics/sequence-analysis/tutorials/quality
> data, another one with an html report of the findings for each input
> dataset) will get added to your history.
>
> 2. Use **MultiQC** {% icon tool %} to aggregate the raw **FastQC** data of all input datasets into one report
> 2. Use {% tool [MultiQC](toolshed.g2.bx.psu.edu/repos/iuc/multiqc/multiqc/1.11+galaxy1) %} to aggregate the raw **FastQC** data of all input datasets into one report
> - In *"Results"*
> - *"Which tool was used generate logs?"*: `FastQC`
> - In *"FastQC output"*
@@ -356,7 +359,7 @@ tutorial on [Quality control]({% link topics/sequence-analysis/tutorials/quality
> > > consider trimming the 3' ends of reads (base qualities decline
> > > slightly towards the 3' ends) or to filter out the small fraction
> > > of reads with a mean base quality < 5.
> > > Feel free to run, *e.g.*, **Trimmomatic** {% icon tool %} on the
> > > Feel free to run, *e.g.*, {% tool [Trimmomatic](toolshed.g2.bx.psu.edu/repos/pjbriggs/trimmomatic/trimmomatic/0.38.1) %} on the
> > > fastq datasets if you want to, but don't expect this to have a big
> > > effect on the analysis given the high overall quality of the data
> > > of all samples.
@@ -390,7 +393,7 @@ NGS reads datasets to the human reference genome. We recommend you to follow
the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapping/tutorial.md %}), if you need a general introduction to read mapping.

> <hands-on-title>Read Mapping</hands-on-title>
> 1. **Map with BWA-MEM** {% icon tool %} to map the reads from the **father** sample to the reference genome
> 1. {% tool [Map with BWA-MEM](toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2) %} to map the reads from the **father** sample to the reference genome
> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a built-in genome index`
> - *"Using reference genome"*: `Human: hg19` (or a similarly named option)
>
@@ -446,7 +449,7 @@ the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapp
> >
> {: .warning}
>
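> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Roughly what step 1 runs under the hood (hypothetical file names):
> >
> > ```
> > bwa mem -R '@RG\tID:father\tSM:father' hg19.fa \
> >     father_R1.fq.gz father_R2.fq.gz | samtools sort -o father.bam -
> > ```
> >
> > The `-R` string sets the read group, matching the *SM* field configured above.
> {: .comment}
>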
> 2. **Map with BWA-MEM** {% icon tool %} to map the reads from the **mother** sample to the reference genome **using the same parameters as before** except
> 2. {% tool [Map with BWA-MEM](toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2) %} to map the reads from the **mother** sample to the reference genome **using the same parameters as before** except
>
> - *"Single or Paired-end reads"*: `Paired`
> - {% icon param-file %} *"Select first set of reads"*: the
@@ -460,7 +463,7 @@ the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapp
> - *"Auto-assign"*: `No`
> - *"Read group sample name (SM)"*: `mother`
>
> 3. **Map with BWA-MEM** {% icon tool %} to map the reads from the **child** sample to the reference genome **using the same parameters as before** except
> 3. {% tool [Map with BWA-MEM](toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2) %} to map the reads from the **child** sample to the reference genome **using the same parameters as before** except
>
> - *"Single or Paired-end reads"*: `Paired`
> - {% icon param-file %} *"Select first set of reads"*: the
@@ -479,8 +482,16 @@ the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapp
# Mapped reads postprocessing

At this point in the analysis you should have obtained three mapped reads
datasets in `bam` format. Ideally, these would carry `#father`, `#mother` and
`#child` tags for quick identification of the samples they provide data for.
datasets in `bam` format. Each of these datasets should:

- have its *database* set to the key `hg19`

Please correct any missing (`?`) or wrong keys now!

{% snippet faqs/galaxy/datasets_change_dbkey.md dbkey="Human Feb. 2009 (GRCh37/hg19) (hg19)" %}

- ideally, carry one of the `#father`, `#mother` or `#child` tags
for quick identification of the samples they provide data for.

In principle, you could use these datasets directly for variant calling, and in
many cases, including this one, this would be sufficient to identify the
@@ -532,17 +543,15 @@ To produce new filtered BAM datasets with only mapped reads the mate of which is

> <hands-on-title>Filtering for read pair mapping status</hands-on-title>
>
> 1. **Filter SAM or BAM, output SAM or BAM** {% icon tool %} with the following
> 1. {% tool [Samtools view](toolshed.g2.bx.psu.edu/repos/iuc/samtools_view/samtools_view/1.15.1+galaxy0) %} with the following
> parameters (leaving non-mentioned ones at their defaults):
> - {% icon param-files %} *"SAM or BAM file to filter"*: all 3 mapped reads
> - {% icon param-files %} *"SAM/BAM/CRAM data set"*: all 3 mapped reads
> datasets of the family trio, outputs of **Map with BWA-MEM**
> {% icon tool %}
> - *"Filter on bitwise flag"*: `yes`
> - *"Only output alignments with all of these flag bits set"*:
> Do not select anything here!
> - *"Skip alignments with any of these flag bits set"*:
> - {% icon param-check %} *"The read is unmapped"*
> - {% icon param-check %} *"The mate is unmapped"*
> - *"What would you like to look at?"*: `A filtered/subsampled selection of reads`
> - In *"Configure filters"*:
> - *"Exclude reads with any of the following flags set"*:
> `Read is unmapped` **and** `Mate is unmapped`
>
{: .hands_on}
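
> <comment-title>Command-line equivalent (sketch)</comment-title>
> The same filter outside Galaxy is a `samtools` one-liner (hypothetical file
> names; `-F 12` excludes the combined *read unmapped* (4) and *mate unmapped*
> (8) flag bits):
>
> ```
> samtools view -b -F 12 -o father.mapped_pairs.bam father.bam
> ```
{: .comment}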

@@ -582,15 +591,14 @@ This will result in three new datasets, one for each sample in the analysis.
>
> > <hands-on-title></hands-on-title>
> >
> > 1. **Filter SAM or BAM, output SAM or BAM** {% icon tool %}:
> > - {% icon param-files %} *"SAM or BAM file to filter"*: all 3 mapped
> > reads datasets of the family trio, outputs of **Map with BWA-MEM**
> > 1. {% tool [Samtools view](toolshed.g2.bx.psu.edu/repos/iuc/samtools_view/samtools_view/1.15.1+galaxy0) %}:
> > - {% icon param-files %} *"SAM/BAM/CRAM data set"*: all 3 mapped reads
> > datasets of the family trio, outputs of **Map with BWA-MEM**
> > {% icon tool %}
> > - *"Filter on bitwise flag"*: `yes`
> > - *"Only output alignments with all of these flag bits set"*:
> > - {% icon param-check %} *"Read is mapped in a proper pair"*
> > - *"Skip alignments with any of these flag bits set"*:
> > - {% icon param-check %} *"The read is unmapped"*
> > - *"What would you like to look at?"*: `A filtered/subsampled selection of reads`
> > - In *"Configure filters"*:
> > - *"Require that these flags are set"*: `Read is mapped in a proper pair`
> > - *"Exclude reads with any of the following flags set"*: `Read is unmapped`
> >
> {: .hands_on}
>
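> As a command-line sketch, this stricter filter combines a required and an
> excluded flag (hypothetical file names):
>
> ```
> samtools view -b -f 2 -F 4 -o father.proper_pairs.bam father.bam
> ```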
@@ -600,9 +608,9 @@ This will result in three new datasets, one for each sample in the analysis.

> <hands-on-title>Remove duplicates</hands-on-title>
>
> 1. **RmDup** {% icon tool %} with the following parameters:
> 1. {% tool [RmDup](toolshed.g2.bx.psu.edu/repos/devteam/samtools_rmdup/samtools_rmdup/2.0.1) %} with the following parameters:
> - {% icon param-files %} *"BAM file"*: all 3 filtered reads datasets; the
> outputs of **Filter SAM or BAM**
> outputs of **Samtools view**
> - *"Is this paired-end or single end data"*: `BAM is paired-end`
> - *"Treat as single-end"*: `No`
>
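> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Duplicate removal wraps `samtools rmdup` (hypothetical file names):
> >
> > ```
> > samtools rmdup father.filtered.bam father.rmdup.bam
> > ```
> {: .comment}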
@@ -635,7 +643,7 @@ alignment.

> <hands-on-title>Generating FreeBayes calls</hands-on-title>
>
> 1. Run **FreeBayes** {% icon tool %}:
> 1. Run {% tool [FreeBayes](toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes/1.3.6+galaxy0) %}:
> - *"Choose the source for the reference genome"*: `Locally cached`
> - *"Run in batch mode?"*: `Merge output VCFs`
> - {% icon param-files %} *"BAM dataset(s)"*: all three mapped reads
@@ -647,8 +655,9 @@ alignment.
> > into your history instead:
> > - *"Choose the source for the reference genome"*: `History`
> > - *"Run in batch mode?"*: `Merge output VCFs`
> > - {% icon param-files %} *"BAM dataset(s)"*: all three mapped
> > reads datasets of the family trio
> > - {% icon param-files %} *"BAM or CRAM dataset(s)"*: all three
> > mapped and fully post-processed reads datasets of the family trio;
> > outputs of **RmDup**
> > - {% icon param-file %} *"Use the following dataset as the
> > reference sequence"*: your imported `hg19` fasta dataset.
> {: .comment}
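>
> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Roughly what this step runs under the hood (hypothetical file names):
> >
> > ```
> > freebayes -f hg19.fa father.rmdup.bam mother.rmdup.bam proband.rmdup.bam > trio.vcf
> > ```
> >
> > Passing several BAMs yields a single multi-sample VCF, matching the
> > *Merge output VCFs* batch mode selected above.
> {: .comment}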
@@ -758,7 +767,7 @@ standards in some other, less important respects is **bcftools norm**.

> <hands-on-title>Post-processing FreeBayes calls</hands-on-title>
>
> 1. **bcftools norm** {% icon tool %} with the following parameters:
> 1. {% tool [bcftools norm](toolshed.g2.bx.psu.edu/repos/iuc/bcftools_norm/bcftools_norm/1.15.1+galaxy3) %} with the following parameters:
> - *"VCF/BCF Data"*: the VCF output of **FreeBayes** {% icon tool %}
> - *"Choose the source for the reference genome"*: `Use a built-in genome`
> - *"Reference genome"*: `Human: hg19` (or a similarly named option)
@@ -772,6 +781,7 @@ standards in some other, less important respects is **bcftools norm**.
> {: .comment}
> - *"When any REF allele does not match the reference genome base"*:
> `ignore the problem (-w)`
> - *"Atomize"*: `No`
> - *"Left-align and normalize indels?"*: `Yes`
> - *"Perform deduplication for the folowing types of variant records"*:
> `do not deduplicate any records`
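>
> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > The settings above correspond roughly to (hypothetical file names):
> >
> > ```
> > bcftools norm -f hg19.fa --check-ref w -O v -o normalized.vcf trio.vcf
> > ```
> {: .comment}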
@@ -865,11 +875,12 @@ which is rather simple to generate manually.
> > <comment-title>Shortcut</comment-title>
> > You can skip this step if the Galaxy server you are working on offers
> > `Homo sapiens: hg19` as a locally installed snpEff database. You can
> > check the **Genome source** select list of the **SnpEff eff**
> > {% icon tool%} tool to see if this is the case.
> > check the **Genome source** select list of the
> > {% tool [SnpEff eff](toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy2) %}
> > tool to see if this is the case.
> {: .comment}
>
> Use **SnpEff Download** {% icon tool %} to download genome annotation
> Use {% tool [SnpEff Download](toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_download/4.3+T.galaxy2) %} to download genome annotation
> database `hg19`.
> 2. Create a PED-formatted pedigree dataset describing our single-family sample trio:
>
@@ -902,8 +913,8 @@ which is rather simple to generate manually.
>
> > <details-title>More on PED files</details-title>
> >
> > The PED format is explained in the help section of **GEMINI load**
> > {% icon tool %}.
> > The PED format is explained in the help section of
> > {% tool [GEMINI load](toolshed.g2.bx.psu.edu/repos/iuc/gemini_load/gemini_load/0.20.1+galaxy2) %}.
> >
> > Take a moment and try to understand the information that is encoded in
> > the PED dataset we are using here.
@@ -917,7 +928,7 @@ knows how to parse SnpEff-annotated VCFs, while GEMINI output cannot be used
with SnpEff.
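
> <comment-title>Command-line equivalent (sketch)</comment-title>
> For reference, the download step above and the annotation step below
> correspond roughly to (hypothetical file names):
>
> ```
> snpEff download hg19                            # fetch the annotation database once
> snpEff eff hg19 normalized.vcf > annotated.vcf  # annotate the normalized calls
> ```
{: .comment}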

> <hands-on-title>Adding annotations with SnpEff</hands-on-title>
> 1. **SnpEff eff** {% icon tool %}
> 1. {% tool [SnpEff eff](toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy2) %}
> - {% icon param-file %} *"Sequence changes (SNPs, MNPs, InDels)"*: the
> output of **bcftools norm** {% icon tool %}
> - *"Input format"*: `VCF`
@@ -987,7 +998,7 @@ the database.
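
> <comment-title>Command-line equivalent (sketch)</comment-title>
> Outside Galaxy, building the database would look roughly like this
> (hypothetical file names):
>
> ```
> gemini load -v annotated.vcf -t snpEff -p pedigree.ped trio.db
> ```
{: .comment}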

> <hands-on-title>Creating a GEMINI database from a variants dataset</hands-on-title>
>
> 1. **GEMINI load** {% icon tool %} with
> 1. {% tool [GEMINI load](toolshed.g2.bx.psu.edu/repos/iuc/gemini_load/gemini_load/0.20.1+galaxy2) %} with
> - {% icon param-file %} *"VCF dataset to be loaded in the GEMINI database"*:
> the output of **SnpEff eff** {% icon tool %}
> - *"The variants in this input are"*: `annotated with snpEff`
@@ -1124,7 +1135,7 @@ variants. Feel free to run analogous queries for other types of variants that
you think could plausibly be causative for the child's disease.
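
> <comment-title>Command-line equivalent (sketch)</comment-title>
> The inheritance-pattern tool wraps GEMINI's built-in pattern queries; an
> autosomal-recessive scan, for example, would look roughly like this
> (hypothetical file name and column choice):
>
> ```
> gemini autosomal_recessive --columns "chrom, start, ref, alt, impact, gene" trio.db
> ```
{: .comment}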

> <hands-on-title>Finding and reporting plausible causative variants</hands-on-title>
> 1. **GEMINI inheritance pattern** {% icon tool %}
> 1. {% tool [GEMINI inheritance pattern](toolshed.g2.bx.psu.edu/repos/iuc/gemini_inheritance/gemini_inheritance/0.20.1) %}
> - *"GEMINI database"*: the GEMINI database of annotated variants; output
> of **GEMINI load** {% icon tool %}
> - *"Your assumption about the inheritance pattern of the phenotype of interest"*:
@@ -1195,7 +1206,6 @@ you think could plausibly be causative for the child's disease.

# Conclusion


It was not hard to find the most likely causative mutation for the child's
disease (you did find it, right?).

@@ -0,0 +1,2 @@
max_aaf_all chrom start ref alt impact gene clinvar_sig clinvar_disease_name clinvar_gene_phenotype rs_ids variant_id family_id family_members family_genotypes samples family_count
3.24886289799e-05 chr8 86385979 G A stop_gained CA2 None None carbonic_anhydrase_ii_variant|osteopetrosis_with_renal_tubular_acidosis None 1 FAM proband(proband;affected;male),mother(mother;unaffected;female),father(father;unaffected;male) A/A,G/A,G/A proband 1
@@ -0,0 +1,4 @@
#family_id name paternal_id maternal_id sex phenotype
FAM father 0 0 1 1
FAM mother 0 0 2 1
FAM proband father mother 1 2
@@ -0,0 +1,21 @@
- doc: Test exome-seq training short version
  job:
    Father data:
      class: File
      path: test-data/father.bam
      filetype: bam
    Mother data:
      class: File
      path: test-data/mother.bam
      filetype: bam
    Proband data:
      class: File
      path: test-data/proband.bam
      filetype: bam
    PEDigree data:
      class: File
      path: test-data/pedigree.tsv
      filetype: tabular
  outputs:
    candidate_mutations:
      path: test-data/autosomal_recessive_pattern_report.tsv