Commit 31e21a4

Merge pull request #4292 from wm75/exome-seq-modernize

Update exome-seq tutorial/WFs to latest tool versions

bgruening authored Jul 19, 2023
2 parents 7daef0c + c1a2672
Showing 16 changed files with 2,035 additions and 2,126 deletions.
94 changes: 52 additions & 42 deletions topics/variant-analysis/tutorials/exome-seq/tutorial.md
@@ -285,16 +285,19 @@ data for either analysis.
> https://zenodo.org/record/3243160/files/hg19_chr8.fa.gz
> ```
>
> Make sure you specify the datatype as `fasta` in the import dialog.
> In the upload dialog, make sure you specify:
>
> - **Type**: `fasta`
> - **Genome**: `Human Feb. 2009 (GRCh37/hg19) (hg19)`
>
> Alternatively, load the dataset from a shared data library.
>
> 10. Rename the reference genome
>
> The reference genome you have imported above came as a compressed
> file, but got unpacked by Galaxy to plain `fasta` format according to
> your datatype selection. You may now wish to remove the `.gz` suffix
> from the dataset name.
> your datatype selection. At a minimum, you may now wish to remove the
> `.gz` suffix from the dataset name to avoid confusion.
>
{: .hands_on}
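
> <comment-title>Command-line equivalent (sketch)</comment-title>
> For reference, fetching and unpacking the same file outside Galaxy would look
> roughly like this (Galaxy performs the decompression for you, based on your
> datatype selection):
>
> ```
> wget https://zenodo.org/record/3243160/files/hg19_chr8.fa.gz
> gunzip hg19_chr8.fa.gz   # leaves the plain-fasta hg19_chr8.fa
> ```
{: .comment}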

@@ -324,7 +327,7 @@ of NGS data quality control, you may want to have a look at the dedicated
tutorial on [Quality control]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}).

> <hands-on-title>Quality control of the input datasets</hands-on-title>
> 1. Run **FastQC** {% icon tool %} on each of your six fastq datasets
> 1. Run {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.74+galaxy0) %} on each of your six fastq datasets
> - {% icon param-files %} *"Short read data from your current history"*: all 6 FASTQ datasets selected with **Multiple datasets**
>
> {% snippet faqs/galaxy/tools_select_multiple_datasets.md %}
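>
> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Outside Galaxy, this step and the MultiQC aggregation below could be run
> > as (hypothetical file names):
> >
> > ```
> > fastqc father_R1.fq.gz father_R2.fq.gz   # repeat for the mother and child datasets
> > multiqc .                                # step 2: aggregate all FastQC reports
> > ```
> {: .comment}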
@@ -333,7 +336,7 @@ tutorial on [Quality control]({% link topics/sequence-analysis/tutorials/quality
> data, another one with an html report of the findings for each input
> dataset) will get added to your history.
>
> 2. Use **MultiQC** {% icon tool %} to aggregate the raw **FastQC** data of all input datasets into one report
> 2. Use {% tool [MultiQC](toolshed.g2.bx.psu.edu/repos/iuc/multiqc/multiqc/1.11+galaxy1) %} to aggregate the raw **FastQC** data of all input datasets into one report
> - In *"Results"*
> - *"Which tool was used generate logs?"*: `FastQC`
> - In *"FastQC output"*
@@ -356,7 +359,7 @@ tutorial on [Quality control]({% link topics/sequence-analysis/tutorials/quality
> > > consider trimming the 3' ends of reads (base qualities decline
> > > slightly towards the 3' ends) or to filter out the small fraction
> > > of reads with a mean base quality < 5.
> > > Feel free to run, *e.g.*, **Trimmomatic** {% icon tool %} on the
> > > Feel free to run, *e.g.*, {% tool [Trimmomatic](toolshed.g2.bx.psu.edu/repos/pjbriggs/trimmomatic/trimmomatic/0.38.1) %} on the
> > > fastq datasets if you want to, but don't expect this to have a big
> > > effect on the analysis given the high overall quality of the data
> > > of all samples.
@@ -390,7 +393,7 @@ NGS reads datasets to the human reference genome. We recommend you to follow
the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapping/tutorial.md %}), if you need a general introduction to read mapping.

> <hands-on-title>Read Mapping</hands-on-title>
> 1. **Map with BWA-MEM** {% icon tool %} to map the reads from the **father** sample to the reference genome
> 1. {% tool [Map with BWA-MEM](toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2) %} to map the reads from the **father** sample to the reference genome
> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a built-in genome index`
> - *"Using reference genome"*: `Human: hg19` (or a similarly named option)
>
@@ -446,7 +449,7 @@ the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapp
> >
> {: .warning}
>
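> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Roughly what step 1 runs under the hood (hypothetical file names):
> >
> > ```
> > bwa mem -R '@RG\tID:father\tSM:father' hg19.fa \
> >     father_R1.fq.gz father_R2.fq.gz | samtools sort -o father.bam -
> > ```
> >
> > The `-R` string sets the read group, matching the *SM* field configured above.
> {: .comment}
>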
> 2. **Map with BWA-MEM** {% icon tool %} to map the reads from the **mother** sample to the reference genome **using the same parameters as before** except
> 2. {% tool [Map with BWA-MEM](toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2) %} to map the reads from the **mother** sample to the reference genome **using the same parameters as before** except
>
> - *"Single or Paired-end reads"*: `Paired`
> - {% icon param-file %} *"Select first set of reads"*: the
@@ -460,7 +463,7 @@ the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapp
> - *"Auto-assign"*: `No`
> - *"Read group sample name (SM)"*: `mother`
>
> 3. **Map with BWA-MEM** {% icon tool %} to map the reads from the **child** sample to the reference genome **using the same parameters as before** except
> 3. {% tool [Map with BWA-MEM](toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2) %} to map the reads from the **child** sample to the reference genome **using the same parameters as before** except
>
> - *"Single or Paired-end reads"*: `Paired`
> - {% icon param-file %} *"Select first set of reads"*: the
@@ -479,8 +482,16 @@ the dedicated [Mapping tutorial]({% link topics/sequence-analysis/tutorials/mapp
# Mapped reads postprocessing

At this point in the analysis you should have obtained three mapped reads
datasets in `bam` format. Ideally, these would carry `#father`, `#mother` and
`#child` tags for quick identification of the samples they provide data for.
datasets in `bam` format. Each of these datasets should:

- have its *database* set to the key `hg19`

Please correct any missing (`?`) or wrong keys now!

{% snippet faqs/galaxy/datasets_change_dbkey.md dbkey="Human Feb. 2009 (GRCh37/hg19) (hg19)" %}

- ideally, carry one of the `#father`, `#mother` or `#child` tags
for quick identification of the samples they provide data for.

In principle, you could use these datasets directly for variant calling, and in
many cases, including this one, this would be sufficient to identify the
@@ -532,17 +543,15 @@ To produce new filtered BAM datasets with only mapped reads the mate of which is

> <hands-on-title>Filtering for read pair mapping status</hands-on-title>
>
> 1. **Filter SAM or BAM, output SAM or BAM** {% icon tool %} with the following
> 1. {% tool [Samtools view](toolshed.g2.bx.psu.edu/repos/iuc/samtools_view/samtools_view/1.15.1+galaxy0) %} with the following
> parameters (leaving non-mentioned ones at their defaults):
> - {% icon param-files %} *"SAM or BAM file to filter"*: all 3 mapped reads
> - {% icon param-files %} *"SAM/BAM/CRAM data set"*: all 3 mapped reads
> datasets of the family trio, outputs of **Map with BWA-MEM**
> {% icon tool %}
> - *"Filter on bitwise flag"*: `yes`
> - *"Only output alignments with all of these flag bits set"*:
> Do not select anything here!
> - *"Skip alignments with any of these flag bits set"*:
> - {% icon param-check %} *"The read is unmapped"*
> - {% icon param-check %} *"The mate is unmapped"*
> - *"What would you like to look at?"*: `A filtered/subsampled selection of reads`
> - In *"Configure filters"*:
> - *"Exclude reads with any of the following flags set"*:
> `Read is unmapped` **and** `Mate is unmapped`
>
{: .hands_on}
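
> <comment-title>Command-line equivalent (sketch)</comment-title>
> The same filter outside Galaxy is a `samtools` one-liner (hypothetical file
> names; `-F 12` excludes the combined *read unmapped* (4) and *mate unmapped*
> (8) flag bits):
>
> ```
> samtools view -b -F 12 -o father.mapped_pairs.bam father.bam
> ```
{: .comment}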

@@ -582,15 +591,14 @@ This will result in three new datasets, one for each sample in the analysis.
>
> > <hands-on-title></hands-on-title>
> >
> > 1. **Filter SAM or BAM, output SAM or BAM** {% icon tool %}:
> > - {% icon param-files %} *"SAM or BAM file to filter"*: all 3 mapped
> > reads datasets of the family trio, outputs of **Map with BWA-MEM**
> > 1. {% tool [Samtools view](toolshed.g2.bx.psu.edu/repos/iuc/samtools_view/samtools_view/1.15.1+galaxy0) %}:
> > - {% icon param-files %} *"SAM/BAM/CRAM data set"*: all 3 mapped reads
> > datasets of the family trio, outputs of **Map with BWA-MEM**
> > {% icon tool %}
> > - *"Filter on bitwise flag"*: `yes`
> > - *"Only output alignments with all of these flag bits set"*:
> > - {% icon param-check %} *"Read is mapped in a proper pair"*
> > - *"Skip alignments with any of these flag bits set"*:
> > - {% icon param-check %} *"The read is unmapped"*
> > - *"What would you like to look at?"*: `A filtered/subsampled selection of reads`
> > - In *"Configure filters"*:
> > - *"Require that these flags are set"*: `Read is mapped in a proper pair`
> > - *"Exclude reads with any of the following flags set"*: `Read is unmapped`
> >
> {: .hands_on}
>
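> As a command-line sketch, this stricter filter combines a required and an
> excluded flag (hypothetical file names):
>
> ```
> samtools view -b -f 2 -F 4 -o father.proper_pairs.bam father.bam
> ```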
@@ -600,9 +608,9 @@ This will result in three new datasets, one for each sample in the analysis.

> <hands-on-title>Remove duplicates</hands-on-title>
>
> 1. **RmDup** {% icon tool %} with the following parameters:
> 1. {% tool [RmDup](toolshed.g2.bx.psu.edu/repos/devteam/samtools_rmdup/samtools_rmdup/2.0.1) %} with the following parameters:
> - {% icon param-files %} *"BAM file"*: all 3 filtered reads datasets; the
> outputs of **Filter SAM or BAM**
> outputs of **Samtools view**
> - *"Is this paired-end or single end data"*: `BAM is paired-end`
> - *"Treat as single-end"*: `No`
>
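> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Duplicate removal wraps `samtools rmdup` (hypothetical file names):
> >
> > ```
> > samtools rmdup father.filtered.bam father.rmdup.bam
> > ```
> {: .comment}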
@@ -635,7 +643,7 @@ alignment.

> <hands-on-title>Generating FreeBayes calls</hands-on-title>
>
> 1. Run **FreeBayes** {% icon tool %}:
> 1. Run {% tool [FreeBayes](toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes/1.3.6+galaxy0) %}:
> - *"Choose the source for the reference genome"*: `Locally cached`
> - *"Run in batch mode?"*: `Merge output VCFs`
> - {% icon param-files %} *"BAM dataset(s)"*: all three mapped reads
@@ -647,8 +655,9 @@ alignment.
> > into your history instead:
> > - *"Choose the source for the reference genome"*: `History`
> > - *"Run in batch mode?"*: `Merge output VCFs`
> > - {% icon param-files %} *"BAM dataset(s)"*: all three mapped
> > reads datasets of the family trio
> > - {% icon param-files %} *"BAM or CRAM dataset(s)"*: all three
> > mapped and fully post-processed reads datasets of the family trio;
> > outputs of **RmDup**
> > - {% icon param-file %} *"Use the following dataset as the
> > reference sequence"*: your imported `hg19` fasta dataset.
> {: .comment}
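>
> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > Roughly what this step runs under the hood (hypothetical file names):
> >
> > ```
> > freebayes -f hg19.fa father.rmdup.bam mother.rmdup.bam proband.rmdup.bam > trio.vcf
> > ```
> >
> > Passing several BAMs yields a single multi-sample VCF, matching the
> > *Merge output VCFs* batch mode selected above.
> {: .comment}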
@@ -758,7 +767,7 @@ standards in some other, less important respects is **bcftools norm**.

> <hands-on-title>Post-processing FreeBayes calls</hands-on-title>
>
> 1. **bcftools norm** {% icon tool %} with the following parameters:
> 1. {% tool [bcftools norm](toolshed.g2.bx.psu.edu/repos/iuc/bcftools_norm/bcftools_norm/1.15.1+galaxy3) %} with the following parameters:
> - *"VCF/BCF Data"*: the VCF output of **FreeBayes** {% icon tool %}
> - *"Choose the source for the reference genome"*: `Use a built-in genome`
> - *"Reference genome"*: `Human: hg19` (or a similarly named option)
@@ -772,6 +781,7 @@ standards in some other, less important respects is **bcftools norm**.
> {: .comment}
> - *"When any REF allele does not match the reference genome base"*:
> `ignore the problem (-w)`
> - *"Atomize"*: `No`
> - *"Left-align and normalize indels?"*: `Yes`
> - *"Perform deduplication for the folowing types of variant records"*:
> `do not deduplicate any records`
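>
> > <comment-title>Command-line equivalent (sketch)</comment-title>
> > The settings above correspond roughly to (hypothetical file names):
> >
> > ```
> > bcftools norm -f hg19.fa --check-ref w -O v -o normalized.vcf trio.vcf
> > ```
> {: .comment}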
@@ -865,11 +875,12 @@ which is rather simple to generate manually.
> > <comment-title>Shortcut</comment-title>
> > You can skip this step if the Galaxy server you are working on offers
> > `Homo sapiens: hg19` as a locally installed snpEff database. You can
> > check the **Genome source** select list of the **SnpEff eff**
> > {% icon tool%} tool to see if this is the case.
> > check the **Genome source** select list of the
> > {% tool [SnpEff eff](toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy2) %}
> > tool to see if this is the case.
> {: .comment}
>
> Use **SnpEff Download** {% icon tool %} to download genome annotation
> Use {% tool [SnpEff Download](toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_download/4.3+T.galaxy2) %} to download genome annotation
> database `hg19`.
> 2. Create a PED-formatted pedigree dataset describing our single-family sample trio:
>
@@ -902,8 +913,8 @@ which is rather simple to generate manually.
>
> > <details-title>More on PED files</details-title>
> >
> > The PED format is explained in the help section of **GEMINI load**
> > {% icon tool %}.
> > The PED format is explained in the help section of
> > {% tool [GEMINI load](toolshed.g2.bx.psu.edu/repos/iuc/gemini_load/gemini_load/0.20.1+galaxy2) %}.
> >
> > Take a moment and try to understand the information that is encoded in
> > the PED dataset we are using here.
@@ -917,7 +928,7 @@ knows how to parse SnpEff-annotated VCFs, while GEMINI output cannot be used
with SnpEff.
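
> <comment-title>Command-line equivalent (sketch)</comment-title>
> For reference, the download step above and the annotation step below
> correspond roughly to (hypothetical file names):
>
> ```
> snpEff download hg19                            # fetch the annotation database once
> snpEff eff hg19 normalized.vcf > annotated.vcf  # annotate the normalized calls
> ```
{: .comment}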

> <hands-on-title>Adding annotations with SnpEff</hands-on-title>
> 1. **SnpEff eff** {% icon tool %}
> 1. {% tool [SnpEff eff](toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy2) %}
> - {% icon param-file %} *"Sequence changes (SNPs, MNPs, InDels)"*: the
> output of **bcftools norm** {% icon tool %}
> - *"Input format"*: `VCF`
@@ -987,7 +998,7 @@ the database.
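
> <comment-title>Command-line equivalent (sketch)</comment-title>
> Outside Galaxy, building the database would look roughly like this
> (hypothetical file names):
>
> ```
> gemini load -v annotated.vcf -t snpEff -p pedigree.ped trio.db
> ```
{: .comment}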

> <hands-on-title>Creating a GEMINI database from a variants dataset</hands-on-title>
>
> 1. **GEMINI load** {% icon tool %} with
> 1. {% tool [GEMINI load](toolshed.g2.bx.psu.edu/repos/iuc/gemini_load/gemini_load/0.20.1+galaxy2) %} with
> - {% icon param-file %} *"VCF dataset to be loaded in the GEMINI database"*:
> the output of **SnpEff eff** {% icon tool %}
> - *"The variants in this input are"*: `annotated with snpEff`
@@ -1124,7 +1135,7 @@ variants. Feel free to run analogous queries for other types of variants that
you think could plausibly be causative for the child's disease.
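
> <comment-title>Command-line equivalent (sketch)</comment-title>
> The inheritance-pattern tool wraps GEMINI's built-in pattern queries; an
> autosomal-recessive scan, for example, would look roughly like this
> (hypothetical file name and column choice):
>
> ```
> gemini autosomal_recessive --columns "chrom, start, ref, alt, impact, gene" trio.db
> ```
{: .comment}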

> <hands-on-title>Finding and reporting plausible causative variants</hands-on-title>
> 1. **GEMINI inheritance pattern** {% icon tool %}
> 1. {% tool [GEMINI inheritance pattern](toolshed.g2.bx.psu.edu/repos/iuc/gemini_inheritance/gemini_inheritance/0.20.1) %}
> - *"GEMINI database"*: the GEMINI database of annotated variants; output
> of **GEMINI load** {% icon tool %}
> - *"Your assumption about the inheritance pattern of the phenotype of interest"*:
@@ -1195,7 +1206,6 @@ you think could plausibly be causative for the child's disease.

# Conclusion


It was not hard to find the most likely causative mutation for the child's
disease (you did find it, right?).

@@ -0,0 +1,2 @@
max_aaf_all chrom start ref alt impact gene clinvar_sig clinvar_disease_name clinvar_gene_phenotype rs_ids variant_id family_id family_members family_genotypes samples family_count
3.24886289799e-05 chr8 86385979 G A stop_gained CA2 None None carbonic_anhydrase_ii_variant|osteopetrosis_with_renal_tubular_acidosis None 1 FAM proband(proband;affected;male),mother(mother;unaffected;female),father(father;unaffected;male) A/A,G/A,G/A proband 1
@@ -0,0 +1,4 @@
#family_id name paternal_id maternal_id sex phenotype
FAM father 0 0 1 1
FAM mother 0 0 2 1
FAM proband father mother 1 2
@@ -0,0 +1,21 @@
- doc: Test exome-seq training short version
  job:
    Father data:
      class: File
      path: test-data/father.bam
      filetype: bam
    Mother data:
      class: File
      path: test-data/mother.bam
      filetype: bam
    Proband data:
      class: File
      path: test-data/proband.bam
      filetype: bam
    PEDigree data:
      class: File
      path: test-data/pedigree.tsv
      filetype: tabular
  outputs:
    candidate_mutations:
      path: test-data/autosomal_recessive_pattern_report.tsv