4.0.5.0

droazen released this 07 Jun 22:56
f4225b8

Highlights of this release include the ability to emit MNPs in Mutect2 and HaplotypeCaller via a new --max-mnp-distance argument, much better active region detection for low allele fractions in Mutect2, new priors for variant sites and homRef blocks in HaplotypeCaller, a new tool, FilterAlignmentArtifacts, to filter false-positive alignment artifacts in the Mutect2 pipeline, performance improvements to CNNScoreVariants and Funcotator, and a new --sites-only-vcf-output GATK engine argument to suppress genotypes when writing VCFs.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • Mutect2

    • Made Mutect2 active region determination much better for low allele fractions (#4832)
      • In particular, this makes Mutect2 vastly better for mitochondrial and cfDNA calling
    • Mutect2 can now emit MNPs according to an adjustable distance threshold specified via --max-mnp-distance (#4650)
    • Tweaked Mutect2 read position filter to handle non-biological (e.g. FFPE) insertions better (#4851)
    • Fixed Mutect2 bug where triallelic normal artifacts were sometimes hidden from filtering engine (#4809)
    • Mutect2 STR filter now also looks at insertions (#4845)
      • This lowers the indel false positive rate dramatically.
    • Mutect2 WDL:
      • now outputs MAF segmentation (#4837)
      • now runs FilterAlignmentArtifacts (#4848)
      • now uses lenient validation in SortSam (#4844)
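
The --max-mnp-distance threshold described above can be pictured with a small Python sketch. This is only an illustration, not GATK's implementation: the function name `group_into_mnps` is hypothetical, and the sketch assumes that "distance" means the difference between the positions of neighbouring substitutions on the same haplotype, with a value of 0 disabling merging.

```python
def group_into_mnps(positions, max_mnp_distance):
    """Group substitution positions into candidate MNPs.

    Consecutive positions that are at most `max_mnp_distance` apart end up
    in the same group. A distance of 0 is assumed to disable merging.
    """
    groups = []
    for pos in sorted(positions):
        close_enough = (
            groups
            and max_mnp_distance > 0
            and pos - groups[-1][-1] <= max_mnp_distance
        )
        if close_enough:
            groups[-1].append(pos)   # extend the current candidate MNP
        else:
            groups.append([pos])     # start a new group
    return groups


if __name__ == "__main__":
    # Hypothetical positions of phased substitutions on one haplotype.
    print(group_into_mnps([100, 101, 105, 106, 107, 120], max_mnp_distance=1))
```

Groups with more than one position would correspond to MNPs; single-position groups stay as SNVs.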
  • Added new tool FilterAlignmentArtifacts (#4698)

    • Filters false positive alignment artifacts (that is, apparent variants due to reads being mapped to the wrong genomic locus) from a VCF callset by checking variant-supporting reads and their mates.
    • Because it considers the realignment of both the read and its mate, it keeps many variants, especially in low-complexity regions, from being wrongly filtered as mapping errors.
  • HaplotypeCaller

    • HaplotypeCaller can now emit MNPs according to an adjustable distance threshold specified via --max-mnp-distance (#4650)
    • New HaplotypeCaller priors for variant sites and homRef blocks (#4793)
      • Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
      • Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
      • As a side effect of this change, CalculateGenotypePosteriors now supports indels.
    • GCS/NIO output support for the -bamout argument (#4721)
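
As a rough sketch of how the new HaplotypeCaller arguments listed above might be combined, the Python helper below only builds an argument list and does not run anything. The helper name `haplotype_caller_command`, the file names, and the example values are all placeholders; the exact value semantics of each argument should be checked against the tool documentation.

```python
def haplotype_caller_command(reference, bam, output,
                             population_callset=None,
                             num_reference_samples_if_no_call=None,
                             max_mnp_distance=None):
    """Build a `gatk HaplotypeCaller` argument list (hypothetical helper)."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", output]
    if population_callset is not None:
        # External panel of variants that informs the genotype priors.
        cmd += ["--population-callset", population_callset]
    if num_reference_samples_if_no_call is not None:
        cmd += ["--num-reference-samples-if-no-call",
                str(num_reference_samples_if_no_call)]
    if max_mnp_distance is not None:
        cmd += ["--max-mnp-distance", str(max_mnp_distance)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths and values, for illustration only.
    print(" ".join(haplotype_caller_command(
        "ref.fasta", "sample.bam", "out.vcf.gz",
        population_callset="panel.vcf.gz",
        num_reference_samples_if_no_call=100,
        max_mnp_distance=1)))
```

The resulting list could be passed to `subprocess.run` on a machine where the `gatk` launcher is installed.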
  • -new-qual in HaplotypeCaller/Mutect2/GenotypeGVCFs no longer counts spanning deletions as support for variant qual (#4801)

  • CNNScoreVariants

    • Performance improvements to the prep of the input tensors in the 2D model (#4735)
    • Bug fix to prevent a crash on the ends of the mitochondrial contig (#4751)
  • GATK Engine

    • Added a new traversal type TwoPassVariantWalker that does two passes over its input variants (#4744)
    • Enable the -L argument to read feature files (such as .bed or .vcf files) from non-local Paths, including GCS buckets (#4854)
    • Added --sites-only-vcf-output argument to the GATK engine to suppress genotype fields when writing VCFs (#4764)
    • Tools that use annotations now use the barclay annotation plugin (#4674)
    • Added new ReadQueryNameComparator (#4731)
    • Automatically schedule temporary resource files for delete on exit (#4616)
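
To make the effect of --sites-only-vcf-output concrete: a sites-only VCF keeps the eight fixed columns (CHROM through INFO) and drops the FORMAT and per-sample genotype columns. The Python sketch below mimics that on plain text lines. It is only an illustration of the resulting format, not how the engine implements the option, and the function name `to_sites_only` is hypothetical.

```python
def to_sites_only(vcf_line):
    """Return a VCF line with FORMAT and sample columns removed.

    Meta-information lines (starting with '##') are returned unchanged;
    the '#CHROM' header line and records are cut to the 8 fixed columns.
    """
    line = vcf_line.rstrip("\n")
    if line.startswith("##"):
        return line
    fields = line.split("\t")
    return "\t".join(fields[:8])


if __name__ == "__main__":
    # Made-up record with a single sample column.
    record = "chr1\t100\t.\tA\tT\t50\tPASS\tDP=10\tGT:DP\t0/1:10"
    print(to_sites_only(record))
```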
  • Spark tools

    • Added support for g.vcf.gz files in Spark (#4274, #4463)
    • Spark tools can now write SAM files (#4295, #4471)
    • Added a --output-shard-tmp-dir argument to specify the parts directory for un-sharded BAM writing (#4666)
  • MarkDuplicatesSpark

    • Fixed MarkDuplicatesSpark so it handles supplementary reads with unmapped mates properly (#4785)
    • Added a distinction between PCR orientation and Optical Duplicates orientation in MarkDuplicatesSpark (#4752)
    • Fixed serialization crash in MarkDuplicatesSpark (#4778)
    • Fixed queryname partitioning bug where asking for queryname sort would result in reads with the same name being split between partitions (#4765)
    • Changed MarkDuplicatesSpark to sort non-queryname sorted bams before processing to ensure marking is consistent across shards (#4732)
    • Renamed some MarkDuplicatesSpark arguments to follow the "kebab-case" convention (#4715)
    • MarkDuplicatesSpark now uses the Picard OpticalDuplicatesFinder directly (#4750)
    • MarkDuplicatesSpark now uses Picard metrics code directly (#4779)
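
The queryname partitioning fix above concerns reads with the same name landing in different partitions. One way to picture the invariant is the single-machine Python sketch below, which moves each proposed split point forward until the read name changes. This is not the actual MarkDuplicatesSpark code (a distributed implementation has to move reads between partitions); the function names and the list-based representation are assumptions for illustration.

```python
def adjust_boundaries(read_names, boundaries):
    """Shift sorted split indices so no read name spans two partitions.

    `read_names` must be queryname-sorted; `boundaries` are indices at
    which a new partition would start.
    """
    adjusted = []
    for b in boundaries:
        # Move forward while the split would separate identical names.
        while 0 < b < len(read_names) and read_names[b] == read_names[b - 1]:
            b += 1
        # Drop boundaries that fell off the end or collapsed onto another.
        if 0 < b < len(read_names) and (not adjusted or b > adjusted[-1]):
            adjusted.append(b)
    return adjusted


def split_at(read_names, boundaries):
    """Split the list of names into partitions at the given indices."""
    parts, start = [], 0
    for b in boundaries:
        parts.append(read_names[start:b])
        start = b
    parts.append(read_names[start:])
    return parts


if __name__ == "__main__":
    names = ["r1", "r1", "r2", "r2", "r2", "r3", "r3"]
    print(split_at(names, adjust_boundaries(names, [1, 4])))
```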
  • BwaSpark: disable sequence dictionary validation when aligning reads (#4131, #4308)

  • Funcotator

    • Major performance improvements due to added caching and other optimizations (#4740)
    • Various fixes (#4783) (#4817) (#4770)
      • Sanitize special characters when outputting VCF so that VCF validation passes
      • Fixed cases where the ordering specified in the header did not match the variants, and where hg19/b37 VCF data sources were processed inconsistently, causing many missed annotations.
      • Added Funcotator tests for Clinvar and Gencode v28 in hg38, and mixed chr/no-chr GENCODE.
      • Eased restrictions so that Gencode v28 is recognized as a valid GTF. Future versions of Gencode will not fail based on the version number alone; a warning will be emitted instead.
      • Refined handling of transcripts with missing sequence info.
      • Refactored UTR VariantClassification handling.
      • Added a warning when a transcript in the UTR has no sequence info (this now matches the behavior in protein-coding regions).
      • Added tests to prevent regression on data source date comparison bug.
      • Fixed DNA Repair Genes getter script.
      • Fixed an issue in COSMIC to make it robust to bad COSMIC data.
      • Gencode no longer crashes when given an indel that starts just before an exon.
      • Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
      • Modified several methods to allow for negative start positions in preparation for allowing indels that start outside exons.
      • Fixed an issue in 5' UTR processing that would cause variant alleles with length > 1 to throw an exception (fixes issue #4712).
      • Fixed a bug in the version detection for Funcotator data sources that would prevent newer data source versions from being detected as compatible (date comparison error).
    • Gencode data sources now have names preserved from config files. (#4823)
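
The delimiter fix in SimpleKeyXsvFuncotationFactory mentioned above touches a common pitfall: a delimiter such as a pipe is a metacharacter when it is passed to a regex-based split. Funcotator is written in Java, where `Pattern.quote` is the usual way to treat such a string literally; the sketch below shows the same idea in Python with `re.escape`. The function name and sample data are hypothetical, and this is not the Funcotator code.

```python
import re


def split_on_delimiter(line, delimiter):
    """Split `line` on a literal delimiter, even if it is a regex metacharacter."""
    # re.escape makes characters such as '|' or '.' match literally.
    return re.split(re.escape(delimiter), line)


if __name__ == "__main__":
    print(split_on_delimiter("TP53|BRCA1|EGFR", "|"))
```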
  • GCNV kernel tunings (#4720)

    • Fixed a minor issue in sampling error estimation that could lead to NaN (as a result of division by zero)
    • Introduced separate internal and external admixing rates
    • Introduced two-stage inference for cohort denoising and calling
    • Capped phred-scaled qualities to maximum values permitted by machine precision in order to avoid NaNs and overflows.
    • Took a first step toward tracking and logging parameters during inference, starting with the ELBO history.
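
The capping of phred-scaled qualities noted above can be sketched as follows. A phred-scaled quality is -10·log10 of an error probability, which becomes infinite as the probability reaches zero. The notes do not state the exact cap GCNV uses, so the choice below (derived from the smallest positive normalized double) and the function name `capped_phred` are assumptions.

```python
import math
import sys

# Assumed cap: the phred value of the smallest positive normalized double.
MAX_PHRED = -10.0 * math.log10(sys.float_info.min)


def capped_phred(error_probability):
    """Phred-scale an error probability, capped to stay finite."""
    if error_probability <= 0.0:
        # log10(0) is undefined; return the cap instead of inf or NaN.
        return MAX_PHRED
    return min(-10.0 * math.log10(error_probability), MAX_PHRED)


if __name__ == "__main__":
    for p in (0.1, 1e-300, 0.0):
        print(p, capped_phred(p))
```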
  • Validation of sequence dictionaries from multiple BAMs now emits a warning instead of throwing an exception in CNV workflows. (#4758)

  • SV tools

    • Tweak BWA to allow "gappier" alignments in local assemblies (#4708)
    • Added a new experimental tool named CpxVariantReInterpreterSpark to extract barebone-annotated simple variants from a VCF containing complex variants produced by the GATK-SV discovery pipeline (#4602)
    • Fix "UnhandledCaseSeen" error in StructuralVariationDiscoveryPipelineSpark (#4677)
  • Added new SingleSequenceReferenceAligner class to align against an on-the-fly single contig reference using Bwa-Mem (#4780)

  • Updates to the conda environment for Python-based tools (#4749)

    • Fix #4741, where newer versions of conda appear to treat relative references in the environment yml as being relative to the yml file instead of relative to the cwd (based on observation).
    • Add a second conda yml file (gatkcondaenv.intel.yml) for environments that use Intel hardware acceleration and the Intel Tensorflow package (based on #4735).
    • Added a gradle task (condaEnvironmentDefinition) to generate the conda yml files from a single template to ensure that all the environment definitions remain in sync. This task also generates the Python package archive.
    • Added a gradle task (localDevCondaEnv) to create or update a local (non-Intel) conda environment. This is a shortcut for use during development when you're iteratively changing/testing Python code and want to update the conda env.
  • Added a new WEX test bam to src/test/resources/large, with a companion target interval list (#4756)

  • Add slightly modified version of GATK3 github issue template (#4796)

  • Updated htsjdk to 2.15.1 (#4830)