Releases: broadinstitute/gatk

4.0.9.0

20 Sep 17:11
6e352bb

Highlighting this release are some important fixes and improvements to the HaplotypeCaller, in particular support for genotyping spanning deletions and a fix to the reference confidence calculation around indels. This release also brings support for "Requester Pays" GCS (Google Cloud Storage) buckets, fasta.gz support to the -R/--reference argument, a port of LeftAlignAndTrimVariants from GATK3, a new tool FuncotatorDataSourceDownloader to download Funcotator datasources, and bug fixes to Mutect2, VariantRecalibrator, and SelectVariants.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

  • HaplotypeCaller

    • Fixed the reference confidence calculation upstream of indels (#5172)
      • Improve hom-ref GQs near indels in GVCFs. Also consider bases on either side of indels informative if local assembly has been performed.
      • The previous behavior generated some PL=0,0,0 no-calls because the CIGAR of reads containing indels wasn't taken into account when determining which reads were informative for the indel reference confidence model. The local realignment wasn't being used inside the active region previously either, which has been fixed. A related change considers bases on either side of indels informative if local assembly has been performed (but not during active region detection). Both result in far fewer 0,0,0 calls. Unfortunately there are still some 0,0,X homRef calls related to #5171.
    • Make HaplotypeCaller genotype and output spanning deletions (#4963)
      • Modifies HaplotypeCaller so that it can output and genotype spanning deletion alleles represented by the * allele.
      • Fixes #2960
      • Previously, the output of HaplotypeCaller would not include spanning deletion alleles when run in single sample VCF mode or in genotype given alleles mode, even when that genotype would be more appropriate. In the joint calling workflow GenotypeGVCFs adds genotypes for spanning deletions, although the input likelihoods will not be broken out to specifically account for spanning deletion alleles.
    • Simplify HaplotypeBAMWriter code. #944 (#5122)
  • Mutect2

    • Mutect2 now emits DP values in the FORMAT field (#5185)
    • Add --get-af-from-ad option to recalculate the allele fraction based on AD instead of the Bayesian estimate (#5118)
      • Recommended for mitochondrial applications
    • Fixed a StringIndexOutOfBoundsException crash in the ReferenceBases annotation when a variant is within 10 base pairs of the end of a chromosome (#5151)
    • Restore base quality filter code that got removed unintentionally in #4895. (#5123)
    • Remove extra space in the MutectVersion header line (previously was Mutect Version) (#5184)
  • Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument (#5140)

  • Added fasta.gz support to the -R/--reference argument in walker tools (#5120)

  • Added GCS/NIO support to the --tmp-dir argument (#4469)

  • Upgraded google-cloud-java to the official 0.62.0 release, and moved off of our custom fork of the library. This release includes the retry for transient 502 errors that we added to our fork in GATK 4.0.8.0 (#5194) (#5135)

  • Ported the LeftAlignAndTrimVariants tool from GATK3 (#5144)

  • VariantRecalibrator: the serialized model now sets annotation order (#3655)

    • This addresses a problem where serialized GMMs for VQSR assumed that the annotation order would be the same between the commands that generated them and the commands that used them. VQSR no longer depends on the commandline order of the annotations.
  • SelectVariants: Drop sites with the * allele as the only ALT when running with --exclude-non-variants (#5129)

  • Funcotator:

    • Created a new FuncotatorDataSourceDownloader tool to download data sources. (#5150)
    • Add an experimental FilterFuncotations tool (#4991)
    • Updated COSMIC to annotate protein change strings with their counts. (#5181)
    • Fix INDEL start/stop position and alleles for VCF gencode output. (#5131)
    • Get datasource version from a manifest file instead of the README (#5149)
    • Extract a new FuncotatorEngine to make it easier to write additional tools in the future that leverage Funcotator's annotation engine (#5134)
    • Handle character encoding error cases. (#5124)
  • CNNScoreVariants:

    • Add WDLs and JSONs to run CNNScoreVariants in a single-sample workflow (#4774)
    • Added --python-profile argument to enable Python profiling. (#4953)
  • CNV tools:

    • Produce an IGV-compatible seg file alongside the copy ratio calls in CallCopyRatioSegments (#5115)
    • Added optional mappability and segmental-duplication annotation to AnnotateIntervals. (#5162)
    • Improvements and refactoring of the Nucleotide class (#4846)
  • SV tools:

    • Bug fix to read name mangling in ExtractOriginalAlignmentRecordsByNameSpark (#5107)
    • Added an InsertSizeDistribution class to represent expected insert-size distribution (normal and log-normal distributed) parameterized by insert size mean and stddev (#4827)
    • Added documentation clarification and additional validation to SVInterval (#5157)
    • Test and utils clean up (#5116)
  • MarkDuplicatesSpark:

    • Switched MarkDuplicatesSpark tile-parsing code to use shorts in order to match Picard (#5165)
    • Added better error messages around missing read groups in MarkDuplicatesSpark (#5177)
  • Clone read base qualities rather than reference them directly in the read clipper code to prevent unsafe array operations (#4926)

  • Fix three bugs in the AlignmentUtils class (#3494)

    • The treatment of D-over-D in function applyCigarToCigar() was backward.
    • In function createReadAlignedToRef() the read start position passed to the leftAlignIndel() call was incorrect if the haplotype has an indel relative to reference.
    • When the leftAlignIndel() call drops any leading D operator in the result cigar, the read start position needs to be adjusted accordingly.
  • Test infrastructure improvements:

    • Split out gatk-testUtils as a separate artifact in our build system (#5112)
    • Skip push builds if there is a pull request (cuts down on total number of travis builds by about half) (#5156)
    • We now share the test settings between the main build and the docker tests (#5155)
  • Documented use of --temp-dir with GenomicsDBImport. (#5047)

  • Deleted obsolete experimental tool MarkDuplicatesGATK in favor of MarkDuplicatesSpark (#5166)

  • Deleted obsolete experimental tool BaseRecalibratorSparkSharded (#5192)

  • Upgraded htsjdk to version 2.16.1 (#5168)

  • Upgraded Picard to version 2.18.13. (#5173)
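
To illustrate the third AlignmentUtils fix listed above: a deletion consumes reference bases but no read bases, so when a leading D operator is dropped from a CIGAR, the alignment start has to move right by the deleted length. The Python below is a minimal sketch of that adjustment only. The function name and the (length, operator) tuple representation are hypothetical and are not GATK's API.

```python
from typing import List, Tuple

Cigar = List[Tuple[int, str]]  # e.g. [(2, "D"), (10, "M")] stands for 2D10M


def drop_leading_deletions(start: int, cigar: Cigar) -> Tuple[int, Cigar]:
    """Remove leading D operators and shift the alignment start to match.

    Hypothetical helper for illustration; not part of GATK.
    """
    i = 0
    while i < len(cigar) and cigar[i][1] == "D":
        start += cigar[i][0]  # D consumes reference bases only
        i += 1
    return start, cigar[i:]


new_start, new_cigar = drop_leading_deletions(100, [(2, "D"), (10, "M")])
print(new_start, new_cigar)  # 102 [(10, 'M')]
```

A CIGAR without a leading deletion is returned unchanged, with the same start position.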

4.0.8.1

16 Aug 19:47

This is a small bug fix release to fix an issue with unpaired reads in Mutect2, as well as small fixes and improvements to Funcotator, FilterVariantTranches, and MarkDuplicatesSpark.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • Mutect2: Fixed a "Cannot get mate information for an unpaired read" error that could occur with certain datasets containing unpaired reads that pass all the M2 read filters and show evidence of a SNV (#5121)

  • Funcotator:

    • Fixes to the splice site logic. (#5106)
      • Funcotator now ignores leading indel bases when checking if variants are within the splice site boundaries (e.g., if a leading base in an indel, which is preserved between the reference and alternate alleles, is within the splice site boundary but the bases that have been changed are NOT, then the variant is now correctly labeled as NOT a splice site).
    • Populate the DB SNP validation status field properly (#5046)
      • Funcotator will now populate the MAF DB SNP Validation status field with proper values (e.g. "by1000genomes") instead of a boolean value (e.g. "TRUE")
      • Funcotator now handles multiple records in a VCF funcotation factory that have the same pos, ref, and alt combination, even if equivalent and not exact matches.
  • FilterVariantTranches:

    • Add an --invalidate-previous-filters argument to remove old filters left over from previous runs (off by default) (#5042)
    • Add --snp-tranche and --indel-tranche arguments to replace the previous --tranche argument (#5042)
  • Updated MarkDuplicatesSpark scoring and comparison code to reflect changes in Picard (#5023)

    • Updated the scoring code to no longer take into account the unclipped start position of mismatching reads. Also changed the score to be a double packed short value in order to better reflect Picard scoring code.
  • Other Changes:

    • Added new IOUtils.isHDF5File() utility method (#5082)
    • Add jitpack support for building GATK snapshots (#5056)
    • Fixed broken link in Travis to docker test failure reports (#5108)
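
The splice-site fix described above can be sketched as follows: strip the leading bases shared by the reference and alternate alleles, then test only the bases that actually changed against the splice-site window. This is a simplified Python illustration, not Funcotator's implementation. The function names, the 1-based inclusive coordinates, and the example window are assumptions.

```python
from typing import Tuple


def changed_span(pos: int, ref: str, alt: str) -> Tuple[int, int]:
    """Return the 1-based inclusive reference span of the changed bases,
    ignoring leading bases that ref and alt have in common."""
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1
    start = pos + shared
    # A pure insertion changes no reference base; anchor it at `start`.
    end = max(start, pos + len(ref) - 1)
    return start, end


def overlaps_splice_site(pos: int, ref: str, alt: str,
                         window_start: int, window_end: int) -> bool:
    """True if the changed bases intersect the (hypothetical) splice window."""
    start, end = changed_span(pos, ref, alt)
    return start <= window_end and end >= window_start


# Deletion ATT -> A at position 100: only positions 101-102 change, so a
# window covering 99-100 is not hit even though the leading base lies in it.
print(overlaps_splice_site(100, "ATT", "A", 99, 100))
# A SNP at position 100 does fall inside the same window.
print(overlaps_splice_site(100, "A", "G", 99, 100))
```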

4.0.8.0

14 Aug 20:22

This release features some significant changes to Mutect2 that improve both performance and correctness, as well as a bug fix to GenomicsDBImport for large interval lists.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • Mutect2

    • Handle overlapping mates in M2 active region detection, causing fewer false active regions (#5078)
      • Makes Mutect2 ~25% faster in many cases with no loss of accuracy!
    • Filter M2 calls that are near other filtered calls on the same haplotype (#5092)
      • A very effective new filter that significantly reduces false positives
    • New Orientation Bias Filter (#4895)
      • New, improved orientation bias model, without which the M2 pipeline is not viable for NovaSeq data.
    • Changed the default AF slightly for M2 tumor-only mode (just a small tweak) (#5067)
    • Optimize some Mutect-related tools (#5073)
      • Everything that inherits from AbstractConcordanceWalker (this includes the Concordance tool and MergeMutect2CallsWithMC3) is now much faster on the cloud
    • Fixed edge case for M2 palindrome transformer (#5080)
      • Fixed an edge case involving reads assigned huge fragment lengths
    • Allowing counts for supporting alt reads in the validation normal. (#5062)
      • Added useful information suggesting possible normal artifacts in somatic validation tool.
    • M2 wdl doesn't emit unfiltered vcf, which is redundant (#5076)
  • GenomicsDBImport

    • Fix for issue where we could run out of file handles when working with large interval lists (#5105)
    • Display warning when using large interval lists with GenomicsDBImport (#5102)
  • Updated MarkDuplicatesSpark tie-breaking rules to reflect changes in picard (#5011)

  • Added the ability for CompareDuplicatesSpark to output mismatching reads (#4894)

  • Updated our google-cloud-java fork to 0.20.5-alpha-GCS-RETRY-FIX (#5099)

    • We now retry on 502 and UnknownHostException errors when using NIO
  • SV Tools:

    • Various improvements (#4996)
      • output a single VCF for new interpretation tool
      • bring MAX_ALIGN_LENGTH and MAPPING_QUALITIES annotations from CPX variants to re-interpreted simple variants
      • add new CLI argument and filter assembly based variants based on annotation MAPPING_QUALITIES, MAX_ALIGN_LENGTH
      • filter out variants of size < 50
    • Bug fix for the extreme edge case where after alignments de-overlapping, an alignment block is only 1 base long (#4962)
    • Re-enabled checking of variant info fields against the header when writing SV VCFs (the check was turned off temporarily a long time ago and was overlooked after the implementation stabilized) (#5084)

4.0.7.0

30 Jul 18:47
b6a630a

Some important fixes in this release include a new version of GenomicsDB with a fix for the stack overflow seen when using large interval lists, and an updated Docker image with a fix for the missing R/ggplot2 dependencies.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/.

Docker

  • Restore missing R/ggplot2 dependencies on the Docker image. #5040

GenomicsDB

  • Fix GenomicsDBImport stack overflow when using large number of intervals #4997

Mutect2

  • Don't use very short stubs of clipped reads for genotyping #5057
  • Add maxRetries to runtime in M2 WDLs #5049
  • Fix an edge case bug in PalindromeArtifactReadTransformer #5038
  • Make orientation bias filtering default to true #5019
  • Added option for ValidateBasicSomaticShortMutations to output a vcf #4999
  • Add Mutect2 PalindromeArtifactReadTransformer to hard clip inverted tandem repeats insertion artifacts #4998
  • Make MAF the output of Funcotator in the M2 WDL, plus a multiple-transcript fix. #4941

CNV Tools

  • Exposed ability to blacklist intervals in CNV WDLs. #5027
  • Added output of IGV-compatible .seg files to ModelSegments. #5048

Structural Variants

  • Add BreakpointEvidence filter based on classifier #4769
  • Address more edge cases in assembly alignments #5044
  • Refactor AssemblyContigAlignmentsConfigPicker #4971
  • Fix an edge case in assembly contig alignment picker where no good mappings to canonical mappings exist #5005
  • Trim down ref bases for CPX variants #4970

Funcotator

  • VCF Funcotation Factory will recognize equivalent alleles (even when not exact) #4977

Other

  • Include docs for new variant quality score model #5008
  • Engine changes related to migration of GATK3 VariantEval to GATK4 #4495
  • Fix position annotations to use position in original, not clipped, read #4956
  • Add cmd line to VCF generated by GATKSparkTool #4981

4.0.6.0

06 Jul 22:02

Highlights of this release include:

  • A new version of GenomicsDB that brings many long-requested features such as support for multiple intervals in GenomicsDBImport
  • A significantly (~33%) smaller GATK docker image
  • An important bug fix for the -new-qual option in GenotypeGVCFs/HaplotypeCaller/Mutect2

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • GenomicsDB: new version with many long-awaited features and bug fixes (#4645)

    • Multi-interval support in GenomicsDBImport (#3269)
      • Now you can specify multiple -L intervals when importing variants into GenomicsDB using GenomicsDBImport, instead of having to specify one interval per invocation.
    • New protobuf-based API to allow configuration without editing JSON files
    • Support for sites-only queries
    • Support for returning the genotype (GT) field in queries
    • Fixed bug where records with spanning deletion alleles could cause reads from GenomicsDB to fail (#4716)
  • Reduced the size of the GATK docker image by approximately 33%, from ~5.3 GB to ~3.5 GB (#4955)

  • Fixed a regression in the -new-qual option for GenotypeGVCFs/HaplotypeCaller/Mutect2 that was introduced in GATK 4.0.5.0 (#4980)

    • There was a precision issue in the AlleleFrequencyCalculator when running with -new-qual that could cause a crash at certain sites (specifically, sites with spanning deletions and highly unlikely alt alleles).
  • HaplotypeCaller: don't count qual = 0 sites as polymorphic for GVCF mode (#4967)

  • ValidateBasicSomaticShortMutations: added a new optional argument to produce summary table output (#4982)

  • ExtractOriginalAlignmentRecordsByNameSpark: added a new optional argument to invert the logic in the read-name filtering (#4944)

  • Separated out the "variant calling" integration tests from the rest of the integration tests to speed up overall test suite runtime in travis (#4984)
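
As a rough sketch of the multi-interval support described above, the Python below assembles a single GenomicsDBImport command line with several -L arguments instead of one invocation per interval. The -L argument comes from these notes. The -V and --genomicsdb-workspace-path arguments, the file names, and the interval names are assumptions not taken from these notes, so check them against the tool documentation for your GATK version. The command is only printed, not executed.

```python
import shlex
from typing import List


def genomicsdb_import_command(gvcfs: List[str], workspace: str,
                              intervals: List[str]) -> List[str]:
    """Build an argument list for one GenomicsDBImport run (illustrative only)."""
    cmd = ["gatk", "GenomicsDBImport", "--genomicsdb-workspace-path", workspace]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]  # one -V per input GVCF (assumed argument)
    for interval in intervals:
        cmd += ["-L", interval]  # one -L per interval, all in the same run
    return cmd


cmd = genomicsdb_import_command(
    ["sample1.g.vcf.gz", "sample2.g.vcf.gz"],  # made-up file names
    "my_workspace",                            # made-up workspace path
    ["chr20", "chr21"],                        # made-up intervals
)
print(" ".join(shlex.quote(part) for part in cmd))
```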

4.0.5.2

29 Jun 21:01

Highlights of this release include major Funcotator performance improvements on hg19/b37 inputs, a newly rewritten Java version of FilterVariantTranches, HaplotypeCaller bamout improvements, and improved Python integration by eliminating timeouts.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/.

Funcotator Improvements

  • Improve handling of hg19/B37 references (#4586).
    • Fixed performance bug involving excessive cache misses when querying datasources, resulting in major
      performance improvements when running on HG19/B37 data (performance increased by approx. 30x with v1.4.20180615 of
      the standard Funcotator data sources) (#4586).
    • Automatically detect when B37 data is run against an hg19 data source and convert contig names to be hg19-compliant.
    • Assumes all data sources for the hg19 reference are compliant with hg19 contig names. User-created data
      sources will have to honor this.
    • Perform additional validation on input data to ensure a given reference FASTA has a sequence
      dictionary that is a superset of the given VCF. This is a more stringent check than is automatically
      performed by the GATK. Can be disabled with the --disable-sequence-dictionary-validation flag.
    • Released new version of datasources to go with this release (1.4.20180615), necessary because the data
      sources needed to be made consistent with hg19 (before they were a mix of hg19 and b37 contig names).
    • Updated the minimum required data source version to be the latest release.
    • Updated the getDbSNP.sh and createSqliteCosmicDb.sh data source scripts to preprocess those data sources
      to have hg19-compliant contigs names.
    • Removed the --allow-hg19-gencode-b37-contig-matching flag.
    • Removed the --allow-hg19-gencode-b37-contig-matching-override flag.
  • User defined transcripts were being used as a filter rather than a priority order. The filtering step has been eliminated. Fixes #4918 (#4931)
  • Added custom MAF fields to MafOutputRenderer (#4917)
  • LocatableXsv data sources now produce at most 1 funcotation per allele pair. (#4936)
  • LocatableXsv data sources now provide the correct number of funcotations (#4915)
  • Preserve VCF fields in MAF output (#4872)
  • Fixing error when spanning deletions overlap coding regions (#4881)
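
The stricter sequence-dictionary validation described above amounts to a superset check: every contig in the VCF must also appear in the reference FASTA's dictionary. Below is a minimal Python sketch of that idea, assuming each dictionary is reduced to a mapping from contig name to length (a simplification; the real check may compare more fields). The function name and the example values are hypothetical.

```python
from typing import Dict, List


def missing_from_reference(reference: Dict[str, int],
                           vcf: Dict[str, int]) -> List[str]:
    """Return VCF contigs that are absent from the reference dictionary
    or present with a different length. An empty list means the
    reference dictionary is a superset of the VCF dictionary."""
    return [name for name, length in vcf.items()
            if reference.get(name) != length]


# Example values only: an hg19-style reference and a VCF that mixes in a
# b37-style mitochondrial contig name.
reference_dict = {"chr1": 249250621, "chr2": 243199373, "chrM": 16571}
vcf_dict = {"chr1": 249250621, "MT": 16569}

print(missing_from_reference(reference_dict, vcf_dict))  # ['MT']
```

A mismatch like the one in the example is the kind of hg19 versus b37 naming inconsistency that the contig-name conversion in this release is meant to handle.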

HaplotypeCaller/Mutect2

  • Improvements to FilterMutectCalls. Eliminates about 3% of all false positives in DREAM while reducing sensitivity by about 0.1%
  • Fix many questionable -bamout alignments where, because of a bad choice of Smith-Waterman parameters,
    deletions were preferred over single-base substitutions. The result is many fewer spurious indels
    in the -bamout output. (#4858)
  • Introduced new SmithWaterman parameters affecting realignment of the reads to their best haplotype. This
    also changes some annotations that depend on the alignment, such as BaseQualityRankSum and ReadPositionRankSum.
    The changes are slight and make things more correct.
  • Modify the behavior of (BaseGraph) getNextReferenceVertex for non-ref paths (#4889)

FilterVariantTranches

  • Rewrite VCF Tranche filtering in java, with tests (#4800)

Engine

  • StreamingPythonExecutor no longer uses timeouts or relies on prompt synchronization. (#4757)
  • Allow concordance tools (AbstractConcordanceWalker) to use NIO for truth call set (#4905)
  • Add pre- and post- apply variant transformer to VariantWalkerBase

MarkDuplicatesSpark

  • Fixed a missing special case in MarkDuplicates ReadsKey code to better match current picard results (#4899)
  • Reworked the keys for MarkDuplicatesSpark to be sufficient for grouping on their own. (#4878)
  • Improve error message for MarkDuplicates duplicate read name issues (#4879)

Structural Variants

  • Add tests for AssemblyContigWithFineTunedAlignments (#4961)
  • Fix no index output for assembly bam file (#4945)
  • Overhaul tests on assembly-based non-complex breakpoint and type inference code (#4835)
  • Simple fix to remove trailing slash in GCS_SAVE_PATH to avoid double slashes in GCS_RESULTS_DIR (#4873)
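
The trailing-slash fix above addresses a common path-joining pitfall. The Python below is a small sketch of the idea, not the actual change: the parameter name echoes GCS_SAVE_PATH from the note, while the function name, bucket path, and run name are made up for illustration.

```python
def results_dir(gcs_save_path: str, run_name: str) -> str:
    """Join a save path and a run name without producing a double slash."""
    # Strip any trailing slash before appending the separator ourselves.
    return gcs_save_path.rstrip("/") + "/" + run_name


print(results_dir("gs://my-bucket/sv-output/", "run1"))
print(results_dir("gs://my-bucket/sv-output", "run1"))
```

Both calls print the same path, so the caller no longer has to care whether the configured save path ends in a slash.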

Misc:

  • Upgrading picard 2.18.2 -> 2.18.7 (#4949)
  • Update htsjdk 2.15.1 -> 2.16.0 (#4914)
  • Added support to PrintReadsSpark for non-coordinate sorted bams (#4853)
  • Adding --sort-order option to SortSamSpark (#4545)
  • Increased boot disk size on GATK tasks in M2 wdl to accommodate 4.0.5.0 docker (#4877)

4.0.5.1

11 Jun 17:26

This is primarily a bug fix release to fix a crash in the help system (#4875). The issue was that tools that use annotations (which includes Mutect2, HaplotypeCaller, GenotypeGVCFs, CombineGVCFs, and VariantAnnotator) would crash when trying to print their help text. This could be triggered by running with an explicit --help, or by typing an invalid tool command line.

This release also brings in some improvements to Funcotator, including a new mode to output annotations for all transcripts.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • Fix crash when displaying help text for tools that use annotations (#4876)
  • Funcotator improvements (#4838) (#4870)
    • Added ALL mode for transcript selection (--transcript-selection-mode ALL) which will output full annotation fields for all transcripts
    • IGR annotations are no longer reported if there are any transcripts that would result in a non-IGR annotation for a given variant
    • VCF Datasources now have to match both the alt and ref alleles to be added as annotations to a variant
    • Added the --allow-hg19-gencode-b37-contig-matching-override flag to allow for even more permissive matching of contig names between B37 and HG19 references (primarily designed to be used in development)
    • Updated the experimental Funcotator WDL to work properly in cromwell
    • Refactored internals of Funcotator to use FuncotationMap objects to store annotations
    • Additional tests to ensure VCF and MAF protein change strings are equivalent
    • Other minor internal bugfixes for testing
  • Fix to the Oncotator command line in the Mutect2 WDL (#4862)
  • Removed unsupported Mutect2 WDLs (these now live on Firecloud) (#4836)

4.0.5.0

07 Jun 22:56
f4225b8

Highlights of this release include the ability to emit MNPs in Mutect2 and HaplotypeCaller via a new --max-mnp-distance argument, much better active region detection for low allele fractions in Mutect2, new priors for variant sites and homRef blocks in HaplotypeCaller, a new tool FilterAlignmentArtifacts to filter false positive alignment artifacts in the Mutect2 pipeline, performance improvements to CNNScoreVariants and Funcotator, and a new --sites-only-vcf-output GATK engine argument to suppress genotypes when writing VCFs.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • Mutect2

    • Made Mutect2 active region determination much better for low allele fractions (#4832)
      • In particular, this makes Mutect2 vastly better for mitochondrial and cfDNA calling
    • Mutect2 can now emit MNPs according to adjustable distance threshold specified via --max-mnp-distance (#4650)
    • Tweaked Mutect2 read position filter to handle non-biological (e.g. FFPE) insertions better (#4851)
    • Fixed Mutect2 bug where triallelic normal artifacts were sometimes hidden from filtering engine (#4809)
    • Mutect2 STR filter now also looks at insertions (#4845)
      • This lowers the indel false positive rate dramatically.
    • Mutect2 WDL:
      • now outputs MAF segmentation (#4837)
      • now runs FilterAlignmentArtifacts (#4848)
      • now uses lenient validation in SortSam (#4844)
  • Added new tool FilterAlignmentArtifacts (#4698)

    • Filters false positive alignment artifacts (that is, apparent variants due to reads being mapped to the wrong genomic locus) from a VCF callset by checking variant-supporting reads and their mates.
    • By considering the realignment of the read and its mate, it saves a lot of variants, especially in low-complexity regions, from being filtered as mapping errors.
  • HaplotypeCaller

    • HaplotypeCaller can now emit MNPs according to adjustable distance threshold specified via --max-mnp-distance (#4650)
    • New HaplotypeCaller priors for variant sites and homRef blocks (#4793)
      • Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
      • Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
      • As a side effect of this change, CalculateGenotypePosteriors now supports indels.
    • GCS/NIO output support for the -bamout argument (#4721)
  • -new-qual in HaplotypeCaller/Mutect2/GenotypeGVCFs no longer counts spanning deletions as support for variant qual (#4801)

  • CNNScoreVariants

    • Performance improvements to the prep of the input tensors in the 2D model (#4735)
    • Bug fix to prevent a crash on the ends of the mitochondrial contig (#4751)
  • GATK Engine

    • Added a new traversal type TwoPassVariantWalker that does two passes over its input variants (#4744)
    • Enable the -L argument to read feature files (such as .bed or .vcf files) from non-local Paths, including GCS buckets (#4854)
    • Added --sites-only-vcf-output argument to the GATK engine to suppress genotype fields when writing VCFs (#4764)
    • Tools that use annotations now use the barclay annotation plugin (#4674)
    • Added new ReadQueryNameComparator (#4731)
    • Automatically schedule temporary resource files for delete on exit (#4616)
  • Spark tools

    • Added support for g.vcf.gz files in Spark. #4274 (#4463)
    • Spark tools can now write SAM files #4295. (#4471)
    • Added a --output-shard-tmp-dir argument to specify the parts directory for un-sharded BAM writing (#4666)
  • MarkDuplicatesSpark

    • Fixed MarkDuplicatesSpark so it handles supplementary reads with unmapped mates properly (#4785)
    • Added a distinction between PCR orientation and Optical Duplicates orientation in MarkDuplicatesSpark (#4752)
    • Fixed serialization crash in MarkDuplicatesSpark (#4778)
    • Fixed queryname partitioning bug where asking for queryname sort would result in reads with the same name being split between partitions (#4765)
    • Changed MarkDuplicatesSpark to sort non-queryname sorted bams before processing to ensure marking is consistent across shards (#4732)
    • Renamed some MarkDuplicatesSpark arguments to follow the "kebab-style" convention (#4715)
    • MarkDuplicatesSpark now uses the Picard OpticalDuplicatesFinder directly (#4750)
    • MarkDuplicatesSpark now uses Picard metrics code directly (#4779)
  • BwaSpark: disable sequence dictionary validation when aligning reads #4131 (#4308)

  • Funcotator

    • Major performance improvements due to added caching and other optimizations (#4740)
    • Various fixes (#4783) (#4817) (#4770)
      • Sanitize special characters when outputting VCF so that VCF validation passes
      • Fixed two problems: the ordering specified in the header did not match the variants, and hg19/b37 VCF datasources were being processed inconsistently, causing many missed annotations.
      • Added Funcotator tests for Clinvar and Gencode v28 in hg38, and mixed chr/no-chr GENCODE.
      • Eased restrictions so that Gencode v28 would be recognized as a valid gtf. Future versions of Gencode will not fail just based on the version number; a warning will be emitted instead.
      • Refining handling of transcripts with missing sequence info.
      • Refactored UTR VariantClassification handling.
      • Added warning statement when a transcript in the UTR has no sequence info (now is the same behavior as in protein coding regions).
      • Added tests to prevent regression on data source date comparison bug.
      • Fixed DNA Repair Genes getter script.
      • Fixed an issue in COSMIC to make it robust to bad COSMIC data.
      • Gencode no longer crashes when given an indel that starts just before an exon.
      • Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
      • Modified several methods to allow for negative start positions in preparation for allowing indels that start outside exons.
      • Fixed an issue in 5' UTR processing that would cause variant alleles with length > 1 to throw an exception (fixes issue #4712).
      • Fixed a bug in the version detection for Funcotator data sources that would prevent newer data source versions from being detected as compatible (date comparison error).
    • Gencode data sources now have names preserved from config files. (#4823)
  • GCNV kernel tunings (#4720)

    • Fixed a minor issue in sampling error estimation that could lead to NaN (as a result of division by zero)
    • Introduced separate internal and external admixing rates
    • Introduced two-stage inference for cohort denoising and calling
    • Capped phred-scaled qualities to maximum values permitted by machine precision in order to avoid NaNs and overflows.
    • Took a first step toward tracking and logging parameters during inference, starting with the ELBO history.
  • Validation of sequence dictionaries from multiple BAMs now throws warning instead of exception in CNV workflows. (#4758)

  • SV tools

    • Tweak BWA to allow "gappier" alignments in local assemblies (#4708)
    • Added a new experimental tool named CpxVariantReInterpreterSpark to extract barebone-annotated simple variants from a VCF containing complex variants produced by the GATK-SV discovery pipeline (#4602)
    • Fix "UnhandledCaseSeen" error in StructuralVariationDiscoveryPipelineSpark (#4677)
  • Added new SingleSequenceReferenceAligner class to align against an on-the-fly single contig reference using Bwa-Mem (#4780)

  • Updates to the conda environment for Python-based tools (#4749)

    • Fix #4741, where newer versions of conda appear to treat relative references in the environment yml as being relative to the yml file instead of relative to the cwd (based on observation).
    • Add a second conda yml file (gatkcondaenv.intel.yml) for environments that use Intel hardware acceleration and the Intel Tensorflow package (based on #4735).
    • Added a gradle task (condaEnvironmentDefinition) to generate the conda yml files from a single template to ensure that all the environment definitions remain in sync. This task also generates the Python package archive.
    • Added a gradle task (localDevCondaEnv) to create or update a local (non-Intel) conda environment. This is a shortcut for use during development when you're iteratively changing/testing Python code and want to update the conda env.
  • Added a new WEX test bam to src/test/resources/large, with a companion target interval list (#4756)

  • Add slightly modified version of GATK3 github issue template (#4796)

  • Updated htsjdk to 2.15.1 (#4830)
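
To make the --max-mnp-distance behavior listed above more concrete, the sketch below merges single-base substitutions whose positions differ by no more than a threshold into one multi-nucleotide polymorphism, filling any gap with reference bases. It is a simplified Python illustration under stated assumptions: all substitutions are taken to lie on the same haplotype, distance is measured as the difference between 1-based positions, and the function and type names are hypothetical. GATK's actual merging logic may differ in these details.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (1-based position, ref allele, alt allele)


def merge_mnps(snvs: List[Variant], reference: str,
               max_mnp_distance: int) -> List[Variant]:
    """Merge nearby substitutions into MNPs (illustrative, not GATK code).

    `reference` is the contig sequence; reference[i] is the base at
    1-based position i + 1.
    """
    merged: List[Variant] = []
    for pos, ref, alt in sorted(snvs):
        if merged:
            last_pos, last_ref, last_alt = merged[-1]
            last_end = last_pos + len(last_ref) - 1
            if 0 < pos - last_end <= max_mnp_distance:
                # Reference bases strictly between the two events.
                gap = reference[last_end:pos - 1]
                merged[-1] = (last_pos, last_ref + gap + ref,
                              last_alt + gap + alt)
                continue
        merged.append((pos, ref, alt))
    return merged


reference = "ACGTACGT"
snvs = [(2, "C", "T"), (3, "G", "A"), (6, "C", "G")]
print(merge_mnps(snvs, reference, 1))  # adjacent pair merged, third kept apart
print(merge_mnps(snvs, reference, 3))  # all three merged across the gap
```

With a threshold of 0 in this sketch nothing is merged, so each substitution is reported on its own.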

4.0.4.0

26 Apr 15:37

Highlights of this release include major performance improvements to MarkDuplicatesSpark, better sensitivity and precision in STR (short tandem repeat) contexts for Mutect2, support for a "genotype given alleles" mode in Mutect2, dbSNP support for Funcotator, and several important bug fixes to CombineGVCFs.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • MarkDuplicatesSpark

    • New, optimized version of the tool with greatly improved performance and scalability (#4656)
    • Note that this tool is still marked as beta, and has a number of known issues. The current version is suitable for evaluation/profiling purposes only.
  • Mutect2 improvements

    • Added a GGA (genotype given alleles) mode activated via the --genotyping-mode GENOTYPE_GIVEN_ALLELES and --alleles arguments (#4601)
    • Better sensitivity and precision in STR (short tandem repeat) contexts (#4690)
    • New, supported Mutect2 NIO-enabled WDL that works in Firecloud (#4710)
    • Better default AF (allele fraction) for Mutect2 tumor-normal mode (#4690)
    • Restored explicit PASS (as opposed to empty) filter in Mutect2 (#4644)
    • Fixed Mutect2 failure for germline resource without AF (#4607)
    • Fixed a bug in the Mutect2 WDL bamout where scatters with overlapping assembly regions failed (#4613)
    • Fixed extra filtering arguments being deactivated in the Mutect2 WDL due to a typo
  • CombineGVCFs: several important bug fixes

    • Fixed the handling of spanning deletions in ReferenceConfidenceVariantContextMerger, and used the correct types for the median calculation. (#4680)
    • Handle trailing reference blocks correctly (#4615)
    • Fixed the calculation of intermediate band interval start locations, and added a test. (#4681)
  • Funcotator

    • Added dbSNP support via a new VcfFuncotationFactory. (#4593)
    • Fixed the refContext annotation. (#4605)
    • Fixed the calculation of GC content. (#4608)
    • Fixed an HG38-related exception and improved logging. (#4563)
    • Note: only datasource releases 1.2.20180329 and later will work with this version of Funcotator
  • HaplotypeCaller: Fixed a bug that prevented the user from setting the --comp and --input-prior arguments (#4703)

  • CNNScoreVariants: Better numerical consistency between Python and Java, and a fix for a transpose bug (#4652)

  • CNV Tools

    • A new framework to support automated evaluation of GATK CNV (#4276)
    • Enabled zero eigensamples to be specified for CreateReadCountPanelOfNormals (#4502)
    • Exposed maximum chunk size in CNV panel of normals. (#4528)
    • Changed CNV PoN to filter on equality to interval median percentile. (#4503)
  • SV Tools

    • Breakpoint location and type inference unit (#4562)
    • Scaffold local assemblies (#4589)
    • Use the latest version of fermilite jni (#4622)
    • Updated the SV scripts to copy only a single bam file and index, and to respect the project parameter (#4646)
    • Various bug fixes (#4670) (#4623)
  • Added GCS (Google Cloud Storage) output support to the following tools: ApplyBQSR, SplitNCigarReads, ClipReads, LeftAlignIndels, RevertBaseQualityScores, and UnmarkDuplicates (#4695) (#4424)

  • Mark the --disable-tool-default-read-filters argument as advanced, and add a warning to its documentation string (#4671)

    • Many tools do not function correctly without their default read filters turned on, so this argument is intended only for advanced users who know what they're doing!
  • ParallelCopyGCSDirectoryIntoHDFSSpark: allow the tool to take a filename glob to subset files to copy (#4624)

  • Picard: updated to version 2.18.2 (#4676)
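
The Mutect2 GGA mode listed above is activated with the two arguments named in the release note, `--genotyping-mode GENOTYPE_GIVEN_ALLELES` and `--alleles`. The sketch below only assembles such a command line so it can be inspected or handed to a job runner. The `mutect2_gga_command` helper, the file names, and the sample name are placeholders, and the remaining arguments (`-R`, `-I`, `-tumor`, `-O`) are the usual Mutect2 inputs rather than anything specific to this release.

```python
import shlex

def mutect2_gga_command(reference, tumor_bam, tumor_sample, alleles_vcf, output_vcf):
    """Assemble a Mutect2 command line that genotypes the supplied alleles."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumor_bam,
        "-tumor", tumor_sample,
        # The two arguments the release note names for GGA mode:
        "--genotyping-mode", "GENOTYPE_GIVEN_ALLELES",
        "--alleles", alleles_vcf,
        "-O", output_vcf,
    ]

# Hypothetical file and sample names, for illustration only.
cmd = mutect2_gga_command(
    "ref.fasta", "tumor.bam", "tumor_sample", "alleles.vcf", "gga.vcf"
)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments, rather than one string, avoids quoting problems if a path contains spaces; `shlex.join` (Python 3.8+) is used only to print a copy-pasteable form.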

4.0.3.0

27 Mar 20:46

This release brings a major update to our experimental neural-network-based VariantRecalibrator replacement, initial MAF support in Funcotator, as well as some updates to Mutect2 and the CNV tools.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Summary of changes in this release:

  • A major update to our experimental neural-network-based suite of variant scoring tools, which will eventually replace the VariantRecalibrator (#4245)

    • The NeuralNetInferenceTool has been renamed to CNNScoreVariants
    • Baseline models are now included in the distribution.
    • Added additional tools to write tensors and to train your own models given a VCF of validated calls, an unfiltered VCF and a confident region: CNNVariantTrain, CNNVariantWriteTensors and FilterVariantTranches
    • Read-level 2D models are now supported via the --tensor-type read_tensor argument. At present, 2D models are significantly slower than 1D models.
  • Funcotator:

    • Added prototype support for outputting MAF files (and many bug fixes) (#4472)
  • Mutect2:

    • CalculateContamination now emits its segmentation, and the Mutect2 germline model uses it (#4509)
    • Option to emit (but still filter) all germline sites in Mutect2 (#4522)
    • Made the number of samples required to include a variant site in the Mutect2 PON (panel of normals) adjustable (#4566)
    • Added Oncotator filtering to the Mutect2 WDL. (#4423)
  • CNV tools:

    • Replaced CollectFragmentCounts with CollectReadCounts. (#4564)
    • Allowed use of zero eigensamples in DenoiseReadCounts. (#4411)
    • Changed filtering of normal hets on overlap with copy-ratio intervals in ModelSegments to be consistent with filtering of case hets. (#4510)
    • Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL scripts, unit tests, integration tests) (#4396)
  • Miscellaneous changes:

    • Concordance: added option to analyze contributions of different filters (#4520)
    • Exposed the -pairHMM/--pair-hmm-implementation argument in HaplotypeCaller, which was previously hidden (#4494)
    • Set the default samjdk.compression_level to 2 (was previously 1) (#4547)
    • Upgraded to Spark 2.2.0 (#4314)
    • Changed Spark sharding of queryname-sorted bams to better handle secondary and supplementary reads (#4473)
    • Added logging output to the bam writing step for spark tools (#4501)
    • git-lfs is now required to compile the GATK
    • Added a registry for deprecated/unported tools. (#4505)
    • Updated the Hadoop GCS connector from 1.6.1 to 1.6.3. (#4590)
    • Added a large runtime resource directory to git-lfs, and exposed it to the Docker build. (#4530)
    • We now include full tool documentation in the GATK binary distribution zip (#4377)
    • Made our maven artifacts much smaller by preventing gradle uploadArchives from including distZip and distTar (#4569)
    • Added chr20 and chr21 alt contigs to the GRCh38 reference snippet used for testing (#4548)
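
For runs where the previous compression level is preferred, one way to override the new `samjdk.compression_level` default is to pass it as a JVM system property through the `gatk` launcher's `--java-options` argument. The sketch below only assembles such a command line. The `gatk_with_compression_level` helper, the choice of PrintReads, and the file names are placeholders, and it assumes that a user-supplied property takes precedence over the launcher's default.

```python
import shlex

def gatk_with_compression_level(level, tool, tool_args):
    """Build a gatk command line that sets samjdk.compression_level for one run."""
    if not 0 <= level <= 9:
        raise ValueError("compression level must be between 0 and 9")
    # Passed to the JVM as a system property via the launcher's --java-options.
    java_options = f"-Dsamjdk.compression_level={level}"
    return ["gatk", "--java-options", java_options, tool, *tool_args]

# Hypothetical tool invocation and file names, for illustration only.
cmd = gatk_with_compression_level(1, "PrintReads", ["-I", "input.bam", "-O", "output.bam"])
print(shlex.join(cmd))
```

Lower levels write larger files more quickly and higher levels do the reverse, so the right value depends on whether disk space or runtime is the tighter constraint.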