4.beta.1
Pre-release
This release brings together most of the tools we intend to include in the final GATK 4.0 release. Some tools are stable and ready for production use, while others are still in a beta or experimental stage of development. You can see which tools are marked as beta/experimental by running `gatk-launch --list`.
A Docker image for this release can be found in the broadinstitute/gatk repository on Docker Hub. Within the image, `cd` into `/gatk`, then run `gatk-launch` commands as usual.
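As a rough sketch of the Docker workflow described above, the helper below prints (rather than runs) a `docker run` command that changes into `/gatk` and invokes `gatk-launch`. The image tag `4.beta.1`, the `./gatk-launch` relative path, and the helper name `gatk_in_docker` are assumptions for illustration and are not stated in these notes; check Docker Hub for the actual tag.

```shell
# Assumptions (not confirmed by the release notes): the image tag equals the
# release name, and gatk-launch is invoked as ./gatk-launch from /gatk.
GATK_IMAGE="broadinstitute/gatk"
GATK_TAG="4.beta.1"

# Hypothetical helper: print the docker command that would run a
# gatk-launch invocation inside the image. Nothing is executed here.
gatk_in_docker() {
    echo "docker run --rm $GATK_IMAGE:$GATK_TAG sh -c 'cd /gatk && ./gatk-launch $*'"
}

# Show the command that would list the tools with their beta/experimental status.
gatk_in_docker --list
```

Printing the command first makes it easy to inspect before running it; piping the output to `sh` would execute it on a machine with Docker installed.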
Major Known Issues
- GCS (Google Cloud Storage) inputs/outputs are only supported by a subset of the tools. For the 4.0 general release, we intend to extend support to all tools.
  - In particular, GCS support in most of the Spark tools is currently very limited when not running on Google Cloud Dataproc.
  - Writing BAMs to a GCS bucket on Spark is broken in some tools due to #2793.
- `HaplotypeCaller` and `HaplotypeCallerSpark` are still in development and not ready for production use. Their output does not currently match the output of the GATK3 version of the tool in all respects.
- Picard tools bundled with the GATK are currently based on an older release of Picard. For the 4.0 general release, we plan to update to the latest version.
- CRAM reading can fail with an MD5 mismatch when the reference or reads contain ambiguity codes (#3154).
- The `IndexFeatureFile` tool is currently disabled due to serious Tabix-index-related bugs in htsjdk (#2801).
- The `GenomicsDBImport` tool (the GATK4 replacement for `CombineGVCFs`) experiences transient GCS failures/timeouts when run at massive scale (#2685).
- CNV workflows have been evaluated for use on whole-exome sequencing data, but evaluations for use on whole-genome sequencing (WGS) data are ongoing. Additional tuning of various parameters (for example, those for `PerformSegmentation` or `AllelicCNV` in the somatic workflow) may improve performance or decrease runtime on WGS.
- Creation of a panel of normals with `GermlineCNVCaller` typically requires a Spark cluster.
- The SV tools pipeline is under active development and is missing many major features that are planned for its public release. The current pipeline produces deletion, insertion, and inversion calls for a single sample based on local assembly of breakpoints. Known issues and missing features include but are not limited to:
  - Inversions and breakpoints due to complex events are not properly filtered and annotated in some cases. Some inversion calls produced by the pipeline are due to uncharacterized complex events such as inverted and dispersed duplications. We plan to implement an overhauled, more complete detection system for complex SVs in future releases.
  - The SV pipeline does not incorporate read-depth-based information. We plan to provide integration with read-depth-based detection methods in the future, which will increase the number of detectable variants and assist in the characterization of complex SVs.
  - The SV pipeline does not yet genotype variants or provide genotype likelihoods.
  - The SV pipeline has only been tested on Spark clusters with a limited set of configurations in Google Cloud Dataproc. We have provided scripts in the test directory for creating and running the pipeline. Running in other configurations may cause problems.
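For the CRAM MD5-mismatch issue listed above (#3154), one way to gauge whether a reference might be affected is to count the sequence lines that contain IUPAC ambiguity codes. The following is a minimal sketch, not a GATK feature: the function name `count_ambiguous_lines` is hypothetical, and treating `N` as unaffected is an assumption that may not match the exact conditions of the bug.

```shell
# Hypothetical helper: count FASTA sequence lines containing IUPAC ambiguity
# codes other than N (assumption: N is not part of the problem; adjust the
# character class if it is). Header lines starting with '>' are skipped so
# that letters in sequence names are not counted.
count_ambiguous_lines() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -v '^>' "$1" | grep -ci '[RYSWKMBDHV]' || true
}
```

Calling `count_ambiguous_lines reference.fasta` (with a placeholder file name) prints the number of matching lines; a non-zero count suggests the reference contains ambiguity codes and may be worth testing before relying on CRAM input.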