Combined-VCF-options.md

The following options are useful when producing a combined VCF (during the loading/importing or querying phase) similar to the one produced by the GATK CombineGVCFs tool.

Note that the Java interface also produces combined VCF records (as VariantContext objects), so the following options apply when using the Java interface as well.

The options can be specified in the [[loader JSON file|Importing-VCF-data-into-GenomicsDB#execution-parameters-for-the-import-program]] if the combined VCF is being produced during the load/import phase. Otherwise, these options must be specified in the [[query JSON file|Querying-GenomicsDB#json-configuration-file-for-a-query]].

  • reference_genome (type: string or list of strings, mandatory): Path to reference genome (indexed FASTA file).

  • vcf_header_filename (type: string or list of strings, optional): Path to a template VCF header file. All lines in this template will be present in the header of the combined VCF(s). This template should NOT contain sample/callset names (i.e. the line starting with #CHROM). Contigs and fields present in the vid_mapping_filename file will be added to the header of the combined VCF if they are not present in the template header. If this field is omitted, a simple header will be produced containing the contigs and fields described in the vid_mapping_filename file.

  • max_diploid_alt_alleles_that_can_be_genotyped (type: int, optional, default: 50): For certain locations, the number of alternate alleles in the combined VCF record can get very large (we have seen ~700 alternate alleles for some sample sets). For such locations, the large size of the VCF record causes the program to consume a massive amount of memory. Additionally, in some cases, the VCF spec is unable to handle such large records, especially when there are fields such as PL whose length depends on the number of genotypes. For a diploid sample with n alternate alleles, PL holds (n+1)(n+2)/2 values: 1,326 per sample at n = 50 and 246,051 at n = 700. This parameter helps keep the size of the combined VCF records in check: if the number of alternate alleles is greater than the value of max_diploid_alt_alleles_that_can_be_genotyped, fields such as PL are dropped for that VCF record. This fix is identical to the one implemented in GATK CombineGVCFs (including the default value).

  • vcf_output_filename (type: string or list of strings, optional): If a combined VCF is being produced, this parameter specifies the path at which the output VCF will be created. If this parameter is omitted, the output VCF is printed on stdout.

  • vcf_output_format (type: string, optional, default: <empty>): Output format can be one of the following strings: "z[0-9]" (compressed VCF), "b[0-9]" (compressed BCF) or "bu" (uncompressed BCF). If nothing is specified, the default is uncompressed VCF.

  • produce_GT_field (type: boolean, optional, default false): By default, the GT field in the combined VCF records is set to missing to match the output produced by GATK CombineGVCFs. When produce_GT_field is set to true, the GT field is retrieved from TileDB/GenomicsDB instead.

  • produce_GT_with_min_PL_value_for_spanning_deletions (type: boolean, optional, default false): This flag is applicable only when produce_GT_field is true and only for spanning deletions. By default (or when this flag is set to false), the GT field for spanning deletions in the combined VCF records corresponds to the value stored in TileDB/GenomicsDB for the deletion. The allele indexes may get reassigned since the number of alleles in the spanning deletion may be reduced. For example:

              POS	REF	ALT	GT
    Original: 1000	ATGC	TTGC,A,<NON_REF>	0/2
    Spanning: 1001	T	*,<NON_REF>	0/1  #gets changed to 0/1 since number of alleles is reduced
    

    However, when this flag is set to true, the value of the GT field for spanning deletions corresponds to the genotype with the smallest PL value, that is, the most likely genotype. Thus, the GT value in the spanning deletion could become 1/1.

    See the discussion in Intel-HLS/GenomicsDB#161 for a detailed example, especially the comments by @ldgauthier.

  • index_output_VCF (type: boolean, optional, default false): If a compressed combined VCF file is being created (see vcf_output_filename and vcf_output_format), setting this parameter to true will create an index for the output file - tabix for compressed VCFs and csi for compressed BCFs.

  • produce_FILTER_field (type: boolean, optional, default false): By default, the FILTER field in the combined VCF records is set to missing to match the output produced by GATK CombineGVCFs. To retrieve the FILTER field from TileDB/GenomicsDB, include FILTER in the list of queried attributes (or set scan_full to true) and set produce_FILTER_field to true.

  • sites_only_query (type: boolean, optional, default false): When set to true, GenomicsDB will NOT produce any FORMAT fields in the resulting VCF records (no samples). ID, QUAL, FILTER and INFO fields will be produced.
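
The two behaviours described for produce_GT_with_min_PL_value_for_spanning_deletions can be illustrated with a small Python sketch. It is not GenomicsDB's implementation: the function names, the allele map and the PL values are all hypothetical, and the only thing taken from outside this page is the standard VCF ordering of diploid genotypes (genotype j/k with j <= k sits at index k(k+1)/2 + j). The map omits the TTGC allele because the text above does not say how it is handled.

```python
def diploid_genotypes(num_alleles):
    """Return diploid genotypes (j, k), j <= k, in VCF PL order."""
    # VCF places genotype j/k at index k*(k+1)/2 + j.
    return [(j, k) for k in range(num_alleles) for j in range(k + 1)]


def remap_gt(gt, allele_map):
    """Rewrite allele indexes after the allele list has been reduced."""
    return tuple(sorted(allele_map[allele] for allele in gt))


def gt_with_min_pl(pl, num_alleles):
    """Pick the genotype whose PL value is smallest (most likely)."""
    genotypes = diploid_genotypes(num_alleles)
    if len(pl) != len(genotypes):
        raise ValueError("PL length does not match the number of genotypes")
    best_index = min(range(len(pl)), key=pl.__getitem__)
    return genotypes[best_index]


# Flag set to false: the stored GT is kept and only re-indexed.
# Original alleles: 0=ATGC, 1=TTGC, 2=A, 3=<NON_REF>
# Spanning alleles: 0=T, 1=*, 2=<NON_REF>
# Hypothetical map from original index to spanning index.
allele_map = {0: 0, 2: 1, 3: 2}
remapped = remap_gt((0, 2), allele_map)

# Flag set to true: GT follows the smallest PL value.
# Hypothetical PL values for the three alleles of the spanning record.
hypothetical_pl = [45, 20, 0, 60, 50, 90]
min_pl_gt = gt_with_min_pl(hypothetical_pl, num_alleles=3)

print("remapped GT:", "/".join(map(str, remapped)))
print("min-PL GT:", "/".join(map(str, min_pl_gt)))
```

With these made-up PL values the smallest entry is the third one, which is the 1/1 genotype in VCF order, so the second result differs from the re-indexed 0/1.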

Performance tuning options:

  • combined_vcf_records_buffer_size_limit (type: integer, optional, default 1048576): This parameter determines the size of the memory buffer (in bytes) to hold the combined VCF records in the following scenarios:
    • If the combined VCF records are being produced in the load/import phase and software pipelining is used to run multiple stages of the loader in parallel. Records are flushed to disk when this buffer is full.
    • If the Java interface is used to produce combined VCF records. Control returns to the Java code when this buffer is full.
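
As a rough sketch of how the options above might be assembled, the following Python snippet merges them into a configuration dictionary and serializes it as JSON. The option names and default values are the ones listed on this page; every file path is hypothetical, the helper merge_into_config is made up for illustration, and base_config is an empty stand-in for the other fields that a real loader or query JSON file requires (see the linked pages for those).

```python
import json

# Option names come from the list above; all paths are hypothetical.
combined_vcf_options = {
    "reference_genome": "/data/ref/genome.fasta",
    "vcf_header_filename": "/data/template_header.vcf",
    "vcf_output_filename": "/data/out/combined.bcf",
    "vcf_output_format": "bu",  # uncompressed BCF
    "max_diploid_alt_alleles_that_can_be_genotyped": 50,
    "produce_GT_field": True,
    "produce_FILTER_field": False,
    "sites_only_query": False,
    "combined_vcf_records_buffer_size_limit": 1048576,  # bytes
}


def merge_into_config(base_config, options):
    """Return a copy of base_config with the combined-VCF options added."""
    merged = dict(base_config)
    merged.update(options)
    return merged


# Placeholder for the remaining loader/query fields, which are not shown here.
base_config = {}

config_text = json.dumps(
    merge_into_config(base_config, combined_vcf_options), indent=4
)
print(config_text)
```

The resulting text could be written to the loader JSON file when the combined VCF is produced during import, or to the query JSON file otherwise.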

INFO and QUAL field combine operations

See [[this section|Importing-VCF-data-into-GenomicsDB#fields-information]] to find out how to specify combine operations for INFO and QUAL fields in the [[vid JSON file|Importing-VCF-data-into-GenomicsDB#information-about-vcfs-for-the-import-program]]. In particular, see the subsection labeled VCF_field_combine_operation.