You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
How can I apply downstream filtering steps (minGQ filtering, FilterOutlierSamples, BatchEffect, and FilterCleanupQualRecalibration) to the cleaned VCF to further control the false discovery rate?
The text was updated successfully, but these errors were encountered:
MinGQ filtering is being deprecated in favor of the FilterGenotypes workflow. We are still working on documentation but here is an outline:
Main steps:
JoinRawCalls
FormatVCFForGatk (if needed)
SVConcordance
FilterGenotypes
Instructions:
Generate inputs and run JoinRawCalls from the main branch to produce a clustered VCF of raw calls for the entire cohort.
Note: the input VCFs here use svtk-style END fields, which need to be modified no matter what version of the pipeline was used, so make sure FormatVcfForGatk.formatter_args is “--fix-end”
If your “cleaned” VCF was generated prior to v0.22-beta, you must run FormatVcfForGatk to produce a GATK-formatted cleaned VCF. If you used a later version of the pipeline, you can skip this step.
IMPORTANT: The "formatter_args" parameter may need to be modified depending on exactly what version of the pipeline was used.
If your cleaned VCF has invalid END tags, you must include the "--fix-end" flag. You can check to see if there are any END fields less than the POS of the record (e.g. for BND records).
If your cleaned VCF has GQ values are scaled to 0-999, you must use the "--scale-down-gq" flag to rescale to 0-99. If you’re unsure, you should be able to tell by looking at the VCF.
Run SVConcordance with the GATK-formatted cleaned VCF as "eval_vcf" and JoinRawCalls VCF as "truth_vcf".
Run FilterGenotypes with the SVConcordance VCF
The “ploidy_table” is generated during JoinRawCalls but wasn’t included as an output for the v0.28.4-beta release. If you kept intermediate files from JoinRawCalls, you can use the output from the CreatePloidyTableFromPed task. Otherwise, you will need to generate the ploidy table manually using this script: https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/ploidy_table_from_ped.py
Recommended default "no_call_rate_cutoff" is 0.05 but can be adjusted
Note these thresholds were generated for the first AoU-SV release, which had relatively little technical noise. You may want to iterate on these thresholds for your dataset until results are satisfactory.
We recommend using de novo rates as a proxy for your FDR. Plots of this are generated in MainVcfQc if you have at least one trio.
7,000-11,000 are typical counts for total (non-BND) SVs per genome.
Higher thresholds will increase precision.
For larger call sets, you can increase the amount of sharding with RecalibrateGq.recalibrate_records_per_shard and filter_vcf_records_per_shard depending on which parts are slow. Defaults for both are 20,000. It is recommended to try to keep the total number of shards under 1000.
How can I apply downstream filtering steps (minGQ filtering, FilterOutlierSamples, BatchEffect, and FilterCleanupQualRecalibration) to the cleaned VCF to further control the false discovery rate?
The text was updated successfully, but these errors were encountered: