v5 freq ht generation #720

Draft

mike-w-wilson wants to merge 78 commits into main from mw/v5_freq_calc

Contributor

mike-w-wilson commented Sep 30, 2025

No description provided.

mike-w-wilson added 30 commits

July 24, 2025 13:42


          Add v5 frequency calculation script framework

8c13f9a

- Create annotations directory structure for v5
- Add main frequency calculation script with differential analysis approach
- Support for both gnomAD and All of Us datasets
- Framework for identifying samples to remove due to relatedness/ancestry changes
- Age histogram calculation functionality
- Resource management and pipeline structure


          Enhance v5 frequency calculation with coverage integration and ancest…

aec9318

…ry change detection

- Add coverage data integration using AN from coverage computation
- Implement ancestry change detection between v4 and v5
- Add comprehensive documentation in README.md
- Enhance sample identification logic for both gnomAD and All of Us datasets
- Add error handling for missing resources
- Support for ancestry change frequency calculations
- Improve resource management and pipeline structure


          Add utility functions for coverage integration

705c452


          Rewrite v5 frequency calculation to use v4 frequency HT as base

6f30ff2


          Update documentation to reflect correct v5 frequency calculation appr…

491143d

…oach


          Update v5 frequency calculation to use VDS approach

40d2560


          Fix TODOs and uncomment function calls in v5 frequency script

bbabdb3


          Update frequency script to our style to increase readability

d53cf8e


          Merge branch 'main' into mw/v5_freq_calc

370693b


          Implement v5 frequency generation with gnomAD consent withdrawal hand…

e98ad90

…ling and AoU processing

- Refactor process_gnomad_dataset to handle consent sample withdrawals by subtracting frequencies from v4 freq table
- Implement efficient AoU processing using variant_data + all-sites AN approach
- Add comprehensive utility functions and checkpoints for performance
- Integrate FAF, grpmax, and age histogram calculations
- Add group membership resource integration for both gnomAD and AoU datasets
- Include robust error handling and logging throughout pipeline
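The subtraction step described in this commit can be pictured with a minimal plain-Python sketch. The `CallStats` container and the `subtract_callstats` helper are hypothetical names used only for illustration; the PR itself operates on Hail tables. Clamping negative results to zero is an assumption here, loosely mirroring the later "Set negatives to 0 to investigate negatives" commit, and the example numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallStats:
    """Call statistics for one frequency group (hypothetical container)."""

    AC: int  # alternate allele count
    AN: int  # total allele number
    homozygote_count: int  # number of homozygous-alternate individuals


def subtract_callstats(base: CallStats, withdrawn: CallStats) -> CallStats:
    """Remove the withdrawn samples' contribution from the base call stats.

    Negative results are clamped to 0, which is an assumption of this sketch.
    """
    return CallStats(
        AC=max(base.AC - withdrawn.AC, 0),
        AN=max(base.AN - withdrawn.AN, 0),
        homozygote_count=max(base.homozygote_count - withdrawn.homozygote_count, 0),
    )


# Illustrative values only: a v4-style entry and the consent-withdrawal
# samples' contribution at the same site and frequency group.
v4 = CallStats(AC=120, AN=152_000, homozygote_count=3)
consent = CallStats(AC=4, AN=1_732, homozygote_count=1)
v5 = subtract_callstats(v4, consent)
print(v5)
```

In practice the same subtraction would be applied element-wise across every entry of the frequency array, so the withdrawn samples' array must be indexed identically to the base array.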


          Remove redundant functions, add proper AN calc

305b9a0


          Move to sparse aggs for consent ACs,homalt

f71f81e


          Add resources and remove unused imports and functions, clarify what d…

57d1910

…ataset is being used


          Only do a single pass of the data for freq field aggs


          Drop v5 downsamplings from constants

b2711d6


          Update annotation resources

9652a2e


          Update annotation constants

028f62a


          Correct VDS imported

125da25


          Add remove hard filtered samples false to get around unfound file

63f0549


          Update test args for data test or runtime test

9c4796f


          Update process dataset functions to use utility functions, increasing…

a626890

… readability


          Merge remote-tracking branch 'origin/main' into mw/v5_freq_calc

f7e09c6


          Correct group membership ht call

18aab39


          Correct group membership ht calls

709c462


          Fix filter_partitions call, int to a list

ba767e9


          Properly handle GATK versions in hom alt depletion correction for gno…

23653a1

…mad vs aou


          Mock AN for testing

ba4ba85


          Apply v3 fix to freq hom alt depletion

b31c912


          Update freq calc

c605c0e


          Correct fold in ac hom alt calc

a6b1b91

mike-w-wilson added 16 commits

October 2, 2025 15:19


          Set negatives to 0 to investigate negatives

ff0661f


          Drop meta print

0cecd0a


          Filter to consent variants

8fe908a


          Drop second meta print

780da86


          Copy v3/v4 genomes sex ploidy, adj order

7ef8502


          Add notes for consent hom alt fix approach

3f3c91d


          Reference same vmt for adj, hom alt pass

020b16c


          Fix adj annotation

bc0f4cd


          Filter to common sites before doing any work in freq or hists

bcacd84


          Process hists and freqs together

2fc7b24


          Remove unused functions for AoU freq

c89185b


          Fix genotype call in age hists

11e31d1


          Fix gt call in age hists, second attempt

009e209


          Age histogram does not expect an integer...

7c8cf39


          Remove unused overwrite

267888b


          Updated README to new workflow

321f345

ch-kr reviewed

Contributor

ch-kr left a comment

I only really read through mt_hists_fields, _prepare_consent_vds, and _calculate_consent_frequencies, but I'm happy to review more if helpful. The adjustment order of adj -> sex ploidy adjustment -> homalt hotfix looks the same as v3's, so that LGTM.

one thing I didn't realize until reviewing this PR is that we didn't adjust the quality histograms between v3 and v4 for the genomes, which makes me think we shouldn't adjust these for v5 either. maybe we should discuss this at a meeting?

gnomad_qc/v5/annotations/generate_frequency.py Outdated

+                  )
+                  # For genomes, fixed_homalt_model is always False since we apply v3-style correction to all samples
+                  # (following v3 and v4 genomes approach - no GATK version-based differentiation)
+                  vmt = vmt.annotate_cols(fixed_homalt_model=hl.bool(False))

Contributor

ch-kr Oct 3, 2025

since this script is only going to change the gnomad genomes (and this will always be False), I vote we remove this field and update high_ab_het to no longer expect it

gnomad_qc/v5/annotations/generate_frequency.py

+                  )
+                  vds = hl.vds.VariantDataset(vds.reference_data, vmt)
+                  vds = vds.checkpoint(new_temp_file("consent_samples_vds", "vds"))

Contributor

ch-kr Oct 3, 2025

did you add this checkpoint because of the sample filtering above?

Contributor Author

mike-w-wilson Oct 6, 2025

Yes, but I guess it should go further down in this function since there's not much work happening before it.

gnomad_qc/v5/annotations/generate_frequency.py Outdated

+                  vmt = vds.variant_data
+                  vmt = vmt.annotate_rows(v4_af=v4_freq_ht[vmt.row_key].freq[0].AF)
+                  # This follows the v3/v4 genomes workflow for adj and sex adjusted genotypes.

Contributor

ch-kr Oct 3, 2025

should this comment also mention that the homalt hotfix is applied after this also for consistency with v3, even though that should actually happen first?

the correct order is homalt hot fix -> adjust sex ploidy -> annotate adj. if I read this right, it looks like we might have done adjust sex ploidy -> annotate adj -> homalt hot fix for v4?

Contributor Author

mike-w-wilson Oct 6, 2025 (edited)

Well, in v4 we stored the original _high_ab_het_ref info using _het_non_ref, which is added before anything else happens.

gnomad_qc/gnomad_qc/v4/annotations/generate_freq.py

Line 272 in a6e0e7f

vmt = vmt.annotate_entries(_het_non_ref=vmt.LGT.is_het_non_ref())

gnomad_qc/v5/annotations/generate_frequency.py

Comment on lines 190 to 201

+                  ab_cutoff = 0.9
+                  ab_expr = vmt.AD[1] / vmt.DP
+                  vmt = vmt.select_entries(
+                      "AD",
+                      "DP",
+                      "GQ",
+                      "_het_non_ref",
+                      "adj",
+                      GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
+                      _het_ab=ab_expr,
+                      _high_ab_het_ref=(ab_expr > ab_cutoff) & ~vmt._het_non_ref,
+                  )

Contributor

ch-kr Oct 3, 2025

we can remove these (het_ab, high_ab_het_ref) if we move the hist code into the coverage/AN PR since hom_alt_depletion_fix will recalculate them

    vmt = vmt.select_entries(
        "AD",
        "DP",
        "GQ",
        "_het_non_ref",
        "adj",
        GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
    )
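For readers following the thread, the two entry fields under discussion come from a simple allele-balance test. Below is a minimal plain-Python sketch of the expression in the diff above (`(ab_expr > ab_cutoff) & ~vmt._het_non_ref`). The function name `is_high_ab_het_ref` is hypothetical, and returning `False` when depth is 0 is an assumption of this sketch, not behavior confirmed by the PR.

```python
AB_CUTOFF = 0.9  # same value as `ab_cutoff` in the diff above


def is_high_ab_het_ref(ad_alt: int, dp: int, het_non_ref: bool) -> bool:
    """Flag a genotype whose alt allele balance exceeds the cutoff.

    Mirrors `(AD[1] / DP > ab_cutoff) & ~_het_non_ref` for a single entry.
    A depth of 0 returns False (assumption made for this sketch).
    """
    if dp <= 0:
        return False
    allele_balance = ad_alt / dp
    return allele_balance > AB_CUTOFF and not het_non_ref


# 19 of 20 reads support the alt allele, and the call is not het non-ref.
print(is_high_ab_het_ref(ad_alt=19, dp=20, het_non_ref=False))
```

Because the flag depends only on `AD`, `DP`, and `_het_non_ref`, all of which the suggested `select_entries` call keeps, it can be recomputed downstream.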

gnomad_qc/v5/annotations/generate_frequency.py Outdated

+                                  consent_freq_ht.AC[i] > 0,
+                                  consent_freq_ht.AC[i]
+                                  / hl.float32(866 * 2),  # consent_ans_ht[consent_freq_ht.key].AN[i],
+                                  0.0,

Contributor

ch-kr Oct 3, 2025

do you need to explicitly set AF to 0.0 with this if_else? could you do something like hl.float64(consent_freq_ht.AC[i] / consent_freq_ht.AN[i])?

Contributor Author

mike-w-wilson Oct 6, 2025

So I had this and removed it because I kept seeing missing annotations, but it seems that was another issue. Added it back.
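The point of the suggestion can be shown with a small plain-Python sketch: dividing AC by AN already yields 0.0 when AC is 0, so a separate branch for that case is unnecessary. The helper name `allele_frequency` is hypothetical, and treating AN == 0 as missing (`None`) is a choice made for this sketch rather than a statement about how Hail evaluates the division.

```python
from typing import Optional


def allele_frequency(ac: int, an: int) -> Optional[float]:
    """Return AF = AC / AN, or None (missing) when AN is 0.

    No special case is needed for AC == 0: the division already gives 0.0.
    """
    if an == 0:
        return None
    return ac / an


print(allele_frequency(0, 1732))  # AC of 0 gives an AF of 0.0 without an extra branch
print(allele_frequency(4, 1732))
print(allele_frequency(0, 0))     # no called alleles, so AF is left missing
```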

gnomad_qc/v5/annotations/generate_frequency.py Outdated

+. Calculating frequencies and age histograms for consent withdrawal samples
+. Subtracting both frequencies and age histograms from v4 frequency HT
+. Only overwriting fields that were actually updated in the final output
+. Computing FAF, grpmax, gen_anc_faf_max, and inbreeding coefficient

Contributor

ch-kr Oct 3, 2025

should we also move inbreeding into the coverage/AN code?

Contributor Author

mike-w-wilson Oct 6, 2025

You can just use AC/AN/hom alt for inbreeding now https://github.com/broadinstitute/gnomad_methods/blob/52a49c8615028b0f9baa5aaf21b8dfd9a30dadae/gnomad/utils/annotations.py#L1046
so I think it can stay here since we'll have access to all of those
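As a rough illustration of why AC, AN, and the homozygote count are sufficient, here is a plain-Python sketch of the standard site-level estimate F = 1 - (observed hets / expected hets) for a bi-allelic site in diploid samples. The function name `site_inbreeding_coeff` is hypothetical, the diploid assumption is mine, and the linked gnomad_methods function may differ in detail.

```python
from typing import Optional


def site_inbreeding_coeff(ac: int, an: int, homozygote_count: int) -> Optional[float]:
    """Estimate F = 1 - observed_hets / expected_hets from call stats.

    Assumes a bi-allelic site and diploid samples (AN = 2 * n_samples).
    Returns None when the expected heterozygote count is 0.
    """
    if an == 0:
        return None
    n_samples = an / 2
    q = ac / an  # alternate allele frequency
    p = 1 - q    # reference allele frequency
    expected_hets = 2 * p * q * n_samples  # Hardy-Weinberg expectation
    if expected_hets == 0:
        return None
    # Each hom-alt individual carries two alt alleles; the remainder are hets.
    observed_hets = ac - 2 * homozygote_count
    return 1 - observed_hets / expected_hets


print(site_inbreeding_coeff(ac=100, an=1000, homozygote_count=0))
```

A value near 0 is consistent with Hardy-Weinberg proportions, while strongly negative values indicate an excess of heterozygotes.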

mike-w-wilson and others added 8 commits

October 6, 2025 12:36


          Update gnomad_qc/v5/annotations/generate_frequency.py

0269efa

Co-authored-by: Katherine Chao <[email protected]>


          Drop hom_alt_fixed since it is always false

8ca5776


          Testing freq struct

31e757a


          Revert back to regular AF calc for conset AF

c5acca8


          Add freq based calcs back into script and filter freq_ht for tests fi…

43a1e9d

…nal merge


          Move freq_ht filtering up during tests

cfa4a04


          Write out test freq ht to shrink join

b9dc155


          Missed row index

80bcb83

ch-kr mentioned this pull request

Add resource function to read in gnomad genomes v5 VDS #722

Merged

mike-w-wilson added 4 commits

October 8, 2025 15:21


          Naive coalesce updated freq HT for test

b34ed5f


          Remove unnecessary functions and overly complex approaches

4dd62c1


          Merge remote-tracking branch 'origin/main' into mw/v5_freq_calc

d2bb01a


          Update to use v5 vds functions
