@mike-w-wilson
Contributor

No description provided.

- Create annotations directory structure for v5
- Add main frequency calculation script with differential analysis approach
- Support for both gnomAD and All of Us datasets
- Framework for identifying samples to remove due to relatedness/ancestry changes
- Age histogram calculation functionality
- Resource management and pipeline structure
…ry change detection

- Add coverage data integration using AN from coverage computation
- Implement ancestry change detection between v4 and v5
- Add comprehensive documentation in README.md
- Enhance sample identification logic for both gnomAD and All of Us datasets
- Add error handling for missing resources
- Support for ancestry change frequency calculations
- Improve resource management and pipeline structure
…ling and AoU processing

- Refactor process_gnomad_dataset to handle consent sample withdrawals by subtracting frequencies from v4 freq table
- Implement efficient AoU processing using variant_data + all-sites AN approach
- Add comprehensive utility functions and checkpoints for performance
- Integrate FAF, grpmax, and age histogram calculations
- Add group membership resource integration for both gnomAD and AoU datasets
- Include robust error handling and logging throughout pipeline
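For readers following along, here is a minimal plain-Python sketch of the "subtract consent-withdrawal frequencies from the v4 freq table" idea from the commit summary above. It is not the Hail code in this PR: `CallStats` and `subtract_call_stats` are hypothetical names, and the sketch assumes the withdrawn samples' AC, AN, and homozygote counts have already been computed per variant and per frequency group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallStats:
    """Per-variant call statistics for one frequency group (hypothetical container)."""

    AC: int  # alternate allele count
    AN: int  # number of called alleles
    homozygote_count: int  # number of homozygous-alternate individuals

    @property
    def AF(self) -> float:
        # Return 0.0 when no alleles remain instead of dividing by zero.
        return self.AC / self.AN if self.AN > 0 else 0.0


def subtract_call_stats(v4: CallStats, consent: CallStats) -> CallStats:
    """Remove the withdrawn samples' contribution from the v4 counts."""
    if (
        consent.AC > v4.AC
        or consent.AN > v4.AN
        or consent.homozygote_count > v4.homozygote_count
    ):
        # The withdrawn samples are a subset of v4, so their counts can never be larger.
        raise ValueError("consent counts exceed v4 counts")
    return CallStats(
        AC=v4.AC - consent.AC,
        AN=v4.AN - consent.AN,
        homozygote_count=v4.homozygote_count - consent.homozygote_count,
    )


v4_stats = CallStats(AC=10, AN=1000, homozygote_count=2)
consent_stats = CallStats(AC=2, AN=20, homozygote_count=1)
updated = subtract_call_stats(v4_stats, consent_stats)
print(updated, updated.AF)
```

AF is recomputed from the updated counts rather than subtracted, since frequencies themselves are not additive across sample sets.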
Contributor

@ch-kr ch-kr left a comment

I only really read through mt_hists_fields, _prepare_consent_vds, and _calculate_consent_frequencies, but I'm happy to review more if helpful. The adjustment order of adj -> sex ploidy adjustment -> homalt hotfix looks the same as v3's, so that LGTM.

One thing I didn't realize until reviewing this PR is that we didn't adjust the quality histograms between v3 and v4 for the genomes, which makes me think we shouldn't adjust them for v5 either. Maybe we should discuss this at a meeting?

)
# For genomes, fixed_homalt_model is always False since we apply v3-style correction to all samples
# (following v3 and v4 genomes approach - no GATK version-based differentiation)
vmt = vmt.annotate_cols(fixed_homalt_model=hl.bool(False))
Contributor

Since this script will only change the gnomAD genomes (and this will always be False), I vote we remove this field and update high_ab_het to no longer expect it.

)

vds = hl.vds.VariantDataset(vds.reference_data, vmt)
vds = vds.checkpoint(new_temp_file("consent_samples_vds", "vds"))
Contributor

Did you add this checkpoint because of the sample filtering above?

Contributor Author

Yes, but I guess it should go further down in this function, since there's not much work happening before it.

vmt = vds.variant_data
vmt = vmt.annotate_rows(v4_af=v4_freq_ht[vmt.row_key].freq[0].AF)

# This follows the v3/v4 genomes workflow for adj and sex adjusted genotypes.
Contributor

Should this comment also mention that the homalt hotfix is applied after this, for consistency with v3, even though it should actually happen first?

The correct order is homalt hotfix -> adjust sex ploidy -> annotate adj. If I read this right, it looks like we might have done adjust sex ploidy -> annotate adj -> homalt hotfix for v4?

Contributor Author

@mike-w-wilson mike-w-wilson Oct 6, 2025

Well, in v4 we stored the original _high_ab_het_ref info using _het_non_ref, which is added before anything else happens.

vmt = vmt.annotate_entries(_het_non_ref=vmt.LGT.is_het_non_ref())

Comment on lines 190 to 201
ab_cutoff = 0.9
ab_expr = vmt.AD[1] / vmt.DP
vmt = vmt.select_entries(
    "AD",
    "DP",
    "GQ",
    "_het_non_ref",
    "adj",
    GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
    _het_ab=ab_expr,
    _high_ab_het_ref=(ab_expr > ab_cutoff) & ~vmt._het_non_ref,
)
Contributor

We can remove these (_het_ab, _high_ab_het_ref) if we move the hist code into the coverage/AN PR, since hom_alt_depletion_fix will recalculate them.

    vmt = vmt.select_entries(
        "AD",
        "DP",
        "GQ",
        "_het_non_ref",
        "adj",
        GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
    )
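As background for this thread, a rough plain-Python sketch of what the two entry fields capture and how a homalt hotfix could use them. This is an illustration, not the gnomad_methods implementation: the function names are hypothetical, `AB_CUTOFF` mirrors `ab_cutoff = 0.9` from the snippet, and `AF_CUTOFF` is an assumed placeholder because no AF threshold appears in this conversation. The sketch also checks explicitly that the genotype is heterozygous, which the snippet's `_high_ab_het_ref` expression does not do.

```python
from typing import Optional, Sequence, Tuple

AB_CUTOFF = 0.9   # matches ab_cutoff in the reviewed snippet
AF_CUTOFF = 0.01  # assumption: placeholder value, not taken from this PR


def het_allele_balance(ad: Sequence[int], dp: int) -> Optional[float]:
    """Fraction of reads supporting the first alternate allele (AD[1] / DP)."""
    if not dp:
        return None  # allele balance is undefined without depth
    return ad[1] / dp


def is_high_ab_het_ref(
    gt: Tuple[int, int], ad: Sequence[int], dp: int, het_non_ref: bool
) -> bool:
    """True for a ref/alt het whose allele balance exceeds the cutoff."""
    ab = het_allele_balance(ad, dp)
    is_het = gt[0] != gt[1]
    return is_het and not het_non_ref and ab is not None and ab > AB_CUTOFF


def apply_homalt_hotfix(
    gt: Tuple[int, int],
    ad: Sequence[int],
    dp: int,
    het_non_ref: bool,
    site_af: float,
) -> Tuple[int, int]:
    """Promote a high-AB ref/alt het to hom-alt at sufficiently common sites."""
    if site_af > AF_CUTOFF and is_high_ab_het_ref(gt, ad, dp, het_non_ref):
        return (1, 1)
    return gt


# 19 of 20 reads support the alt allele at a site with AF 5%: treated as hom-alt.
fixed_gt = apply_homalt_hotfix((0, 1), [1, 19], 20, het_non_ref=False, site_af=0.05)
print(fixed_gt)
```

Because the flag depends only on AD, DP, and the original het_non_ref status, it can be recomputed wherever those fields are still available, which is the premise of dropping the stored copies.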

consent_freq_ht.AC[i] > 0,
consent_freq_ht.AC[i]
/ hl.float32(866 * 2), # consent_ans_ht[consent_freq_ht.key].AN[i],
0.0,
Contributor

Do you need to explicitly set AF to 0.0 with this if_else? Could you do something like hl.float64(consent_freq_ht.AC[i] / consent_freq_ht.AN[i])?

Contributor Author

I had this and removed it because I kept seeing missing annotations, but it seems that was another issue. Added it back.
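To make the trade-off in this thread concrete, here is a small plain-Python sketch of the guarded AF calculation. It only mirrors the shape of the if_else in the diff (AC > 0 gives AC / AN, otherwise 0.0); `allele_frequency` is a hypothetical helper, and how Hail itself treats a zero or missing AN is not covered here.

```python
def allele_frequency(ac: int, an: int) -> float:
    """AC / AN when any alternate alleles were seen, otherwise 0.0."""
    # The guard keeps AC == 0 sites defined even if AN is also 0.
    return ac / an if ac > 0 else 0.0


# Denominator hard-coded in the diff: 866 samples * 2 alleles each.
hardcoded_an = 866 * 2

print(allele_frequency(5, hardcoded_an))
print(allele_frequency(0, 0))
```

Using the per-site AN instead of the hard-coded `866 * 2` only changes the second argument; the guard stays the same.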

4. Calculating frequencies and age histograms for consent withdrawal samples
5. Subtracting both frequencies and age histograms from v4 frequency HT
6. Only overwriting fields that were actually updated in the final output
7. Computing FAF, grpmax, gen_anc_faf_max, and inbreeding coefficient
Contributor

Should we also move inbreeding into the coverage/AN code?

Contributor Author

You can just use AC/AN/hom alt for inbreeding now: https://github.com/broadinstitute/gnomad_methods/blob/52a49c8615028b0f9baa5aaf21b8dfd9a30dadae/gnomad/utils/annotations.py#L1046
So I think it can stay here, since we'll have access to all of those.
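For context, a plain-Python sketch of how a site-level inbreeding coefficient can be derived from AC, AN, and the hom-alt count alone. It assumes a bi-allelic site and diploid genotypes and uses the standard F = 1 - observed hets / expected hets form; `site_inbreeding_coeff` is a hypothetical name and this is not a copy of the linked gnomad_methods function.

```python
from typing import Optional


def site_inbreeding_coeff(ac: int, an: int, homozygote_count: int) -> Optional[float]:
    """Inbreeding coefficient from call stats; None when it is undefined."""
    if an <= 0:
        return None
    n = an / 2                         # number of called diploid samples
    q = ac / an                        # alternate allele frequency
    p = 1 - q                          # reference allele frequency
    n_het = ac - 2 * homozygote_count  # each hom-alt sample contributes two alt alleles
    expected_het = 2 * p * q * n       # Hardy-Weinberg expectation
    if expected_het == 0:
        return None                    # monomorphic site: no expected hets
    return 1 - n_het / expected_het


# 500 samples, AC = 100: 90 hets and 5 hom-alts sit at Hardy-Weinberg proportions.
print(site_inbreeding_coeff(ac=100, an=1000, homozygote_count=5))
# Same AC carried entirely by hom-alt samples: no hets observed at all.
print(site_inbreeding_coeff(ac=100, an=1000, homozygote_count=50))
```

Since only AC, AN, and the homozygote count are needed, the calculation can run on the updated frequency table after the subtraction step, without access to genotypes.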
