Draft
90 commits
8c13f9a
Add v5 frequency calculation script framework
mike-w-wilson Jul 24, 2025
aec9318
Enhance v5 frequency calculation with coverage integration and ancest…
mike-w-wilson Jul 24, 2025
705c452
Add utility functions for coverage integration
mike-w-wilson Jul 24, 2025
6f30ff2
Rewrite v5 frequency calculation to use v4 frequency HT as base
mike-w-wilson Jul 24, 2025
491143d
Update documentation to reflect correct v5 frequency calculation appr…
mike-w-wilson Jul 24, 2025
40d2560
Update v5 frequency calculation to use VDS approach
mike-w-wilson Jul 24, 2025
bbabdb3
Fix TODOs and uncomment function calls in v5 frequency script
mike-w-wilson Jul 24, 2025
d53cf8e
Update frequency script to our style to increase readability
mike-w-wilson Jul 29, 2025
370693b
Merge branch 'main' into mw/v5_freq_calc
mike-w-wilson Sep 18, 2025
e98ad90
Implement v5 frequency generation with gnomAD consent withdrawal hand…
mike-w-wilson Sep 19, 2025
305b9a0
Remove redundant functions, add proper AN calc
mike-w-wilson Sep 24, 2025
f71f81e
Move to sparse aggs for consent ACs,homalt
mike-w-wilson Sep 26, 2025
57d1910
Add resources and remove unused imports and functions, clarify what d…
mike-w-wilson Sep 29, 2025
9775863
Only do a single pass of the data for freq field aggs
mike-w-wilson Sep 29, 2025
b2711d6
Drop v5 downsamplings from constants
mike-w-wilson Sep 30, 2025
9652a2e
Update annotation resources
mike-w-wilson Sep 30, 2025
028f62a
Update annotation constants
mike-w-wilson Sep 30, 2025
125da25
Correct VDS imported
mike-w-wilson Sep 30, 2025
63f0549
Add remove hard filtered samples false to get around unfound file
mike-w-wilson Sep 30, 2025
9c4796f
Update test args for data test or runtime test
mike-w-wilson Sep 30, 2025
a626890
Update process dataset functions to use utility functions, increasing…
mike-w-wilson Oct 1, 2025
f7e09c6
Merge remote-tracking branch 'origin/main' into mw/v5_freq_calc
mike-w-wilson Oct 1, 2025
18aab39
Correct group membership ht call
mike-w-wilson Oct 1, 2025
709c462
Correct group membership ht calls
mike-w-wilson Oct 1, 2025
ba767e9
Fix filter_partitions call, int to a list
mike-w-wilson Oct 1, 2025
23653a1
Properly handle GATK versions in hom alt depletion correction for gno…
mike-w-wilson Oct 1, 2025
ba4ba85
Mock AN for testing
mike-w-wilson Oct 1, 2025
b31c912
Apply v3 fix to freq hom alt depletion
mike-w-wilson Oct 1, 2025
c605c0e
Update freq calc
mike-w-wilson Oct 1, 2025
a6b1b91
Correct fold in ac hom alt calc
mike-w-wilson Oct 1, 2025
d0f027c
Use agg_by_strata for ac and hom alt for efficiency
mike-w-wilson Oct 1, 2025
1616d47
Add adj to agg by strata
mike-w-wilson Oct 1, 2025
9b14851
Use agg_by_strata for aou freq too
mike-w-wilson Oct 1, 2025
3df9611
Add freq meta to consent freq ht
mike-w-wilson Oct 1, 2025
f1ecabe
USe group_membership's freq _meta
mike-w-wilson Oct 1, 2025
a19e461
Use index globals when transfering global anns
mike-w-wilson Oct 1, 2025
e2f5b14
More global anns to index
mike-w-wilson Oct 1, 2025
8a29a1c
int64 -> int32 in freq struct
mike-w-wilson Oct 1, 2025
624393f
int64 -> int32 for hom alt in freq struct
mike-w-wilson Oct 1, 2025
2f2d4e0
reorder freq struct for join
mike-w-wilson Oct 2, 2025
f3d4edb
Spread out global indexing to avoid chain error
mike-w-wilson Oct 2, 2025
69f6a73
More global rearrangement for merging
mike-w-wilson Oct 2, 2025
c68450a
More global rearrangement for merging
mike-w-wilson Oct 2, 2025
245c324
Typo in global declaration
mike-w-wilson Oct 2, 2025
efe795a
Another global attempt
mike-w-wilson Oct 2, 2025
d830564
Another global attempt
mike-w-wilson Oct 2, 2025
e979fd1
Do not create intermediate HT with new expressions
mike-w-wilson Oct 2, 2025
04848fb
Change to select to get around mismatch
mike-w-wilson Oct 2, 2025
43c9dad
Use hl.literal since index_globals appears to not work
mike-w-wilson Oct 2, 2025
ba3562a
Add show for merge testing
mike-w-wilson Oct 2, 2025
ff0661f
Set negatives to 0 to investigate negatives
mike-w-wilson Oct 2, 2025
0cecd0a
Drop meta print
mike-w-wilson Oct 2, 2025
8fe908a
Filter to consent variants
mike-w-wilson Oct 2, 2025
780da86
Drop second meta print
mike-w-wilson Oct 2, 2025
7ef8502
Copy v3/v4 genomes sex ploidy, adj order
mike-w-wilson Oct 3, 2025
3f3c91d
Add notes for consent hom alt fix approach
mike-w-wilson Oct 3, 2025
020b16c
Reference same vmt for adj, hom alt pass
mike-w-wilson Oct 3, 2025
bc0f4cd
Fix adj annotation
mike-w-wilson Oct 3, 2025
bcacd84
Filter to common sites before doing any work in freq or hists
mike-w-wilson Oct 3, 2025
2fc7b24
Process hists and freqs together
mike-w-wilson Oct 3, 2025
c89185b
Remove unused functions for AoU freq
mike-w-wilson Oct 3, 2025
11e31d1
Fix genotype call in age hists
mike-w-wilson Oct 3, 2025
009e209
Fix gt call in age hists, second attempt
mike-w-wilson Oct 3, 2025
7c8cf39
Age histogram does not expect an integer...
mike-w-wilson Oct 3, 2025
267888b
Remove unused overwrite
mike-w-wilson Oct 3, 2025
321f345
Updated README to new workflow
mike-w-wilson Oct 3, 2025
0269efa
Update gnomad_qc/v5/annotations/generate_frequency.py
mike-w-wilson Oct 6, 2025
8ca5776
Drop hom_alt_fixed since it is always false
mike-w-wilson Oct 6, 2025
31e757a
Testing freq struct
mike-w-wilson Oct 6, 2025
c5acca8
Revert back to regular AF calc for conset AF
mike-w-wilson Oct 6, 2025
43a1e9d
Add freq based calcs back into script and filter freq_ht for tests fi…
mike-w-wilson Oct 8, 2025
cfa4a04
Move freq_ht filtering up during tests
mike-w-wilson Oct 8, 2025
b9dc155
Write out test freq ht to shrink join
mike-w-wilson Oct 8, 2025
80bcb83
Missed row index
mike-w-wilson Oct 8, 2025
b34ed5f
Naive coalesce updated freq HT for test
mike-w-wilson Oct 8, 2025
4dd62c1
Remove unnecessary functions and overly complex approaches
mike-w-wilson Oct 10, 2025
d2bb01a
Merge remote-tracking branch 'origin/main' into mw/v5_freq_calc
mike-w-wilson Oct 30, 2025
5720985
Update to use v5 vds functions
mike-w-wilson Oct 30, 2025
87129d1
Calc freq for consent from scratch
mike-w-wilson Nov 10, 2025
85c1129
pop -> gen_anc
mike-w-wilson Nov 10, 2025
47c210e
Access meta fields in v5
mike-w-wilson Nov 10, 2025
a996eed
Fix group membership dataset call
mike-w-wilson Nov 10, 2025
40d47ff
Drop test from group membership call
mike-w-wilson Nov 10, 2025
9f781f9
Add logic around group membership for testing
mike-w-wilson Nov 12, 2025
6e017fd
Fix diff table expression issue
mike-w-wilson Nov 12, 2025
feba3d4
add troubleshooting logs
mike-w-wilson Nov 12, 2025
8d21e52
Add age hists
mike-w-wilson Nov 13, 2025
eaf0b5a
Pass through subset to annotations root
mike-w-wilson Nov 14, 2025
4252e9b
Update to consent_drop_only in gnomad vds call
mike-w-wilson Nov 14, 2025
57d940d
Update to pull in meta
mike-w-wilson Nov 14, 2025
146 changes: 146 additions & 0 deletions gnomad_qc/v5/annotations/README.md
@@ -0,0 +1,146 @@
# gnomAD v5 Frequency Calculation

This directory contains scripts for calculating variant frequencies and generating age histograms for gnomAD v5.

## Overview

The v5 frequency calculation removes gnomAD samples whose consent has been withdrawn and adds All of Us (AoU) samples to produce the gnomAD v5 frequencies. The script:

1. **Processes gnomAD consent withdrawal samples** by subtracting their frequencies and age histograms from the v4 frequency HT
2. **Processes All of Us samples** by calculating their frequencies using imported AN values from pre-calculated consent data
3. **Merges both datasets** to create the final v5 frequency table with updated FAF/grpmax annotations

## Key Features

### Consent Withdrawal Processing

The script handles gnomAD samples that need to be removed from v5 due to consent withdrawal:

- **Loads v4 frequency HT** as the base dataset containing both frequencies and age histograms
- **Identifies consent withdrawal samples** using the `consent_samples_to_drop` resource
- **Calculates frequencies and age histograms** for consent samples using corrected genotypes (after hom alt depletion fix and sex ploidy adjustment)
- **Subtracts consent data** from v4 frequencies and age histograms using `merge_freq_arrays` with `operation="diff"`
- **Applies FAF and grpmax annotations** to the updated frequency data
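
The subtraction step can be pictured with a small pure-Python sketch. It is not the `merge_freq_arrays` implementation; the `Freq` class, the `subtract_freq` helper, and the `set_negatives_to_zero` flag are names assumed here for illustration only. It shows the idea of a per-stratum "diff": counts are subtracted and AF is re-derived from the new AC and AN.

```python
from dataclasses import dataclass


@dataclass
class Freq:
    """Hypothetical stand-in for one entry of a frequency array."""

    AC: int  # alternate allele count
    AN: int  # total number of called alleles
    homozygote_count: int

    @property
    def AF(self):
        # AF is undefined when no alleles were called in the stratum.
        return self.AC / self.AN if self.AN > 0 else None


def subtract_freq(base, withdrawn, set_negatives_to_zero=True):
    """Subtract withdrawn-sample counts from base counts, stratum by stratum."""
    if len(base) != len(withdrawn):
        raise ValueError("Both arrays must cover the same strata in the same order")
    result = []
    for b, w in zip(base, withdrawn):
        counts = [
            b.AC - w.AC,
            b.AN - w.AN,
            b.homozygote_count - w.homozygote_count,
        ]
        if set_negatives_to_zero:
            # A negative count signals mismatched inputs; clamp rather than propagate.
            counts = [max(c, 0) for c in counts]
        result.append(Freq(*counts))
    return result
```

For example, a stratum with AC=10 and AN=200 minus consent counts of AC=2 and AN=20 leaves AC=8 and AN=180, and AF is recomputed from those values rather than subtracted directly.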

### All of Us Integration

The script processes AoU samples to be added to gnomAD v5:

- **Resolves sample ID collisions** between gnomAD and AoU datasets
- **Filters related samples** that should be excluded
- **Calculates complete frequency structs** using AC/homozygote counts from variant data and AN values imported from `get_consent_ans`
- **Generates age histograms** for AoU samples
- **Uses efficient `agg_by_strata` approach** for frequency calculations

### Quality Corrections Applied

Both datasets undergo consistent quality corrections:

- **Adj filtering** using standard gnomAD quality criteria
- **Hom alt depletion fix** to correct for systematic genotype calling issues
- **Sex ploidy adjustment** for proper handling of X/Y chromosomes
- **Age histogram calculation** using final corrected genotypes to ensure consistency
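
As a rough illustration of what adj filtering checks, the sketch below tests a single genotype against depth, genotype quality, and allele balance thresholds. The thresholds (GQ >= 20, DP >= 10, allele balance >= 0.2 for heterozygous calls) are the commonly published gnomAD defaults and are assumed here rather than taken from this script; the function name and arguments are hypothetical, and haploid-specific depth handling is not modeled.

```python
def is_adj(gq, dp, is_het, alt_ad=None, min_gq=20, min_dp=10, min_het_ab=0.2):
    """Return True if a genotype passes the assumed adj thresholds.

    gq: genotype quality; dp: read depth; is_het: whether the call is
    heterozygous; alt_ad: reads supporting the alternate allele.
    """
    if gq < min_gq or dp < min_dp:
        return False
    if is_het:
        # Allele balance only applies to heterozygous calls.
        if alt_ad is None or dp == 0:
            return False
        return alt_ad / dp >= min_het_ab
    return True
```

A het call with 2 alternate reads out of 20 has an allele balance of 0.1 and fails, while the same depth with 10 alternate reads passes.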

## Usage

### Process gnomAD Consent Withdrawals

```bash
python gnomad_qc/v5/annotations/generate_frequency.py \
--process-gnomad \
--environment production
```

### Process All of Us Dataset

```bash
python gnomad_qc/v5/annotations/generate_frequency.py \
--process-aou \
--environment production
```

### Merge Both Datasets

```bash
python gnomad_qc/v5/annotations/generate_frequency.py \
--merge-datasets \
--environment production
```

### Testing Mode

```bash
python gnomad_qc/v5/annotations/generate_frequency.py \
--process-gnomad \
--data-test \
--environment development
```

## Output Files

The script generates the following frequency tables and age histograms:

- **gnomAD frequency table**: v4 frequencies with consent withdrawals removed, includes embedded age histograms
- **AoU frequency table**: Complete frequencies for AoU samples, with a separate age histogram table
- **Merged frequency table**: Combined dataset with complete frequency data and age histograms

## Methodology

### gnomAD Consent Withdrawal Processing

1. **Sample Identification**: Uses `consent_samples_to_drop` resource
2. **VDS Preparation**: Loads v4 genomes VDS and filters to consent samples
3. **Quality Corrections**: Applies adj filtering, hom alt depletion fix, and sex ploidy adjustment
4. **Frequency Calculation**: Uses `agg_by_strata` for efficient frequency computation
5. **Age Histograms**: Calculated using corrected genotypes to match frequency calculations
6. **Subtraction**: Removes consent data from v4 frequencies using `merge_freq_arrays`
7. **Post-processing**: Adds FAF, grpmax, and inbreeding coefficient annotations

### All of Us Processing

1. **Sample Collision Resolution**: Handles overlapping sample IDs with gnomAD
2. **Relatedness Filtering**: Removes related samples using v5 relatedness criteria
3. **Variant Data Processing**: Calculates AC and homozygote counts from sparse variant data
4. **AN Import**: Uses pre-calculated AN values from `get_consent_ans` (calculated by separate script)
5. **Complete Frequency Struct**: Builds proper frequency arrays with AC, AF, AN, homozygote_count
6. **Age Histograms**: Calculated separately for AoU samples
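
Steps 3 to 5 can be sketched as follows: per-stratum AC and homozygote counts come from the variant data, AN arrives from a separate source, and AF is derived from the two. This is a minimal pure-Python sketch; `build_freq_structs` is a hypothetical helper, and the plain lists stand in for the per-stratum arrays the real tables hold.

```python
def build_freq_structs(ac, hom, an):
    """Combine per-stratum AC, homozygote counts, and imported AN values.

    All three lists must be ordered by the same strata.
    """
    if not (len(ac) == len(hom) == len(an)):
        raise ValueError("AC, homozygote, and AN arrays must have equal length")
    return [
        {
            "AC": c,
            # Leave AF missing when no alleles were called in the stratum.
            "AF": c / n if n else None,
            "AN": n,
            "homozygote_count": h,
        }
        for c, h, n in zip(ac, hom, an)
    ]
```

Because AN is computed elsewhere, the length check matters: a stratum ordering mismatch between the imported AN array and the locally computed counts would otherwise produce plausible but wrong AF values.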

### Dataset Merging

1. **Frequency Merging**: Combines gnomAD and AoU frequencies using `merge_freq_arrays` with `operation="sum"`
2. **Age Histogram Merging**: Combines age histograms using `merge_histograms` with `operation="sum"`
3. **Global Metadata**: Merges frequency metadata and sample counts from both datasets
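
For the age histogram step, a sum merge amounts to adding bin counts elementwise once both histograms share the same bin edges. The sketch below is a pure-Python stand-in, not the `merge_histograms` implementation; the function name `combine_age_hists` is hypothetical, and the field names (`bin_edges`, `bin_freq`, `n_smaller`, `n_larger`) follow the layout of Hail's histogram aggregator result, which is assumed to be the structure in use.

```python
def combine_age_hists(a, b, operation="sum"):
    """Add (or subtract) two histograms that share identical bin edges."""
    if operation not in ("sum", "diff"):
        raise ValueError("operation must be 'sum' or 'diff'")
    if a["bin_edges"] != b["bin_edges"]:
        raise ValueError("Histograms must share the same bin edges")
    sign = 1 if operation == "sum" else -1
    return {
        "bin_edges": list(a["bin_edges"]),
        "bin_freq": [x + sign * y for x, y in zip(a["bin_freq"], b["bin_freq"])],
        # Counts falling below the first or above the last bin edge.
        "n_smaller": a["n_smaller"] + sign * b["n_smaller"],
        "n_larger": a["n_larger"] + sign * b["n_larger"],
    }
```

The same helper with `operation="diff"` mirrors the consent withdrawal case, where the withdrawn samples' histogram is removed from the base histogram.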

## Key Dependencies

- **v4 genomes VDS**: Source data for consent withdrawal processing
- **v4 frequency HT**: Base frequency data with embedded age histograms
- **AoU VDS**: All of Us variant dataset
- **Consent AN data**: Pre-calculated allele numbers from `get_consent_ans`
- **Sample resources**: `consent_samples_to_drop`, `related_samples_to_drop`, `sample_id_collisions`
- **Group membership tables**: For frequency stratification

## Technical Details

### Genotype Processing Order

To keep frequencies and age histograms consistent, genotypes are processed in this order:

1. **Adj annotation**: Quality filtering applied first
2. **Hom alt depletion fix**: Corrects systematic calling issues (requires call expression)
3. **Sex ploidy adjustment**: Handles X/Y chromosome ploidy (converts to integer for frequency calc)
4. **Age histogram calculation**: Uses corrected, sex-adjusted genotypes
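
The two genotype corrections in this order can be sketched on plain tuples of allele indices. This is an illustrative sketch, not the script's code: the function names are hypothetical, the cutoffs (AF > 0.01, allele balance > 0.9) are the values published for the gnomAD v3 hom alt fix and are assumed to apply here, and the ploidy rules assume split biallelic calls with a precomputed `in_par` flag.

```python
def hom_alt_depletion_fix(gt, af, ab, af_cutoff=0.01, ab_cutoff=0.9):
    """Recode a likely miscalled het as hom alt at common sites."""
    is_het = gt is not None and len(gt) == 2 and gt[0] != gt[1]
    if is_het and af is not None and ab is not None:
        if af > af_cutoff and ab > ab_cutoff:
            alt = max(gt)  # assumes a split biallelic call such as (0, 1)
            return (alt, alt)
    return gt


def adjust_sex_ploidy(gt, karyotype, contig, in_par):
    """Adjust a diploid call to the expected ploidy on chrX/chrY non-PAR."""
    if gt is None or in_par or contig not in ("chrX", "chrY"):
        return gt
    if karyotype == "XX":
        # No chrY calls are expected for XX samples.
        return None if contig == "chrY" else gt
    # XY sample in a non-PAR region: expect a haploid call.
    if len(gt) == 2:
        if gt[0] != gt[1]:
            return None  # a het call is not interpretable on a haploid region
        return (gt[0],)
    return gt


def correct_genotype(gt, af, ab, karyotype, contig, in_par):
    """Apply the hom alt fix first, then the sex ploidy adjustment."""
    fixed = hom_alt_depletion_fix(gt, af, ab)
    return adjust_sex_ploidy(fixed, karyotype, contig, in_par)
```

The ordering matters in this sketch: a high allele balance het on non-PAR chrX in an XY sample becomes hom alt first and then a haploid alt call, whereas applying the ploidy adjustment first would set it to missing.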

### Frequency Stratification

- **Sex karyotype**: XX/XY stratification
- **Genetic ancestry**: Stratification by genetic ancestry group
- **Dataset source**: gnomAD vs AoU identification
- **Quality filtering**: adj vs raw call stratification
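
One way to picture stratification is as a list of metadata dictionaries, one per stratum, plus a per-sample boolean array recording which strata the sample belongs to. The sketch below assumes that layout; the key names (`group`, `gen_anc`, `sex`, `dataset`), the ordering of strata, and both helper functions are hypothetical and may not match the group membership tables the script uses.

```python
from itertools import product


def build_freq_meta(gen_ancs, sexes, datasets):
    """List one metadata dict per stratum, in a fixed order."""
    meta = [{"group": "adj"}, {"group": "raw"}]
    meta += [{"group": "adj", "gen_anc": g} for g in gen_ancs]
    meta += [{"group": "adj", "sex": s} for s in sexes]
    meta += [{"group": "adj", "gen_anc": g, "sex": s} for g, s in product(gen_ancs, sexes)]
    meta += [{"group": "adj", "dataset": d} for d in datasets]
    return meta


def group_membership(sample, freq_meta):
    """Return one boolean per stratum for a sample's annotations.

    The adj/raw distinction is applied per genotype, not per sample, so the
    "group" key is ignored when matching sample-level annotations.
    """
    return [
        all(sample.get(k) == v for k, v in m.items() if k != "group")
        for m in freq_meta
    ]
```

Keeping the metadata list and the membership array in the same order is what lets a single pass over the genotypes fill every stratum's counts at once.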

## Notes

- **AN Calculation**: AoU frequencies use AN values calculated by a separate script and imported via `get_consent_ans`
- **Age Histogram Consistency**: Calculated after all genotype corrections to match frequency calculations
- **Sex Chromosome Handling**: Proper ploidy adjustment ensures correct het/hom classification
- **Memory Efficiency**: Uses sparse matrix operations and checkpointing to handle large datasets
1 change: 1 addition & 0 deletions gnomad_qc/v5/annotations/__init__.py
@@ -0,0 +1 @@
# noqa: D104