Add section discussing SAIGE prototype #173

jeromekelleher · 2024-10-24T12:58:08Z

We should compare with VCF, BCF and Savvy (I think SAIGE supports this), using the simulated data we already have. We don't have to go up to 1M samples, just whatever seems reasonable. We can plot the time to run the association test based on traits simulated by tstrait against increasing sample size, as usual.

Will-Tyler · 2025-01-10T02:04:17Z

I ran the SAIGE prototype again today.

Steps

I simulate a phenotype using tstrait.

model = tstrait.trait_model(distribution='normal', mean=0, var=1)
sim_result = tstrait.sim_phenotype(ts=ts, model=model, h2=0.3)

ts is the tree sequence for the simulated data. This produces the following trait:

position  site_id  effect_size causal_allele  allele_freq  trait_id
0  19396110   360834     0.867602             C      0.90283         0

I create PLINK files required for the first step of the single-variant association test.
```
plink2 --vcf ~/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.vcf.gz --make-bed --out ./input/chr21_10_4 --max-alleles 2
```
The --max-alleles 2 option tells plink2 to ignore multiallelic sites.

We can now run the first step of the single-variant association test.

Rscript step1_fitNULLGLMM.R     \
     --plinkFile=./input/chr21_10_4  \
     --useSparseGRMtoFitNULL=FALSE    \
     --phenoFile=./input/chr21_10_4.phenotypes.txt \
     --phenoCol=phenotype \
     --sampleIDColinphenoFile=sample_id \
     --invNormalize=TRUE     \
     --traitType=quantitative        \
     --outputPrefix=./output/chr21_10_4_quantitative \
     --nThreads=24   \
     --IsOverwriteVarianceRatioFile=TRUE

I run the second step of the single-variant association test with VCF input.

Rscript step2_SPAtests.R        \
     --vcfFile=/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.bcf \
     --vcfFileIndex=/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.bcf.csi \
     --vcfField=GT   \
     --SAIGEOutputFile=./output/chr21_10_4_quantitative_bcf.txt \
     --chrom=1       \
     --minMAF=0 \
     --minMAC=20 \
     --GMMATmodelFile=./output/chr21_10_4_quantitative.rda \
     --varianceRatioFile=./output/chr21_10_4_quantitative.varianceRatio.txt  \
     --is_Firth_beta=TRUE    \
     --pCutoffforFirth=0.05 \
     --is_output_moreDetails=TRUE    \
     --LOCO=FALSE

And with VCZ input.

Rscript step2_SPAtests.R        \
     --vczFile=/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.zarr \
     --vcfField=GT   \
     --SAIGEOutputFile=./output/chr21_10_4_quantitative_vcz.txt \
     --chrom=1       \
     --minMAF=0 \
     --minMAC=20 \
     --GMMATmodelFile=./output/chr21_10_4_quantitative.rda \
     --varianceRatioFile=./output/chr21_10_4_quantitative.varianceRatio.txt  \
     --is_Firth_beta=TRUE    \
     --pCutoffforFirth=0.05 \
     --is_output_moreDetails=TRUE    \
     --LOCO=FALSE

Results

First, I confirm that the numeric columns in the outputs match.

>>> import pandas as pd
>>> vcz = pd.read_csv('output/chr21_10_4_quantitative_vcz.txt', sep='\t')
>>> bcf = pd.read_csv('output/chr21_10_4_quantitative_bcf.txt', sep='\t')
>>> vcz.shape == bcf.shape
True
>>> for column in vcz.columns:
...   print(column, all(vcz[column] == bcf[column]))
... 
CHR True
POS True
MarkerID False
Allele1 False
Allele2 False
AC_Allele2 True
AF_Allele2 True
MissingRate True
BETA True
SE True
Tstat True
var True
p.value True
N True

Second, I look at the locus with the causal allele in SAIGE's output.

bcf[bcf.POS == 19396110]
       CHR       POS  MarkerID Allele1 Allele2  AC_Allele2  AF_Allele2  MissingRate      BETA        SE    Tstat      var   p.value      N
42062    1  19396110    132562       T       C       18173     0.90865            0 -0.022752  0.024487 -37.9448  1667.78  0.352815  10000

Finally, I record the execution time.

For VCF.

   user  system elapsed 
214.751   1.948 217.767

For VCZ.

   user  system elapsed 
375.921  21.119 368.306

Discussion

The high p-value makes me wonder if I am doing something incorrectly... At least the VCZ output matches the VCF output.

I wanted to try using Savvy-formatted input. However, I am not able to install Savvy. Neither the cget method or the conda method is working for me.

The execution time results are different from Will-Tyler/SAIGE#1 (comment). There are several differences between those results and these results. First, I pushed a commit after reporting the first results. Second, I was testing with a binary phenotype whereas here I am testing with a quantitative phenotype. Third, my notes and shell history from running SAIGE the first time some months ago are incomplete but indicate that I used a different strategy to run the first step of the single-variant test.

Hopefully, you can tell me what improvements you would like to see, and I can see what I can do.

Environment details

Conda environment

Apple M1 machine with arm64 architecture, macOS 15.2, and 8192 MiB RAM.

jeromekelleher · 2025-01-10T10:00:38Z

Thanks @Will-Tyler. We're going to need to make this reproducible for the paper, so I think it's worthwhile thinking about how best to package this up into a runnable analysis now.

I guess the simplest thing to do would be to have SAIGE-protoype directory with a Makefile, which clones your repo, and then builds the code locally? I guess we'd need a shell script then to do the actual running of the thing, plus a notebook to document the process?

jeromekelleher mentioned this issue Jan 8, 2025

Resubmission checklist #193

Open

13 tasks

Will-Tyler mentioned this issue Jan 12, 2025

Add notebook to demonstrate SAIGE prototype #200

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add section discussing SAIGE prototype #173

Add section discussing SAIGE prototype #173

jeromekelleher commented Oct 24, 2024 •

edited

Loading

Will-Tyler commented Jan 10, 2025 •

edited

Loading

jeromekelleher commented Jan 10, 2025

Add section discussing SAIGE prototype #173

Add section discussing SAIGE prototype #173

Comments

jeromekelleher commented Oct 24, 2024 • edited Loading

Will-Tyler commented Jan 10, 2025 • edited Loading

Steps

Results

Discussion

Environment details

jeromekelleher commented Jan 10, 2025

jeromekelleher commented Oct 24, 2024 •

edited

Loading

Will-Tyler commented Jan 10, 2025 •

edited

Loading