Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add section discussing SAIGE prototype #173

Open
Tracked by #193
jeromekelleher opened this issue Oct 24, 2024 · 2 comments
Open
Tracked by #193

Add section discussing SAIGE prototype #173

jeromekelleher opened this issue Oct 24, 2024 · 2 comments

Comments

@jeromekelleher
Copy link
Contributor

jeromekelleher commented Oct 24, 2024

Based on this code: Will-Tyler/SAIGE#1

We should compare with VCF, BCF and Savvy (I think SAIGE supports this), using the simulated data we already have. We don't have to go up to 1M samples, just whatever seems reasonable. We can plot the time to run the association test based on traits simulated by tstrait against increasing sample size, as usual.

@Will-Tyler
Copy link
Contributor

Will-Tyler commented Jan 10, 2025

I ran the SAIGE prototype again today.

Steps

  1. I simulate a phenotype using tstrait.
    model = tstrait.trait_model(distribution='normal', mean=0, var=1)
    sim_result = tstrait.sim_phenotype(ts=ts, model=model, h2=0.3)
    ts is the tree sequence for the simulated data. This produces the following trait:
    position  site_id  effect_size causal_allele  allele_freq  trait_id
    0  19396110   360834     0.867602             C      0.90283         0
    
  2. I create PLINK files required for the first step of the single-variant association test.
    plink2 --vcf ~/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.vcf.gz --make-bed --out ./input/chr21_10_4 --max-alleles 2
    
    The --max-alleles 2 option tells plink2 to ignore multiallelic sites.
  3. We can now run the first step of the single-variant association test.
    Rscript step1_fitNULLGLMM.R     \
         --plinkFile=./input/chr21_10_4  \
         --useSparseGRMtoFitNULL=FALSE    \
         --phenoFile=./input/chr21_10_4.phenotypes.txt \
         --phenoCol=phenotype \
         --sampleIDColinphenoFile=sample_id \
         --invNormalize=TRUE     \
         --traitType=quantitative        \
         --outputPrefix=./output/chr21_10_4_quantitative \
         --nThreads=24   \
         --IsOverwriteVarianceRatioFile=TRUE
    
  4. I run the second step of the single-variant association test with VCF input.
    Rscript step2_SPAtests.R        \
         --vcfFile=/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.bcf \
         --vcfFileIndex=/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.bcf.csi \
         --vcfField=GT   \
         --SAIGEOutputFile=./output/chr21_10_4_quantitative_bcf.txt \
         --chrom=1       \
         --minMAF=0 \
         --minMAC=20 \
         --GMMATmodelFile=./output/chr21_10_4_quantitative.rda \
         --varianceRatioFile=./output/chr21_10_4_quantitative.varianceRatio.txt  \
         --is_Firth_beta=TRUE    \
         --pCutoffforFirth=0.05 \
         --is_output_moreDetails=TRUE    \
         --LOCO=FALSE
    
    And with VCZ input.
    Rscript step2_SPAtests.R        \
         --vczFile=/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.zarr \
         --vcfField=GT   \
         --SAIGEOutputFile=./output/chr21_10_4_quantitative_vcz.txt \
         --chrom=1       \
         --minMAF=0 \
         --minMAC=20 \
         --GMMATmodelFile=./output/chr21_10_4_quantitative.rda \
         --varianceRatioFile=./output/chr21_10_4_quantitative.varianceRatio.txt  \
         --is_Firth_beta=TRUE    \
         --pCutoffforFirth=0.05 \
         --is_output_moreDetails=TRUE    \
         --LOCO=FALSE
    

Results

First, I confirm that the numeric columns in the outputs match.

>>> import pandas as pd
>>> vcz = pd.read_csv('output/chr21_10_4_quantitative_vcz.txt', sep='\t')
>>> bcf = pd.read_csv('output/chr21_10_4_quantitative_bcf.txt', sep='\t')
>>> vcz.shape == bcf.shape
True
>>> for column in vcz.columns:
...   print(column, all(vcz[column] == bcf[column]))
... 
CHR True
POS True
MarkerID False
Allele1 False
Allele2 False
AC_Allele2 True
AF_Allele2 True
MissingRate True
BETA True
SE True
Tstat True
var True
p.value True
N True

Second, I look at the locus with the causal allele in SAIGE's output.

bcf[bcf.POS == 19396110]
       CHR       POS  MarkerID Allele1 Allele2  AC_Allele2  AF_Allele2  MissingRate      BETA        SE    Tstat      var   p.value      N
42062    1  19396110    132562       T       C       18173     0.90865            0 -0.022752  0.024487 -37.9448  1667.78  0.352815  10000

Finally, I record the execution time.

For VCF.

   user  system elapsed 
214.751   1.948 217.767

For VCZ.

   user  system elapsed 
375.921  21.119 368.306

Discussion

The high p-value makes me wonder if I am doing something incorrectly... At least the VCZ output matches the VCF output.

I wanted to try using Savvy-formatted input. However, I am not able to install Savvy. Neither the cget method or the conda method is working for me.

The execution time results are different from Will-Tyler/SAIGE#1 (comment). There are several differences between those results and these results. First, I pushed a commit after reporting the first results. Second, I was testing with a binary phenotype whereas here I am testing with a quantitative phenotype. Third, my notes and shell history from running SAIGE the first time some months ago are incomplete but indicate that I used a different strategy to run the first step of the single-variant test.

Hopefully, you can tell me what improvements you would like to see, and I can see what I can do.

Environment details

Conda environment

Apple M1 machine with arm64 architecture, macOS 15.2, and 8192 MiB RAM.

@jeromekelleher
Copy link
Contributor Author

Thanks @Will-Tyler. We're going to need to make this reproducible for the paper, so I think it's worthwhile thinking about how best to package this up into a runnable analysis now.

I guess the simplest thing to do would be to have SAIGE-protoype directory with a Makefile, which clones your repo, and then builds the code locally? I guess we'd need a shell script then to do the actual running of the thing, plus a notebook to document the process?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants