You can download summary statistics generated by applying OAPRS to the International Genomics of Alzheimer's Project (IGAP) data, with sample overlap adjustment for Alzheimer's Disease Neuroimaging Initiative (ADNI) genotypes.
OAPRS is designed to guide the adjustment of sample overlap bias when building PRS, so that the resulting scores are not overfitted.
OAPRS consists of four main steps: (1) summary information preparation, (2) sample overlap adjustment, (3) PRS construction, and (4) validation using visual diagnostics.
Before the preparation step, GWAS summary statistics must be generated from the genotypes of only the overlapping individuals in the target data, using standard GWAS software.
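As a minimal sketch of the prerequisite above, the overlapping individuals can be identified by intersecting sample IDs before running the GWAS. The data frames, the ID values, and the file name overlap.keep are hypothetical and used only for illustration; the FID and IID columns follow the PLINK .fam convention. This step is not part of OAPRS itself.

```r
# Hypothetical sample tables; in practice these would come from the
# consortium's sample list and the target PLINK .fam file.
consortium_ids <- data.frame(FID = c("f1", "f2", "f3", "f4"),
                             IID = c("s1", "s2", "s3", "s4"),
                             stringsAsFactors = FALSE)
target_ids <- data.frame(FID = c("f3", "f4", "f5"),
                         IID = c("s3", "s4", "s5"),
                         stringsAsFactors = FALSE)

# Keep only individuals present in both data sets.
overlap <- merge(consortium_ids, target_ids, by = c("FID", "IID"))

# Write a two-column FID/IID file without a header, the layout that
# PLINK's --keep option accepts.
keep_file <- file.path(tempdir(), "overlap.keep")
write.table(overlap, keep_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

The resulting file could then be passed to the GWAS software of your choice to restrict the analysis to the overlapping samples.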
OAPRS assumes the target data are in PLINK binary file format.
For a list of functions and detailed options in OAPRS, please refer to the manual: OAPRS Manual
data.table, dplyr, ggplot2, RcppArmadillo, Rcpp (>= 1.0.9)
R.utils is necessary for reading gzipped files:
install.packages('R.utils')
You can install the OAPRS package from the OAPRS GitHub repository using devtools:
devtools::install_github('leelabsg/OAPRS')
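To see which of the dependencies listed above are already available, a check along the following lines may help. It uses only base R; adding devtools to the list and leaving the install call commented out are choices made for this sketch.

```r
# Packages named in the dependency list above, plus devtools for installation.
deps <- c("data.table", "dplyr", "ggplot2", "RcppArmadillo", "Rcpp",
          "R.utils", "devtools")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!installed]

if (length(missing_deps) > 0) {
  message("Missing packages: ", paste(missing_deps, collapse = ", "))
  # install.packages(missing_deps)  # uncomment to install from CRAN
}
```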
Example files are found in the extdata directory of the OAPRS package repository.
Let's read partial sample summary statistics from a large-scale genetic consortium with Check_Sums.
data_path = system.file("extdata/example",package = "OAPRS")
Create the column names first. By setting the cols variable, OAPRS reformats the summary statistics for further use.
cols = c(BETA="beta",Pval="pval",CHR="chrom",POS="pos",REF="ref",ALT="alt",SNP="rsids")
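To illustrate what such a mapping expresses, the sketch below renames the columns of a toy data frame in base R using the same named vector. The toy values are made up, and this only illustrates the name-to-column mapping; it is not the internal implementation of Check_Sums.

```r
cols <- c(BETA = "beta", Pval = "pval", CHR = "chrom", POS = "pos",
          REF = "ref", ALT = "alt", SNP = "rsids")

# Toy summary statistics with the original, file-specific column names.
ss <- data.frame(chrom = 1, pos = 12345, ref = "A", alt = "G",
                 rsids = "rs0000001", beta = 0.02, pval = 0.4,
                 stringsAsFactors = FALSE)

# The names of cols are the standardized names; the values are the
# column names as they appear in the file.
idx <- match(cols, names(ss))
stopifnot(!anyNA(idx))  # every mapped column must exist in the file
names(ss)[idx] <- names(cols)

print(names(ss))
```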
We can specify the summary statistics file path, genome build, and population. If the file has no sample size column, Spcf_n, Spcf_n_case, and Spcf_n_ctrl are required.
cs = Check_Sums(paste0(data_path,'/consortium.ss'),
Genome_Build = "hg37", Pop = "eas",
cols=cols,
Spcf_n=249625,Spcf_n_case = 50466, Spcf_n_ctrl = 199159)
Similarly, we can format the target summary statistics.
cols = c(BETA="beta",Pval="pval",CHR="chrom",POS="pos",REF="ref",ALT="alt",SNP="rsids",SE="sebeta")
ts = Check_Sums(paste0(data_path,'/target.ss'),
Genome_Build = "hg37", Pop = "eas",
cols=cols,
Spcf_n=72210,Spcf_n_case = 5083, Spcf_n_ctrl = 62127)
With the formatted summary statistics cs and ts, we can build adjusted summary statistics using exclude_overlap.
adj_ss = exclude_overlap(cs,ts,"adj.txt",phenotype="binary")
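For intuition about what an overlap adjustment can look like, the sketch below removes one subset's contribution from a fixed-effect inverse-variance-weighted (IVW) meta-analysis estimate. Whether exclude_overlap computes its IVW columns in exactly this way is an assumption; the helper names ivw_combine and ivw_subtract and the toy numbers are hypothetical.

```r
# Fixed-effect IVW meta-analysis of two independent estimates.
ivw_combine <- function(b1, se1, b2, se2) {
  w1 <- 1 / se1^2
  w2 <- 1 / se2^2
  list(beta = (w1 * b1 + w2 * b2) / (w1 + w2),
       se   = sqrt(1 / (w1 + w2)))
}

# Inverse operation: remove the target contribution from the combined
# estimate, leaving the estimate for the non-overlapping samples.
ivw_subtract <- function(b_all, se_all, b_t, se_t) {
  w_all <- 1 / se_all^2
  w_t   <- 1 / se_t^2
  stopifnot(all(w_all > w_t))  # the subset cannot carry more weight than the whole
  list(beta = (w_all * b_all - w_t * b_t) / (w_all - w_t),
       se   = sqrt(1 / (w_all - w_t)))
}

# Toy example: combine two estimates, then recover the first one.
rest   <- list(beta = 0.10, se = 0.02)
target <- list(beta = 0.04, se = 0.05)
all_ss <- ivw_combine(rest$beta, rest$se, target$beta, target$se)
recovered <- ivw_subtract(all_ss$beta, all_ss$se, target$beta, target$se)
```

In this toy case the recovered effect size and standard error match the non-overlapping estimate up to floating-point error, which is the property an overlap adjustment of this kind relies on.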
In this example, prscs is applied. To run prscs, we subset the adjusted summary statistics into its input format.
library(dplyr)
write.table(adj_ss %>% select(SNP,A1=ALT,A2=REF,BETA=BETA_all,P=P_all),paste0(data_path,"/adj_ss_all.txt"),quote = F, col.names = T, row.names = F)
write.table(adj_ss %>% select(SNP,A1=ALT,A2=REF,BETA=BETA_IVW,P=P_IVW),paste0(data_path,"/adj_ss_IVW.txt"),quote = F, col.names = T, row.names = F)
write.table(adj_ss %>% select(SNP,A1=ALT,A2=REF,BETA=BETA_RZ,P=P_RZ),paste0(data_path,"/adj_ss_RZ.txt"),quote = F, col.names = T, row.names = F)
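The three calls above differ only in the method suffix, so they could also be written as a loop. The sketch below uses a toy stand-in for adj_ss with the column names used above and writes to a temporary directory; the toy values and the out_dir variable are made up for illustration.

```r
# Toy stand-in for the adjusted summary statistics.
adj_ss <- data.frame(SNP = c("rsA", "rsB"), ALT = c("G", "T"), REF = c("A", "C"),
                     BETA_all = c(0.010, -0.020), P_all = c(0.50, 0.10),
                     BETA_IVW = c(0.012, -0.018), P_IVW = c(0.45, 0.12),
                     BETA_RZ  = c(0.011, -0.019), P_RZ  = c(0.47, 0.11),
                     stringsAsFactors = FALSE)

out_dir <- tempdir()
for (m in c("all", "IVW", "RZ")) {
  # Pick the effect size and p-value columns for this method by name.
  out <- data.frame(SNP  = adj_ss$SNP,
                    A1   = adj_ss$ALT,
                    A2   = adj_ss$REF,
                    BETA = adj_ss[[paste0("BETA_", m)]],
                    P    = adj_ss[[paste0("P_", m)]])
  write.table(out, file.path(out_dir, paste0("adj_ss_", m, ".txt")),
              quote = FALSE, col.names = TRUE, row.names = FALSE)
}
```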
Here is example usage of prscs for each adjustment method.
for i in all IVW RZ
do
PRScs.py \
--ref_dir=snpinfo_1kg_hm3_eas.gz \
--bim_prefix=target \
--sst_file=adj_ss_${i}.txt \
--n_gwas=177415 \
--out_dir=${outdir}/${i}
done
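The --n_gwas value in the loop above equals the consortium sample size minus the target sample size given in the Check_Sums calls (Spcf_n = 249625 and Spcf_n = 72210). Reading it as the sample size that remains after removing the overlapping target samples is an interpretation of these numbers, not something stated explicitly. The arithmetic can be checked in R:

```r
n_consortium <- 249625  # Spcf_n passed for the consortium summary statistics
n_target     <- 72210   # Spcf_n passed for the target summary statistics

# Sample size left once the overlapping target samples are removed.
n_gwas <- n_consortium - n_target
print(n_gwas)
```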
Genome build and population information are essential for variant filtering in the visual diagnostics.
lds = Marker_select_ld(adj_ss, Genome_Build="hg37", Pop = "eas")
Using the result from step 4 and the PRS weights from step 3, scores are evaluated for the corresponding sample IDs. You can either specify the platform that produced the PRS weights or specify the column names.
prs_res_paths=c(paste0(data_path,'/prscs_all_chr_unadj.txt'),
paste0(data_path,'/prscs_all_chr_IVW.txt'),paste0(data_path,'/prscs_all_chr_RZ.txt'))
scr = score_eval(prs_res_paths,lds, target_path = paste0(data_path,"/target"),platform="prscs",
pheno_path = paste0(data_path,"/target.pheno"),ID_col="V2",pheno_col="V3")
To use individual-level PRS, select the columns containing "ref" in the result.
scr %>% select(IID, contains('ref'))
Let's make a visual diagnostics plot with the generated scores. The legend labels are set using the method_names option.
diagnostic_plt(score=scr,title="test",
Output_Plot_path="~/plot.png",method_names = c("all","IVW","RZ"))