Real data analysis in ADNI and IGAP dataset
-
Please check Section 4 in the manuscript for our data pre-processing.
sum_stat.csv
: summary statistics between snp and outcome.gene_exp.csv
: individual data for gene expression, usually obtained by reference panel.snp.csv
: individual data for snps, usually obtained by reference panel.
sum_stat = pd.read_csv(dir_name+"/sum_stat.csv", sep=' ', index_col=0)
gene_exp = -pd.read_csv(dir_name+"/gene_exp.csv", sep=' ', index_col=0)
snp = pd.read_csv(dir_name+"/snp.csv", sep=' ', index_col=0)
- Remove collinear snps
snp, valid_cols = calculate_vif_(snp, thresh=2.5)
sum_stat = sum_stat.loc[valid_cols]
Note that thresh=2.5
is suggested in this ref.
- Convert data to fit our package
n1, n2, p = len(gene_exp), 54162, snp.shape[1]
LD_Z1, cov_ZX1 = np.dot(snp.values.T, snp.values), np.dot(snp.values.T, gene_exp.values.flatten())
LD_Z2, cov_ZY2 = LD_Z1/n1*n2, sum_stat.values.flatten()*n2
Note that we compute LD matrix by the reference panel for BOTH stages 1 and 2 models.
-
Define the method
reg_model
: sparse regression method (stage 2: invalid IVs) you want to use. Here we useLO_IC
, i.e., usingSCAD
to generate candidate models, the usingOLS
for each candidate model and usingBIC
to select the best one.data_in_slice
: number of data in each slice. It is a tuning parameter for sliced inverse regression (SIR), here we usedata_in_slice=0.2*n1
, alternatively, we specifyslice = 5
for SIR.Ks
: range of candidate models. We bound it byint(p/2)-1
to satisfy the identifibility condition for invalid IVs regression.
Ks = range(int(p/2)-1)
reg_model = L0_IC(fit_intercept=False, alphas=10**np.arange(-3,3,.3), Ks=Ks, max_iter=10000, refit=False, find_best=False)
SIR = _2SIR(sparse_reg=reg_model, data_in_slice=0.2*n1)
- Fit the sliced inverse regression for the stage 1 model
- Using the reference panel to fit the stage 1 model.
## Stage-1 fit theta
SIR.fit_theta(Z1=snp.values, X1=gene_exp.values.flatten())
print('\n##### Causal Model of %s #####' %gene_code)
print('-'*20)
print('Estimated 2SIR Stage 1 model: \n theta: \n %s; \n link: \n %s' %(SIR.theta, 'SIR.link'))
##### Causal Model of ACTN4 #####
--------------------
Estimated 2SIR Stage 1 model:
theta:
[0.6257 -0.0177 0.1491 -0.0347 0.0795 -0.0730 -0.2680 0.0991 0.1083 0.3236
-0.0579 -0.1197 -0.0384 -0.0614 -0.0104 -0.0327 0.2072 0.2560 -0.1272
0.2486 -0.2207 -0.0973 0.0885 0.1211 -0.1582 -0.1811 0.1612];
link:
SIR.link
* Save the stage 1 model for **Pre-train** usage.
```python
## save stage 1 theta:
np.save(gene_code+'stage1_theta', SIR.theta)
# This is how to load
# np.load(gene_code+'stage1_theta.npy')
## save stage 1 link function:
import pickle
pickle.dump(SIR.link, gene_code+'stage1_link')
# This is how to load
# SIR.link = pickle.load(gene_code+'stage1_link')
- Fit the sparse regression for the stage 2 model
## Stage-2 fit beta
SIR.fit_beta(LD_Z2, cov_ZY2, n2)
- Once
theta
andbeta
are estimated, we are ready to conduct a hypothesis testing forbeta == 0
.
SIR.test_effect(n2, LD_Z2, cov_ZY2)
print('2SIR beta: %.3f' %SIR.beta)
print('p-value based on 2SIR: %.5f' %SIR.p_value)
- We can construct CI as well, but it is computational expensive.
SIR.CI_beta(n1, n2, Z1=snp.values, X1=gene_exp.values.flatten(),
B_sample=1000,
LD_Z2=LD_Z2, cov_ZY2=cov_ZY2,
boot_over='theta',
level=CI_level)
- Note that there is no need to estimate the nonlinear transformation when conducting hypothesis testing using
nl_causal
, yet if we are still interested in the model, we can further fit the nonlinear link function.
SIR.fit_link(Z1=snp.values, X1=gene_exp.values.flatten())
- After estimated the link function, we can use it to provide an estimation of any
gene_exp
IoR = np.arange(0, 1, 1./100)
link_IoR = SIR.link(X = IoR[:,None])
Then we can summarize the whole estimated models:
print('\n##### Causal Model of %s #####' %gene_code)
print('-'*20)
print('Estimated 2SIR Stage 1 model: \n theta: \n %s; \n link: \n %s' %(SIR.theta, 'SIR.link'))
print('-'*20)
print('Estimated 2SIR Stage 2 model: \n beta: %.3f; \n alpha: \n %s' %(SIR.beta, SIR.alpha))
print('-'*20)
print('p-value for causal inference: %.4f' %(SIR.p_value))
-
Following is the demo outcome for gene
ACTN4
:##### Causal Model of ACTN4 ##### -------------------- Estimated 2SIR Stage 1 model: theta: [0.6257 -0.0177 0.1491 -0.0347 0.0795 -0.0730 -0.2680 0.0991 0.1083 0.3236 -0.0579 -0.1197 -0.0384 -0.0614 -0.0104 -0.0327 0.2072 0.2560 -0.1272 0.2486 -0.2207 -0.0973 0.0885 0.1211 -0.1582 -0.1811 0.1612]; link: SIR.link -------------------- Estimated 2SIR Stage 2 model: beta: 0.009; alpha: [0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000] -------------------- p-value for causal inference: 0.5701