GESim (V0.1) is a tool to simulate gene expression from genotype/haplotype data:
- Simulations under a null model with no variants having an effect.
- Simulations based on one SNP as the causal genetic architecture.
- Simulations based on an interaction of a SNP pair as the causal genetic architecture.
- Simulations based on an additive impact of a SNP pair as the causal genetic architecture.
- Simulations based on a haplotype stretch within a block as a causal genetic architecture.
This version (V0.1) has only been lightly tested; further testing is planned.
- The VCF file should contain only biallelic variants (exactly two alleles per variant).
- The VCF file should not contain missing genotypes. Any missing genotype is replaced by a homozygous reference genotype.
- A warning is issued if the squared Pearson correlation coefficient exceeds 0.8 between the two SNPs of a pair, or between either SNP of the pair and the encoded combined effect (additive or interaction).
- If haplotype-based simulations are required, the VCF file should be phased (alleles separated by '|').
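The input requirements above can be checked before running GESim. The following is a minimal Python sketch and is not part of GESim: the function name check_vcf_records is hypothetical, and it assumes diploid samples with GT as the first entry of each sample column.

```python
def check_vcf_records(lines):
    """Count VCF records that violate the input requirements listed above."""
    issues = {"multiallelic": 0, "missing": 0, "unphased": 0}
    for line in lines:
        # Skip blank lines, meta lines (##) and the header line (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        alt = fields[4]
        # Assumption: GT is the first colon-separated entry of each sample column.
        genotypes = [sample.split(":")[0] for sample in fields[9:]]
        if "," in alt:
            issues["multiallelic"] += 1  # more than one ALT allele
        if any("." in gt for gt in genotypes):
            issues["missing"] += 1  # at least one missing allele call
        if any("|" not in gt for gt in genotypes):
            issues["unphased"] += 1  # at least one genotype lacks '|'
    return issues


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1",
    "1\t200\trs2\tC\tT,G\t.\tPASS\t.\tGT\t0|2\t0|0",
    "1\t300\trs3\tG\tA\t.\tPASS\t.\tGT\t0/1\t.|.",
]
report = check_vcf_records(example)
print(report)
```

For a real file, the same function can be applied to an open file handle, for example check_vcf_records(open("example/variants.vcf")). A record is counted once per category even if several samples are affected.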
- R version 3.4.4 (2018-03-15) or later.
- The optparse R package.
Rscript GESim.R -i example/variants.vcf -s example/snps.txt --pair_a example/pairs.txt --pair_i example/pairs.txt --hap example/haps.txt --random 0 --h2 0.05 -o example/out/out
Please see the sections below for more details and examples. Parameters and options can be accessed using the help command.
Rscript GESim.R --help
- -i or --vcf: Haplotype/genotype file path (.vcf). If simulations based on haplotypes are required, the alleles must be phased and '|' separated.
- -o or --out: Output file path, including the prefix used for the names of the output files.
- -s or --snp: SNP file path. It contains one column with the SNP index (the order of the SNP) in the VCF file. See the sample files for the format.
- --pair_a: SNP pairs file path to be used for simulations based on the additive impact of a SNP pair. It contains two tab-separated columns, each holding a SNP index (the order of the SNP) in the VCF file. See the sample files for the format.
- --pair_i: SNP pairs file path to be used for SNP interaction-based simulations. It contains two tab-separated columns, each holding a SNP index (the order of the SNP) in the VCF file. See the sample files for the format.
- --hap: Haplotype file path to be used for haplotype-based simulations. It contains three tab-separated columns: the SNP marking the beginning of the haplotype, the SNP marking the end of the haplotype, and the haplotype stretch used for encoding. For example, to simulate gene expression for the haplotype 01101 within the block delimited by the 5th and 9th SNPs in the VCF file, the line in this file should be 5\t9\t01101 (where \t is a tab). See the sample files for the format.
- --h2: Heritability value between 0 and 1. It refers to the proportion of the expression variation caused by the genetic architecture. Default is 0.05.
- --random: Number of simulations with no causal genetic architecture. Default is 0, which means no simulations of this type.
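To make the --hap and --h2 options more concrete, the following Python sketch shows one way a haplotype line could be encoded and a heritability applied. It is an illustration under stated assumptions, not GESim's actual R implementation: all function names are hypothetical, the haplotype is encoded as the number of matching copies per individual (0, 1 or 2), and expression is a standardized genetic term plus Gaussian noise.

```python
import math
import random


def parse_hap_line(line):
    """Split one --hap line (start, end, haplotype; tab-separated)."""
    start, end, target = line.rstrip("\n").split("\t")
    return int(start), int(end), target


def haplotype_dosage(hap_pairs, start, end, target):
    """Per individual, count the haplotypes that carry `target` over
    SNPs start..end (1-based, inclusive). Assumed encoding."""
    if len(target) != end - start + 1:
        raise ValueError("haplotype length must match the block size")
    return [sum(h[start - 1:end] == target for h in pair) for pair in hap_pairs]


def simulate_expression(dosage, h2, rng):
    """Mix a standardized genetic term with Gaussian noise so that the
    genetic term explains roughly h2 of the expression variance."""
    n = len(dosage)
    mean = sum(dosage) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in dosage) / n)
    if sd == 0:
        raise ValueError("dosage has no variation")
    return [math.sqrt(h2) * (d - mean) / sd
            + math.sqrt(1 - h2) * rng.gauss(0, 1) for d in dosage]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Three individuals, each with two phased haplotypes over nine SNPs.
hap_pairs = [
    ("000001101", "111101101"),
    ("000001101", "000000000"),
    ("111111111", "000000000"),
]
start, end, target = parse_hap_line("5\t9\t01101")
dosage = haplotype_dosage(hap_pairs, start, end, target)
print(dosage)  # [2, 1, 0]

# Larger synthetic sample to see the effect of h2 (values are random).
rng = random.Random(1)
big_dosage = [rng.choice([0, 1, 2]) for _ in range(2000)]
expression = simulate_expression(big_dosage, 0.05, rng)
r2 = r_squared(big_dosage, expression)
print(round(r2, 3))
```

With a large sample, the squared correlation between the encoded dosage and the simulated expression should fall near the requested h2. The same squared Pearson correlation is the quantity compared against 0.8 in the SNP-pair warning described in the notes.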
Al Bkhetan, Ziad, et al. "eQTLHap: a tool for comprehensive eQTL analysis considering haplotypic and genotypic effects." Briefings in Bioinformatics (2021).
Copyright 2021 Ziad Al Bkhetan
Licensed under the GNU GENERAL PUBLIC LICENSE (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://github.com/ziadbkh/GESim/blob/main/LICENSE
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
For any help or inquiries, please contact: [email protected]