LSEA (Locus Set Enrichment Analysis) is a tool for performing gene set enrichment analysis on independent loci, taking into account LD (Linkage Disequilibrium).
LSEA can be applied to gene set enrichment analysis of data from GWAS summary statistics files in TSV format. It is based on a simple hypergeometric test; however, it transforms genes and gene sets into independent loci and sets of independent loci, which eliminates multiple signals from genes in LD and improves the precision of the analysis.
The tool includes a precompiled universe of independent loci based on data obtained from UK Biobank (https://www.ukbiobank.ac.uk/). Data for all heritable phenotypes (partitioned heritability p-value < 0.05) were processed with PLINK to obtain independent loci for each phenotype. All files were then combined into a universe by merging intervals that overlap by more than 60%.
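The merging step can be illustrated with a minimal sketch. It is not LSEA's actual code: the function names are hypothetical, and the overlap fraction is assumed here to be measured relative to the shorter of the two intervals, which the description above does not specify. Intervals are taken to lie on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome


def overlap_fraction(a: Interval, b: Interval) -> float:
    """Shared length divided by the length of the shorter interval (assumed definition)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    shorter = min(a[1] - a[0], b[1] - b[0])
    return overlap / shorter if shorter > 0 else 0.0


def merge_intervals(intervals: List[Interval], threshold: float = 0.6) -> List[Interval]:
    """Greedily merge sorted intervals whose overlap fraction exceeds the threshold."""
    merged: List[Interval] = []
    for current in sorted(intervals):
        if merged and overlap_fraction(merged[-1], current) > threshold:
            last = merged[-1]
            # Extend the previous locus so that it covers both intervals.
            merged[-1] = (last[0], max(last[1], current[1]))
        else:
            merged.append(current)
    return merged


# Illustrative coordinates only, not real loci.
loci = [(100, 200), (120, 210), (190, 300), (500, 600)]
print(merge_intervals(loci))
```

In this example the first two intervals share most of their length and are merged, while the third overlaps the merged locus only slightly and stays separate.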
- python (3.7 or higher)
- scipy (1.0.0 or higher)
- pandas (0.17.1 or higher)
- numpy (1.14.1 or higher)
- PLINK (1.07 or higher) - http://zzz.bwh.harvard.edu/plink/
- SnpEff (4.3T or higher) - http://snpeff.sourceforge.net/
To install this tool, clone this repository to your PC.
~$ git clone https://github.com/LSEA
First, prepare a TSV file from the GWAS summary statistics with the following structure:
CHR | COORDINATE | RSID | REF | ALT | PVAL |
9 | 136058188 | rs12216896 | C | T | 2.89651e-11 |
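As a rough sketch, such a file could be produced with pandas from summary statistics that use other column names. The source column names below (`chromosome`, `position`, and so on) are hypothetical, and it is an assumption that LSEA expects a header row with exactly the names shown above.

```python
import io

import pandas as pd

# Hypothetical summary statistics with non-LSEA column names.
raw = pd.DataFrame({
    "chromosome": [9],
    "position": [136058188],
    "snp": ["rs12216896"],
    "ref_allele": ["C"],
    "alt_allele": ["T"],
    "p": [2.89651e-11],
})

# Map the hypothetical source names onto the columns listed above.
column_map = {
    "chromosome": "CHR",
    "position": "COORDINATE",
    "snp": "RSID",
    "ref_allele": "REF",
    "alt_allele": "ALT",
    "p": "PVAL",
}
COLUMNS = ["CHR", "COORDINATE", "RSID", "REF", "ALT", "PVAL"]

prepared = raw.rename(columns=column_map)[COLUMNS]

# Written to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
prepared.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```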
To launch the tool, you also need to specify the paths to the PLINK and SnpEff directories.
~$ python3 LSEA.py -af <input tsv-file> -sn <path to SnpEff> -pld <path to plink> -bf <bfile for plink> -p
This command applies the LSEA algorithm to the input file and generates a TSV file with the following structure:
gene_set | p-value | q-value | enrich_description |
BIOCARTA_INTRINSIC_PATHWAY | 2.0446237642438122e-14 | 2.2517441515617103e-10 | (17776, 11, 36, 6, 'F11;FGB;FGA;F5;FGG;KLKB1') |
Note that the gene list can be shorter than the number of common loci, because only independent loci are counted in the analysis.
The -p (--precompiled) flag indicates that the precompiled universe of independent loci based on UK Biobank data is used.
Information about the HLA locus is excluded from the analysis because of the high ambiguity of LD scores within it.
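The hypergeometric test behind the p-value can be sketched with scipy. This is an illustration, not LSEA's own code. It assumes the numbers in enrich_description are, in order, the universe size, the number of query loci, the gene set size, and the number of common loci; this reading is a guess and is not documented here. The function name is hypothetical.

```python
from scipy.stats import hypergeom


def enrichment_pvalue(universe_size: int, set_size: int, query_size: int, overlap: int) -> float:
    """Probability of seeing at least `overlap` common loci by chance."""
    # sf(k - 1) gives P(X >= k) for the hypergeometric distribution.
    return float(hypergeom.sf(overlap - 1, universe_size, set_size, query_size))


# Numbers taken from the example row above, under the assumed field order.
p = enrichment_pvalue(universe_size=17776, set_size=36, query_size=11, overlap=6)
print(p)
```

The test is symmetric in the set size and the query size, so swapping those two values does not change the result. Under the assumed reading, the printed value can be compared with the p-value column of the example row.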
-af <input.tsv> Input file in tsv-format
-vf <input.vcf> Annotated vcf-file if it is already prepared
-pl <input.clumped> PLINK result of clumping (.clumped file) if it is already prepared
--precompiled, -p Use precompiled loci
-sn <path to SnpEff directory> Path to SnpEff
-g <genome> Flag for specifying genome for SnpEff annotation
-pld <path to PLINK directory> Path to PLINK
-bf <bfile> Bfile for PLINK
If you don't want to use the precompiled universe of independent loci, you can create your own universe from GWAS summary statistics files. Use the -cu (--create_universe) option to create a universe of independent loci from your data:
-af <input.tsv> -cu
For this option, you have to prepare the clumping results for your GWAS data (a .clumped file). If you have multiple files (e.g. for different phenotypes), use the -cld flag to specify the directory with the clumped files:
-cld <directory with clumped files>
- Anton Shikov - Initial work - anton-shikov
This project is free and available for everyone.