- Background
- Scale of current genomic datasets
- Stats on number of Biobanks
- Stats on UKBB
- Stats on growth from HapMap and 1KG for comparison
- Focus of paper
- Genotyping array data QC, control flow, and association modeling
- Scalability problems with single-node, in-memory tools
- Spark
- Tutorial
- Explain necessary GWAS toolkit operations
- Data management (filtering, merging, etc.)
- Liftover
- Summary statistics (HWE, AF, call rates, heterozygosity, etc.)
- Population stratification (IBS + MDS)
- Association analysis (logistic regression, Fisher's exact and chi-squared tests)
- Genetic relatedness
- Introduce Marees 2018 paper and explain how tutorial is re-implemented with other tools
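The summary statistics listed above (call rate, AF, heterozygosity, HWE) can be sketched for a single variant in plain Python. This is a minimal illustration, not the implementation used by PLINK, Glow, or Hail: the `MISSING` sentinel, the 0/1/2 alternate-allele-count encoding, and the function name `variant_stats` are assumptions made for the example, and the HWE p-value uses a 1-degree-of-freedom chi-squared approximation rather than an exact test.

```python
import math

MISSING = -1  # assumed sentinel for a missing genotype call


def variant_stats(calls):
    """Summarize one variant from per-sample alt-allele counts (0, 1, 2, or MISSING)."""
    called = [g for g in calls if g != MISSING]
    n = len(called)
    nan = float("nan")
    if n == 0:
        return {"call_rate": 0.0, "af": nan, "het_rate": nan, "hwe_p": nan}

    n_het = called.count(1)
    n_hom_alt = called.count(2)
    n_hom_ref = n - n_het - n_hom_alt

    # Alternate allele frequency: each sample carries two alleles.
    p = (2 * n_hom_alt + n_het) / (2 * n)
    q = 1.0 - p

    if p in (0.0, 1.0):
        hwe_p = 1.0  # monomorphic variant: no deviation can be measured
    else:
        # Expected genotype counts under Hardy-Weinberg equilibrium.
        expected = [n * q * q, 2 * n * p * q, n * p * p]
        observed = [n_hom_ref, n_het, n_hom_alt]
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # Survival function of a chi-squared distribution with 1 df.
        hwe_p = math.erfc(math.sqrt(chi2 / 2.0))

    return {
        "call_rate": n / len(calls),
        "af": p,
        "het_rate": n_het / n,
        "hwe_p": hwe_p,
    }


stats = variant_stats([0, 1, 1, 2, MISSING])
```

Filtering then reduces to comparing these values against thresholds chosen for the analysis, for example dropping variants whose `call_rate` or `hwe_p` falls below a cutoff.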
- Tools
- Primary: PLINK, Glow, Hail
- Secondary: dask, bigsnpr, scikit-allel, pysnptools/fastlmm
- Modin may be worth mentioning, even though out-of-core isn't really supported
- Datasets
- HapMap
- 1KG
- 3K rice genome
- Data Formats
- PLINK, BGEN, Parquet, Hail MatrixTable (MT), VCF, HDF5/NPZ/Zarr
- Explain encoding and compression concerns
- Results
- Code
- Figure: flow chart of Marees analysis
- Explain what the resulting code for this project does
- Figure: side-by-side comparison of code examples
- Primary differences:
- Hail is an API over opaque data structures and implementations
- Glow is simply a convention for representing genetic data in a Spark Dataset, with accompanying methods
- PLINK is a very large collection of parameterized commands exposed through a single-node CLI
- Operation differences
- Operations that are similarly easy with all 3 toolkits:
- call rate filtering
- heterozygosity rate filtering
- HWE filtering
- AF filtering
- LD Pruning: non-existent in Glow, very slow in Hail
- However, Glow does support inline calls to PLINK (albeit very awkwardly since PLINK is not streaming software)
- Sex imputation: automatic in PLINK and Hail, manual in Glow
- PLINK and Hail both use the inbreeding coefficient computed on X chromosome data
- The Glow approach (and Marees 2018) looks at the homozygosity rate on X instead
- Liftover
- Not available in PLINK
- Hail supports coordinate liftover only (not variant liftover)
- Requires a chain file for the destination reference genome
- Glow supports both coordinate liftover and variant liftover
- Requires a chain file and a reference fasta for the destination genome
- see the notebook at https://glow.readthedocs.io/en/latest/etl/lift-over.html for specification of chain and reference files
- PCA for population stratification: simple in PLINK and Hail, non-existent in Glow
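The homozygosity-rate approach to sex imputation noted above can be sketched as follows. This is an illustrative sketch only: the function names, the `MISSING` sentinel, and especially the `male_min` and `female_max` thresholds are hypothetical placeholders rather than values taken from Glow, PLINK, or Marees 2018. It assumes calls on the non-pseudoautosomal X are encoded as alternate-allele counts, so that males appear homozygous at nearly every site.

```python
MISSING = -1  # assumed sentinel for a missing genotype call


def x_homozygosity_rate(x_calls):
    """Fraction of called X-chromosome genotypes that are homozygous (0 or 2)."""
    called = [g for g in x_calls if g != MISSING]
    if not called:
        return None
    return sum(1 for g in called if g != 1) / len(called)


def impute_sex(x_calls, male_min=0.99, female_max=0.95):
    """Classify one sample from its X calls; both thresholds are placeholders."""
    rate = x_homozygosity_rate(x_calls)
    if rate is None:
        return "unknown"
    if rate >= male_min:
        return "male"
    if rate <= female_max:
        return "female"
    return "ambiguous"


labels = [
    impute_sex([0, 2, 2, 0, MISSING]),  # no heterozygous calls
    impute_sex([0, 1, 1, 2]),           # half of the calls are heterozygous
    impute_sex([MISSING, MISSING]),     # nothing to base a call on
]
```

In practice the thresholds would be chosen by inspecting the distribution of rates across samples, and samples whose imputed sex disagrees with the reported sex would be flagged for review.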
- Usability:
- Glow's source code is the easiest of the three to extend and to learn algorithms from
- Hail documentation is fairly thorough, but essentially no examples or answers are available outside that documentation and the Hail discussion forum
- PLINK examples and documentation are extensive and easy to find; however, most PLINK workflows also need a scripting language in the same pipeline to interpret and visualize results (those outputs are then often used to parameterize later PLINK commands, as exemplified by the tutorials in the project)
- There are many ways to do the same thing in Hail and it is difficult to know which method to choose (cf. https://discuss.hail.is/t/issues-with-sample-and-variant-qc-by-group/1286/5)
- Data
- Figure: File Format Comparison
- Show comparison of file sizes by dataset and format
- Performance
- Figure: Times for operations by dataset and toolkit
- Touch on vectorization support in Breeze (some operations call LAPACK through JNI, but simpler ones like sums do not) as compared to NumPy
- Explain benefits of bitpacking (modify the 1KG dask notebook to include a step without the GeneticBitPacking filter and compare to the original)
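To make the bitpacking point concrete, the sketch below packs genotype codes into 2 bits each, four calls per byte. It is a minimal illustration of the general idea and is not claimed to match the layout used by the GeneticBitPacking filter; the code assignment (0, 1, 2 for alternate-allele counts and 3 for missing) and the function names are assumptions made for the example.

```python
def pack_genotypes(calls):
    """Pack genotype codes in the range 0..3 into 2 bits each, four per byte."""
    packed = bytearray((len(calls) + 3) // 4)
    for i, g in enumerate(calls):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        # Slot i % 4 within the byte occupies bits 2*(i % 4) and 2*(i % 4) + 1.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


calls = [0, 1, 2, 3, 2]
packed = pack_genotypes(calls)
restored = unpack_genotypes(packed, len(calls))
```

Compared with storing one byte per call, this layout cuts the raw size by a factor of four before any general-purpose compressor is applied, which is the effect the notebook comparison is meant to measure.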
- Discussion
- Computational operations needed in GWAS analyses (see "Computational Operations" in notes)
- This may be a good place to characterize all operations and what matrix functions support them