Outline

Background
- Scale of current genomic datasets
  - Stats on number of Biobanks
  - Stats on UKBB
  - Stats on growth from HapMap and 1KG for comparison
- Focus of paper
  - Genotyping array data QC, control flow, and association modeling
  - Scalability problems with single-node, in-memory tools
  - Spark
- Tutorial
  - Explain necessary GWAS toolkit operations
    - Data management (filtering, merging, etc.)
    - Liftover
    - Summary statistics (HWE, AF, call rates, heterozygosity, etc.)
    - Population stratification (IBS + MDS)
    - Association analysis (logreg, fisher/chisq)
    - Genetic relatedness
  - Introduce Marees 2018 paper and explain how tutorial is re-implemented with other tools
- Tools
  - Primary: PLINK, Glow, Hail
  - Secondary: dask, bigsnpr, scikit-allel, pysnptools/fastlmm
    - Modin may be worth mentioning, even though out-of-core isn't really supported
- Datasets
  - HapMap
  - 1KG
  - 3K rice genome
- Data Formats
  - plink, bgen, parquet, Hail MT, vcf, hdf5/npz/zarr
  - Explain encoding and compression concerns
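
The summary statistics listed under the tutorial operations (call rate, AF, heterozygosity, HWE) can be illustrated with a minimal pure-Python sketch. It assumes genotypes coded as alternate-allele counts (0, 1, 2, with `None` for a missing call); the function name `variant_stats` is hypothetical and is not part of PLINK, Hail, or Glow. The HWE statistic here is a plain chi-square against Hardy-Weinberg expectations, used as a simple stand-in rather than as the test any particular toolkit implements.

```python
def variant_stats(calls):
    """Summary statistics for one variant.

    `calls` holds one alternate-allele count per sample:
    0, 1, 2, or None when the call is missing (assumed encoding).
    """
    observed = [c for c in calls if c is not None]
    n = len(observed)
    if n == 0:
        raise ValueError("variant has no observed calls")

    n_hom_ref = observed.count(0)
    n_het = observed.count(1)
    n_hom_alt = observed.count(2)

    call_rate = n / len(calls)                   # fraction of non-missing calls
    alt_af = (n_het + 2 * n_hom_alt) / (2 * n)   # alternate allele frequency
    het_rate = n_het / n                         # observed heterozygosity
    ref_af = 1.0 - alt_af

    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (n * ref_af ** 2, 2 * n * ref_af * alt_af, n * alt_af ** 2)
    counts = (n_hom_ref, n_het, n_hom_alt)
    hwe_chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected) if e > 0)

    return {
        "call_rate": call_rate,
        "alt_af": alt_af,
        "het_rate": het_rate,
        "hwe_chi2": hwe_chi2,
    }


stats = variant_stats([0, 1, 2, 1, None, 0, 0, 1])
print(stats)
```

Filtering then reduces to comparing these values against thresholds chosen by the analyst, which is why these steps are similarly simple across toolkits.
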
Results
- Code
  - Figure: flow chart of Marees analysis
    - Explain what the resulting code for this project does
  - Figure: side-by-side comparison of code examples
    - Primary differences:
      - Hail is an API over opaque data structures and implementations
      - Glow is simply a convention for representing genetic data in a Spark Dataset, with accompanying methods
      - PLINK is a gigantic list of parameterized commands for a single-core, single-node CLI
    - Operation differences
      - Operations that are similarly easy with all 3 toolkits:
        - Call rate filtering
        - Heterozygosity rate filtering
        - HWE filtering
        - AF filtering
      - LD Pruning: non-existent in Glow, very slow in Hail
        - However, Glow does support inline calls to PLINK (albeit very awkwardly, since PLINK is not streaming software)
      - Gender imputation: PLINK=automatic, Hail=automatic, Glow=manual
        - PLINK and Hail both use the inbreeding coefficient on X chromosome data
        - The Glow approach (like Marees 2018) looks at the homozygosity rate on X instead
      - Liftover
        - Not available in PLINK
        - Hail supports coordinate liftover only (not variant liftover)
          - Requires a chain file for the destination reference genome
        - Glow supports both coordinate liftover and variant liftover
          - Requires a chain file and a reference fasta for the destination genome
          - See the notebook at https://glow.readthedocs.io/en/latest/etl/lift-over.html for specification of chain and reference files
      - PCA for population stratification: simple in PLINK and Hail, non-existent in Glow
  - Usability:
    - Of all the tools, Glow has the source code that is easiest to extend and to learn algorithms from
    - Hail documentation is fairly thorough, though essentially no examples or answers are available outside of that documentation (or the Hail Discourse forum)
    - PLINK examples and documentation are both very extensive and easily found; however, most PLINK workflows also need a scripting language in the same pipeline to interpret and visualize results (those outputs are then often used to parameterize other PLINK commands, as exemplified by the tutorials in the project)
    - There are many ways to do the same thing in Hail and it is difficult to know which method to choose (cf. https://discuss.hail.is/t/issues-with-sample-and-variant-qc-by-group/1286/5)
- Data
  - Figure: File Format Comparison
    - Show comparison of file sizes by dataset and format
- Performance
  - Figure: Times for operations by dataset and toolkit
  - Touch on vectorization support in Breeze (some ops call LAPACK through JNI, but simpler ones like sums do not) as compared to NumPy
  - Explain benefits of bitpacking (modify 1KG dask nb to have step without GeneticBitPacking filter and compare to original)
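
To make the bitpacking point concrete, here is a minimal pure-Python sketch of the general idea: a hard genotype call takes one of four values, so four calls fit in one byte instead of one call per byte. The code assignments (0, 1, 2 for allele counts, 3 for missing), the bit order, and the function names are assumptions of this sketch; it is not the implementation of the GeneticBitPacking filter used in the 1KG notebook.

```python
MISSING = 3  # 2-bit code reserved for a missing call (assumption of this sketch)


def pack_genotypes(calls):
    """Pack calls in {0, 1, 2, 3} into bytes, four calls per byte.

    The first call of each group of four occupies the two lowest bits.
    """
    packed = bytearray((len(calls) + 3) // 4)
    for i, call in enumerate(calls):
        if not 0 <= call <= 3:
            raise ValueError(f"call {call!r} does not fit in 2 bits")
        packed[i // 4] |= call << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n_calls):
    """Recover the first `n_calls` 2-bit calls from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n_calls)]


calls = [0, 1, 2, MISSING, 2]
packed = pack_genotypes(calls)
assert unpack_genotypes(packed, len(calls)) == calls
print(len(calls), "calls stored in", len(packed), "bytes")
```

Relative to one byte per call, this is a fixed 4x reduction before any general-purpose compressor runs, which is the effect the comparison with and without the filter would measure.
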
Discussion
- Computational operations needed in GWAS analyses (see "Computational Operations" in notes)
  - This may be a good place to characterize all operations and what matrix functions support them
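
One way to frame that characterization, sketched here under simplifying assumptions: with genotypes held as a samples-by-variants matrix of alternate-allele counts and no missing data, allele frequencies are column reductions, and pairwise sample similarity is a matrix product of the centered matrix with its transpose. The function names are hypothetical, and the unscaled centered product is a simplification of the relatedness matrices real tools compute, which typically also scale each variant by its allele-frequency variance.

```python
def allele_frequencies(genotypes):
    """Column reduction: alternate allele frequency per variant."""
    n_samples = len(genotypes)
    return [sum(column) / (2 * n_samples) for column in zip(*genotypes)]


def centered(genotypes):
    """Subtract each variant's mean allele count (2 * AF) from its column."""
    means = [2 * af for af in allele_frequencies(genotypes)]
    return [[g - m for g, m in zip(row, means)] for row in genotypes]


def sample_similarity(genotypes):
    """Matrix product X X^T / n_variants on the centered matrix X."""
    x = centered(genotypes)
    n_variants = len(x[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n_variants for row_j in x]
        for row_i in x
    ]


# Rows are samples, columns are variants (toy data, no missing calls).
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
]
print(allele_frequencies(genotypes))
print(sample_similarity(genotypes))
```

In this framing, the QC statistics are row or column reductions, while relatedness and PCA-style stratification analyses depend on matrix products and decompositions.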
