02.outline.md

File metadata and controls

75 lines (74 loc) · 4.21 KB

Outline

  • Background
    • Scale of current genomic datasets
      • Stats on number of Biobanks
      • Stats on UKBB
      • Stats on growth from HapMap and 1KG for comparison
    • Focus of paper
      • Genotyping array data QC, control flow, and association modeling
      • Scalability problems with single-node, in-memory tools
      • Spark
    • Tutorial
      • Explain necessary GWAS toolkit operations
        • Data management (filtering, merging, etc.)
        • Liftover
        • Summary statistics (HWE, AF, call rates, heterozygosity, etc.)
        • Population stratification (IBS + MDS)
        • Association analysis (logistic regression, Fisher's exact and chi-squared tests)
        • Genetic relatedness
      • Introduce Marees 2018 paper and explain how tutorial is re-implemented with other tools
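As a concrete reference point for the association analysis operation listed above, the following is a minimal NumPy sketch of an allelic chi-squared test for a single variant. The function name `allelic_chisq`, the 0/1/2 genotype coding, and the use of -1 for missing calls are assumptions made for illustration; they are not taken from PLINK, Hail, or Glow.

```python
import numpy as np

def allelic_chisq(genotypes, is_case):
    """Allelic 2x2 chi-squared statistic (1 d.f.) for one variant.

    genotypes: alternate-allele count per sample (0, 1, 2); -1 marks a missing call
    is_case:   boolean per sample, True for cases and False for controls
    Assumes the variant is polymorphic and both groups have called samples,
    otherwise an expected count is zero and the statistic is undefined.
    """
    g = np.asarray(genotypes)
    case = np.asarray(is_case, dtype=bool)
    called = g >= 0

    # Rows: cases, controls. Columns: alternate alleles, reference alleles.
    table = np.zeros((2, 2))
    for row, mask in enumerate((case & called, ~case & called)):
        alt = g[mask].sum()
        table[row] = (alt, 2 * mask.sum() - alt)

    # Expected counts under independence of allele and phenotype.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())
```

For example, `allelic_chisq([2, 2, 1, 1, 0, 0, 0, 0], [True] * 4 + [False] * 4)` builds the table `[[6, 2], [0, 8]]` and returns 9.6.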
    • Tools
      • Primary: PLINK, Glow, Hail
      • Secondary: dask, bigsnpr, scikit-allel, pysnptools/fastlmm
        • Modin may be worth mentioning, even though out-of-core isn't really supported
    • Datasets
      • HapMap
      • 1KG
      • 3K rice genome
    • Data Formats
      • PLINK, BGEN, Parquet, Hail MatrixTable (MT), VCF, HDF5/NPZ/Zarr
      • Explain encoding and compression concerns
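To make the encoding concern concrete, here is a small sketch of 2-bit genotype packing, which stores four calls per byte in the same spirit as the PLINK binary format. The code values used here (0 to 2 for allele counts, 3 for missing) and the function names `pack_2bit` and `unpack_2bit` are illustrative assumptions; they are not PLINK's actual bit layout.

```python
import numpy as np

# Bit offsets of the four 2-bit slots inside one byte.
_SHIFTS = np.array([0, 2, 4, 6], dtype=np.uint8)

def pack_2bit(calls):
    """Pack genotype codes in 0..3 into bytes, four codes per byte."""
    calls = np.asarray(calls, dtype=np.uint8)
    if calls.max(initial=0) > 3:
        raise ValueError("codes must fit in 2 bits")
    pad = (-calls.size) % 4  # zero-pad up to a multiple of four codes
    padded = np.concatenate([calls, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    return np.bitwise_or.reduce(padded << _SHIFTS, axis=1)

def unpack_2bit(packed, n):
    """Recover the first n genotype codes from a packed byte array."""
    packed = np.asarray(packed, dtype=np.uint8)
    return ((packed[:, None] >> _SHIFTS) & 3).reshape(-1)[:n]
```

With one byte per call, five genotypes take five bytes; `pack_2bit([0, 1, 2, 3, 2])` takes two, and `unpack_2bit` restores the original codes as long as the call count is stored alongside the packed bytes.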
  • Results
    • Code
      • Figure: flow chart of Marees analysis
        • Explain what the resulting code for this project does
      • Figure: side-by-side comparison of code examples
        • Primary differences:
          • Hail is an API over opaque data structures and implementations
          • Glow is simply a convention for representing genetic data in a Spark Dataset, with accompanying methods
          • PLINK is a gigantic list of parameterized commands for a single-core, single-node CLI
        • Operation differences
          • Operations that are similarly easy with all 3 toolkits:
            • call rate filtering
            • heterozygosity rate filtering
            • HWE filtering
            • AF filtering
          • LD Pruning: non-existent in Glow, very slow in Hail
            • However, Glow does support inline calls to PLINK (albeit very awkwardly since PLINK is not streaming software)
          • Sex imputation: PLINK=automatic, Hail=automatic, Glow=manual
            • PLINK and Hail both use the inbreeding coefficient computed on X chromosome data
            • The Glow approach (like Marees 2018) looks at the homozygosity rate on X instead
          • Liftover
            • Not available in PLINK
            • Hail supports coordinate liftover only (not variant liftover)
              • Requires a chain file for the destination reference genome
            • Glow supports both coordinate liftover and variant liftover
          • PCA for population stratification: simple in PLINK and Hail, non-existent in Glow
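The filters listed above as similarly easy in all three toolkits (call rate, heterozygosity rate, HWE, AF) all derive from per-variant genotype counts. The sketch below computes them with NumPy under stated assumptions: a variants-by-samples matrix of alternate-allele counts with -1 for missing calls, a hypothetical function name `variant_qc`, and a chi-squared goodness-of-fit statistic for HWE as a simple stand-in (PLINK, for example, reports an exact test instead).

```python
import numpy as np

def variant_qc(G):
    """Per-variant QC statistics for a (variants x samples) genotype matrix.

    G holds alternate-allele counts 0/1/2, with -1 marking a missing call.
    Variants with no called samples produce NaN frequencies and rates.
    """
    G = np.asarray(G)
    n = (G >= 0).sum(axis=1)          # called samples per variant
    n_hom_ref = (G == 0).sum(axis=1)
    n_het = (G == 1).sum(axis=1)
    n_hom_alt = (G == 2).sum(axis=1)

    with np.errstate(divide="ignore", invalid="ignore"):
        call_rate = n / G.shape[1]
        af = (n_het + 2 * n_hom_alt) / (2 * n)
        het_rate = n_het / n
        # Observed vs expected genotype counts under Hardy-Weinberg equilibrium.
        obs = np.stack([n_hom_ref, n_het, n_hom_alt])
        exp = np.stack([n * (1 - af) ** 2, 2 * n * af * (1 - af), n * af ** 2])
        # Terms with a zero expected count are 0/0 and are skipped by nansum.
        hwe_chisq = np.nansum((obs - exp) ** 2 / exp, axis=0)

    return {"call_rate": call_rate, "af": af,
            "het_rate": het_rate, "hwe_chisq": hwe_chisq}
```

A filter is then a boolean mask over variants, for instance `keep = (qc["call_rate"] >= 0.98) & (qc["af"] >= 0.05)`, where both thresholds are placeholders rather than recommended values.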
      • Usability:
        • Glow's source code is the easiest of all the tools to extend and to learn algorithms from
        • Hail documentation is fairly thorough, though essentially no examples or answers are available outside of that documentation or the Hail Discourse forum
        • PLINK examples and documentation are both extensive and easy to find; however, most PLINK workflows also need a scripting language in the same pipeline to interpret and visualize results (those outputs are then often used to parameterize other PLINK commands, as exemplified by the tutorials in the project)
        • There are many ways to do the same thing in Hail and it is difficult to know which method to choose (cf. https://discuss.hail.is/t/issues-with-sample-and-variant-qc-by-group/1286/5)
    • Data
      • Figure: File Format Comparison
        • Show comparison of file sizes by dataset and format
    • Performance
      • Figure: Times for operations by dataset and toolkit
      • Touch on vectorization support in Breeze (some ops call LAPACK through JNI, but simpler ones like sums do not) as compared to NumPy
      • Explain benefits of bitpacking (modify the 1KG Dask notebook to add a step without the GeneticBitPacking filter and compare to the original)
  • Discussion
    • Computational operations needed in GWAS analyses (see "Computational Operations" in notes)
      • This may be a good place to characterize all operations and what matrix functions support them
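One way to begin that characterization is to show how a few GWAS operations reduce to standard matrix functions. The sketch below assumes a complete (no missing calls) variants-by-samples matrix of 0/1/2 allele counts with no monomorphic variants; the function names are hypothetical, and the standardization shown is one common convention rather than the one any particular toolkit is confirmed to use.

```python
import numpy as np

def standardize(G):
    """Center and scale each variant (row): elementwise ops plus a row mean."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=1, keepdims=True) / 2   # alternate-allele frequency
    return (G - 2 * p) / np.sqrt(2 * p * (1 - p))

def grm(G):
    """Genetic relationship matrix (samples x samples): one matrix product."""
    Z = standardize(G)
    return Z.T @ Z / Z.shape[0]

def pca_scores(G, k=2):
    """Top-k principal component scores per sample: one symmetric eigensolve."""
    vals, vecs = np.linalg.eigh(grm(G))
    order = np.argsort(vals)[::-1][:k]       # largest eigenvalues first
    # Clip tiny negative eigenvalues caused by floating-point round-off.
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))
```

Under these assumptions, relatedness needs a row reduction and a matrix multiply, and population-stratification PCA adds an eigendecomposition, so the support each toolkit offers for those three primitives largely determines what it can express.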