NIST-GREX (Genome Reference EXploration)

Team: Bharati, Divya, Eddy, Nathan, Rajarshi, Sina, Philippe and Pilar

GIAB Benchmarks

Genome in a Bottle (GIAB) provides benchmark data sets for small variants and structural variants. The GIAB Consortium is a public-private-academic consortium hosted by NIST. It develops technical infrastructure, such as reference standards, methods and data, to enable the translation of whole human genome sequencing to clinical practice and to support innovations in technologies. To this end, GIAB provides a pilot genome based on the HapMap Project and the Personal Genome Project. Genome data is publicly available via FTP.

#TODO: Releases evaluated against a variety of call sets Resource development process described in manuscripts and pre-prints

Stratification

In addition to the reference genome, GIAB also provides stratification files in the form of BED files. These files are intended to be used to evaluate the behavior of bioinformatic tools in a standardized way, for example in difficult regions of the human genome. The stratification files are divided into the following types: Low Complexity, Functional Technically Difficult, Genome Specific, Functional Regions, GC Content, Mappability, Other Difficult, Segmental Duplications, Union, Ancestry and XY.
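As an illustration of how a stratification BED file can be used, the following minimal sketch checks whether a variant position falls inside a stratified region. It assumes standard BED coordinates (0-based, half-open) and 1-based variant positions as in VCF. The function names `parse_bed` and `in_stratification` are hypothetical helpers written for this example and are not part of any GIAB tooling.

```python
import bisect
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}.

    Coordinates are kept as in BED: 0-based, half-open.
    """
    raw = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        raw[chrom].append((int(start), int(end)))

    regions = {}
    for chrom, intervals in raw.items():
        merged = []
        for start, end in sorted(intervals):
            # Merge overlapping or touching intervals so lookups stay simple.
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        regions[chrom] = merged
    return regions


def in_stratification(regions, chrom, pos_1based):
    """Return True if a 1-based position lies inside any interval."""
    intervals = regions.get(chrom, [])
    pos0 = pos_1based - 1  # convert VCF-style position to 0-based
    # Index of the last interval whose start is <= pos0.
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


# Hypothetical example intervals (not real GIAB data).
example_bed = ["chr1\t100\t200", "chr1\t150\t250", "chr2\t0\t10"]
example_regions = parse_bed(example_bed)
```

Merging the intervals up front means a single binary search per query is enough, which matters when many variants are checked against one stratification.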

For this project, the following types are being investigated in more detail for the mappability of predictions onto T2T CHM13 and GRCh38:

| Type | Description | Number of Stratifications |
| --- | --- | --- |
| Low Complexity | Regions with different types and sizes of low complexity sequence, e.g., homopolymers, STRs, VNTRs and other locally repetitive sequences. | GRCh37 (28), GRCh38 (28), CHM13v2.0 (27) |
| Segmental Duplication | Regions with segmental duplications or regions with non-trivial self-chain alignments. | GRCh37 (9), GRCh38 (9), CHM13v2.0 (2) |
| GC Content | Regions with different ranges (%) of GC content. | GRCh37 (14) and GRCh38 (14) |
| TR | #TODO | #TODO |

The stratification files can also be downloaded via the GIAB genome stratification FTP.

Flowchart

[Figure: Flowchart]

Tasks

Stratification BED file generation

| No. | Task | Responsible Person | Link |
| --- | --- | --- | --- |
| S1. | Mappability for CHM13v2.0 reference genome | Pilar | S1 |
| S2. | GC content for CHM13v2 | Eddy | S2 |
| S3. | Gene coding sequence for CHM13v2 | Divya | S3 |
| S4. | Other difficult regions for CHM13v2 | Rajarshi | S4 |
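To make the idea of a GC content stratification concrete, here is a minimal sketch that scans a sequence in fixed windows and reports the windows whose GC fraction lies in a given range as BED-style intervals. The window size and the GC range are placeholder values chosen for the example. The source does not state the parameters used by GIAB or by task S2, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """GC fraction over A/C/G/T bases only; None if the window has none (e.g. all N)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def gc_windows_to_bed(chrom, seq, window=100, lo=0.0, hi=0.25):
    """Return merged (chrom, start, end) intervals for windows with lo <= GC < hi.

    Coordinates are 0-based, half-open, as in BED. `window`, `lo` and `hi`
    are illustrative assumptions, not GIAB's settings.
    """
    intervals = []
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        gc = gc_fraction(seq[start:end])
        if gc is None or not (lo <= gc < hi):
            continue
        # Extend the previous interval when windows are adjacent.
        if intervals and intervals[-1][2] == start:
            intervals[-1] = (chrom, intervals[-1][1], end)
        else:
            intervals.append((chrom, start, end))
    return intervals


# Toy sequence: 10 bp AT-only, 10 bp GC-only, 20 bp AT-only.
toy_seq = "AT" * 5 + "GC" * 5 + "AT" * 10
low_gc = gc_windows_to_bed("chrToy", toy_seq, window=10, lo=0.0, hi=0.25)
high_gc = gc_windows_to_bed("chrToy", toy_seq, window=10, lo=0.75, hi=1.01)
```

Each tuple can be written out as one tab-separated BED line. A real run would read the sequence from a FASTA file of the reference instead of a toy string.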

Feature engineering

| No. | Task | Responsible Person | Link |
| --- | --- | --- | --- |
| S5. | Tandem Repeat (TR) Feature | Bharati, Divya and Nathan | S5 |
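The source does not describe how the TR feature is computed. Purely as an illustration of the simplest repeat class mentioned above (homopolymers, i.e. period-1 repeats), the sketch below locates homopolymer runs in a sequence with a regular expression. The function name and the `min_len` threshold are assumptions made for this example, not the project's method.

```python
import re


def homopolymer_runs(seq, min_len=4):
    """Return (start, end, base) for single-base runs of at least min_len.

    Coordinates are 0-based, half-open. `min_len` is an illustrative
    threshold, not a value taken from the project.
    """
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


# Toy sequence with an A run of length 5 and a G run of length 4.
toy_runs = homopolymer_runs("ACGTAAAAACGGGGT", min_len=4)
```

Longer repeat units (STRs, VNTRs) need a dedicated tandem repeat finder. This sketch only covers the period-1 case.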

Exploratory analysis

| No. | Task | Responsible Person | Link |
| --- | --- | --- | --- |
| S6. | Distribution of variants in tandem repeats and homopolymers | Nathan | S6 |
| S7. | Distribution of adjacent variants | Sina | S7 |
| S8. | Coverage of variants from a BAM file | Philippe | S8 |
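As a rough illustration of what a distribution of adjacent variants could look like (task S7), the sketch below computes the distance between consecutive variant positions on each chromosome and counts them in coarse bins. The input format, the bin edges and the function names are assumptions made for this example. The source does not specify how the analysis was actually done.

```python
import bisect
from collections import defaultdict


def adjacent_distances(variants):
    """Distances between consecutive variants on the same chromosome.

    `variants` is an iterable of (chrom, pos) pairs, e.g. taken from the
    CHROM and POS columns of a VCF file.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)

    distances = []
    for positions in by_chrom.values():
        positions.sort()
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


def bin_distances(distances, edges=(10, 100, 1000)):
    """Count distances per bin: [<10, 10-99, 100-999, >=1000] for the default edges."""
    counts = [0] * (len(edges) + 1)
    for d in distances:
        counts[bisect.bisect_right(edges, d)] += 1
    return counts


# Hypothetical variant positions (not real data).
toy_variants = [("chr1", 100), ("chr1", 105), ("chr1", 250),
                ("chr2", 10), ("chr2", 2010)]
toy_distances = adjacent_distances(toy_variants)
toy_counts = bin_distances(toy_distances)
```

Distances are never computed across chromosome boundaries, so the first variant of each chromosome contributes no distance.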

Implementation details

NIST-Genome Reference EXploration (NIST-GREX) slides