Merge pull request #347 from martinluttap/main
Modified description of project uga genomicsWF
Showing 2 changed files with 11 additions and 10 deletions.
@@ -1,28 +1,29 @@
 ---
-title: Reproducible Performance Benchmark for Genomics Workflows on HPC Cluster
+title: Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster
 authors: [inkee.kim]
 author_notes: ["Assistant Professor of School of Computing, University of Georgia"]
 tags: [osre24, reproducibility, genomics, High Performance Computing (HPC), Performance Modeling, Data Analysis, Scientific Workflows]
 date: 2024-02-02
-lastmod: 2024-02-02
+lastmod: 2024-02-05
 ---

 **Project Idea description**

-A thorough understanding of workload is a crucial component in designing novel and high-performing scheduling systems. In the context of genomics workflow, the workload often consists of large input files (e.g., tens to hundreds of GB per file) that are processed by a diverse set of applications. Each application has its own resource utilization characteristics, ranging from I/O bound, memory-bound, compute-bound, up to a combination of them. Thus, it is crucial to accurately measure and document these characteristics in order to build a high-performance scheduler.
+A thorough understanding of the workload is a crucial component in designing high-throughput systems. In the context of genomics workflows, the workloads are often composed of large input files (e.g., tens to hundreds of GB per file) that are processed by a diverse set of applications. Each application has its own resource utilization characteristics, ranging from I/O-bound to memory-bound to compute-bound, or a combination thereof. It is therefore crucial to accurately measure and document these characteristics in order to leverage them for a particular target system.

-Such measurement effort is commonly done using benchmarking tools. However, many existing benchmarks for genomics applications are neither comprehensive nor scalable. Many benchmarks only support a subset of the resources that we need to measure (e.g., only compute and memory, but not I/O). Moreover, we found no benchmarks designed for many concurrent and multi-node executions, even though it is the common setup for production systems. The capability is important because the systems community has long known that a complex system often exhibits unexpected behavior at scale. Benchmarking tools that do not support scalable execution risk providing inaccurate measurement when compared to production systems.
+Such a measurement effort is commonly conducted through benchmarking. However, many existing studies that benchmark genomics workflows tend to be either not comprehensive or outdated, especially given the rapid innovation in the bioinformatics field. Often, these studies measure only a subset of the resources of interest (e.g., only compute and memory, but not I/O). Moreover, we found surprisingly few such studies in recent years, despite the various software and hardware advancements made in that time.

-We aim to bridge this gap in understanding genomic workflows by conducting a comprehensive characterization of a broad set of genomics applications. Students will have the opportunity to learn genomics data processing using state-of-the-art applications, workflows, and real-world data. They will collect and package datasets for I/O, memory, and compute utilization using industry-standard tools and best practices. Students will also explore to investigate the correlation of input data size/quality to application execution time. Subsequently, they will analyze the dataset and create a performance model for each application. We will make the generated dataset & analysis publicly accessible, given adequate quality. All experiments will also be conducted in a reproducible manner (e.g., as a Trovi package or Chameleon VM images), and all code will be open-sourced (e.g., shared on a public Github repo).

+In this project, we aim to build a comprehensive and scalable benchmarking tool for genomics workflows on an HPC cluster. Students will learn genomics data processing using state-of-the-art applications and workflows on real-world data. They will also build and package a resource monitoring system for I/O, memory, and compute utilization using industry-grade tools and best practices. Following that, students will analyze the resource utilization of various applications and create performance models under different resource allocations and degrees of colocation. Along the way, students can produce an open-source dataset, which, given sufficient quality, we plan to release to the public. All experiments will be done in a reproducible manner (e.g., as a Trovi package or Chameleon VM image), and all code will be made open source (e.g., shared in a public GitHub repo).

 **Project Deliverable**

-A Github repository and/or Chameleon VM image containing source code for application executions & resource monitoring systems.
-Jupyter notebooks and/or Trovi artifacts containing analysis & mathematical models of application resource utilization.
+A GitHub repository and/or Chameleon VM image containing source code for application executions & metrics collection.
+Jupyter notebooks and/or Trovi artifacts containing analysis and mathematical models for application resource utilization & the effects of data quality.


 - **Topics:** High Performance Computing (HPC), Performance Modeling, Data Analysis, Scientific Workflows
-- **Skills:** Linux, Python, Bash Scripting, Cloud Computing, Data Science Toolkit (e.g. Numpy, Pandas, SciPy)
-- **Difficulty:** Medium
-- **Size:** 350 hours
+- **Skills:** Linux, Python, Bash Scripting, Cloud Computing, Machine Learning, Data Science Toolkit (e.g., NumPy, Pandas, SciPy)
+- **Difficulty:** Difficult
+- **Size:** Large (350 + 175) hours
 - **Mentor(s):** {{% mention inkee.kim %}}
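
The revised description above proposes building a resource monitoring system that captures I/O, memory, and compute utilization for each application. As a rough, non-authoritative sketch of what such per-process sampling could look like, the snippet below uses the `psutil` library; the PID, sampling interval, and output file are illustrative assumptions rather than anything specified by the project.

```python
# Minimal sketch (assumes psutil is installed): periodically sample CPU, memory,
# and cumulative I/O counters for one running process, e.g. a single workflow task.
# The PID, interval, and output path are placeholders, not project requirements.
import csv
import time

import psutil


def sample_process(pid: int, interval_s: float = 1.0, out_path: str = "utilization.csv") -> None:
    """Write one CSV row of CPU %, RSS bytes, and cumulative read/write bytes per interval."""
    proc = psutil.Process(pid)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_percent", "rss_bytes", "read_bytes", "write_bytes"])
        while True:
            try:
                cpu = proc.cpu_percent(interval=interval_s)  # blocks for interval_s
                mem = proc.memory_info().rss
                io = proc.io_counters()
            except psutil.NoSuchProcess:
                break  # the monitored application has exited
            writer.writerow([time.time(), cpu, mem, io.read_bytes, io.write_bytes])
```

A production setup would more likely rely on established, industry-grade collectors running cluster-wide, but a small sampler like this illustrates the kind of per-application time series the project would collect and package.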
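The description also calls for correlating input data size with execution time and fitting per-application performance models. Below is a minimal sketch of one possible modeling step, assuming (input size, runtime) measurements have already been collected; the linear model form and the numbers are placeholders for illustration, not results from the project.

```python
# Illustrative only: fit runtime(size) = a * size + b with SciPy and use it to
# extrapolate. The data points are made-up placeholders, not measurements; a real
# model may need more features (threads, memory, colocation degree) or a
# non-linear form.
import numpy as np
from scipy.optimize import curve_fit


def runtime_model(size_gb, a, b):
    return a * size_gb + b


sizes_gb = np.array([10.0, 25.0, 50.0, 100.0])          # placeholder input sizes
runtimes_s = np.array([620.0, 1500.0, 2950.0, 6100.0])  # placeholder runtimes

(a, b), _ = curve_fit(runtime_model, sizes_gb, runtimes_s)
print(f"Predicted runtime for a 75 GB input: {runtime_model(75.0, a, b):.0f} s")
```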