byungsoo-oh/ml-systems-papers

Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems

NOTE: Survey papers are annotated with the [Survey 🔍] prefix.

Table of Contents

1. Data Processing

1.1 Data pipeline optimization

1.1.1 General

1.1.2 Prep stalls

1.1.3 Fetch stalls (I/O)

1.1.4 Specific workloads (GNN, DLRM)

1.2 Caching and Distributed storage for ML training

1.3 Data formats

  • [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
  • [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data

1.4 Data pipeline fairness and correctness

  • [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

1.5 Data labeling automation

  • [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

2. Training System

2.1 Empirical study on ML Jobs

  • [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
  • [NSDI'24] Characterization of Large Language Model Development in the Datacenter
  • [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
  • [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)

2.2 DNN job scheduling

2.3 GPU sharing

2.4 GPU memory management and optimization

2.5 GPU memory usage estimate

  • [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models

2.6 Distributed training (Parallelism)

2.7 DL job failures / Fault tolerance (resilient training)

2.8 AutoML

  • [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
  • [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
  • [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

2.9 Communication optimization & Network Infrastructure for ML

2.10 DNN compiler

2.11 Model pruning and compression

2.12 GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.

2.13 Congestion control for DNN training

2.14 Others

3. Inference System

4. Mixture of Experts (MoE)

This section lists papers on MoE training and inference (collected from Sections 2.6 and 3).

5. LLM Long Context

6. Federated Learning

7. Privacy-Preserving ML

8. ML APIs & Application-side Optimization

9. ML (LLM) for Systems

10. GPU kernel scheduling

11. Energy efficiency for LLM (carbon-aware)

Others

References

This repository is motivated by:
