Tags: microsoft/superbenchmark

v0.11.0

Docs - Upgrade version and release note (#650)

**Description**

Upgrade version and release note.

**Major Revision**
- Upgrade package versions
- Add release note for v0.11.0

v0.10.0

Release SuperBench v0.10.0

SuperBench v0.10.0 Release Notes
================================

SuperBench Improvements
-----------------------

- Support monitoring for AMD GPUs.
- Add Dockerfiles for ROCm 5.7 and ROCm 6.0.
- Add MSCCL support for NVIDIA GPUs.
- Fix NUMA domains swap issue in NDv4 topology file.
- Add NDv5 topology file.
- Pin NCCL and nccl-tests to 2.18.3 to work around a hang in CUDA 12.2.

Micro-benchmark Improvements
----------------------------

- Add HPL random generator to gemm-flops with ROCm.
- Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
- Add HWDecoderFPS benchmark to measure hardware decoder performance in FPS.
- Update Docker image for H100 support.
- Update MLC version to 3.10 in the CUDA/ROCm Dockerfiles.
- Bug fix for GPU Burn test.
- Support INT8 in cublaslt function.
- Add hipBLASLt function benchmark.
- Support cpu-gpu and gpu-cpu in ib-validation.
- Support graph mode in NCCL/RCCL benchmarks for latency metrics.
- Support C++ implementation in distributed inference benchmark.
- Add O2 option for gpu copy ROCm build.
- Support different hipBLASLt data types in distributed inference benchmark.
- Support in-place in NCCL/RCCL benchmark.
- Support data type option in NCCL/RCCL benchmark.
- Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
- Update hipBLASLt GEMM metric unit to TFLOPS.
- Support FP8 in hipBLASLt benchmark.

Model Benchmark Improvements
----------------------------

- Change torch.distributed.launch to torchrun.
- Support Megatron-LM/Megatron-DeepSpeed GPT pretraining benchmark.
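
One practical consequence of moving from torch.distributed.launch to torchrun is how a worker learns its local rank: torchrun exports it in the `LOCAL_RANK` environment variable, while the older launcher passed a `--local_rank` argument by default. The sketch below shows one way a script could accept both. It is not taken from SuperBench; `resolve_local_rank` is a hypothetical helper written for illustration.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank of this worker process.

    Hypothetical helper, not part of SuperBench. torchrun exports LOCAL_RANK,
    while the older torch.distributed.launch passed --local_rank by default.
    """
    environ = os.environ if environ is None else environ

    # Preferred path: the environment variable set by torchrun.
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])

    # Fallback path: the command-line argument used by the older launcher.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```

Passing `argv` and `environ` explicitly keeps the helper easy to test; in a real script both would normally be left at their defaults.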

Result Analysis
---------------

- Support baseline generation from multiple nodes.

v0.9.0

Docs - Upgrade version and release note (#557)

**Description**

Upgrade version and release note.

**Major Revision**
- Upgrade package versions
- Add release note for v0.9.0

v0.8.0

Release SuperBench v0.8.0

SuperBench v0.8.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support SuperBench Executor running on Windows.
- Remove pinned RCCL version in the ROCm 5.1.x Dockerfile.
- Upgrade networkx version to fix installation compatibility issue.
- Pin setuptools version to v65.7.0.
- Limit ansible_runner version for Python 3.6.
- Support cgroup v2 when reading system metrics in the monitor.
- Fix analyzer bug in Python 3.8 caused by a pandas API change.
- Collect real-time GPU power in the monitor.
- Remove unreachable condition when writing the host list in MPI mode.
- Upgrade Docker image to CUDA 12.1, NCCL 2.17.1-1, HPC-X v2.14, and MLC
  3.10.
- Fix wrong unit of cpu-memory-bw-latency in the documentation.

Micro-benchmark Improvements
----------------------------

- Add STREAM benchmark for sustainable memory bandwidth and the
  corresponding computation rate.
- Add HPL benchmark (High-Performance Linpack).
- Support flexible warmup and non-random data initialization in
  cublas-benchmark.
- Support error tolerance in micro-benchmark for cuDNN functions.
- Add distributed inference benchmark.
- Support tensor core precisions (e.g., FP8) and batch/shape range in
  cublaslt gemm.

Model Benchmark Improvements
----------------------------

- Fix torch.dist init issue with multiple models.
- Support TE FP8 in BERT/GPT2 model.
- Add num_workers configurations in model benchmark.

v0.7.0

Release SuperBench v0.7.0

SuperBench v0.7.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support non-zero return code when "sb deploy" or "sb run" fails in
  Ansible.
- Support log flushing to the result file during runtime.
- Update version to include revision hash and date.
- Support "pattern" in mpi mode to run tasks in parallel.
- Support topo-aware, all-pair, and K-batch pattern in mpi mode.
- Pin Transformers version to avoid TensorRT failure.
- Add CUDA 11.8 Docker image for NVIDIA arch90 GPUs.
- Support "sb deploy" without pulling image.

Micro-benchmark Improvements
----------------------------

- Support list of custom config string in cudnn-functions and
  cublas-functions.
- Support correctness check in cublas-functions.
- Support GEMM-FLOPS for NVIDIA arch90 GPUs.
- Support cuBLASLt FP16 and FP8 GEMM.
- Add wait time option to resolve mem-bw instability.
- Fix incorrect data type detection in cublas-function source code.

Model Benchmark Improvements
----------------------------

- Support FP8 in BERT model training.

Distributed Benchmark Improvements
----------------------------------

- Support pair-wise pattern in IB validation benchmark.
- Support topo-aware, pair-wise, and K-batch pattern in nccl-bw
  benchmark.

v0.6.0

Release SuperBench v0.6.0

SuperBench v0.6.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support running on host directly without Docker.
- Support running the `sb` command inside the Docker image.
- Support ROCm 5.1.1.
- Support ROCm 5.1.3.
- Fix bugs in data diagnosis.
- Fix cmake and build issues.
- Support automatic configuration yaml selection on Azure VM.
- Refine error message when GPU is not detected.
- Add return code for Timeout.
- Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
- Support node_num=1 in mpi mode.
- Update Python setup for required packages.
- Enhance parameter parsing to allow spaces in values.
- Support NO_COLOR for SuperBench output.
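
For readers unfamiliar with the NO_COLOR convention mentioned above: when the `NO_COLOR` environment variable is present and non-empty, a program should not add ANSI color codes to its output. The following is a minimal sketch of that convention, not SuperBench's actual implementation; `color_enabled` and `colorize` are hypothetical names.

```python
import os

RED = "\033[31m"   # ANSI escape sequence for red foreground
RESET = "\033[0m"  # ANSI escape sequence that resets all attributes


def color_enabled(environ=None):
    """Return False when NO_COLOR is set to a non-empty value."""
    environ = os.environ if environ is None else environ
    return not environ.get("NO_COLOR")


def colorize(text, code=RED, environ=None):
    """Wrap text in an ANSI color code unless NO_COLOR disables it."""
    if not color_enabled(environ):
        return text
    return f"{code}{text}{RESET}"
```

With `NO_COLOR=1` exported in the shell, `colorize("failed")` returns the plain string, which keeps redirected logs free of escape sequences.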

Micro-benchmark Improvements
----------------------------

- Fix issues in ib loopback benchmark.
- Fix stability issue in ib loopback benchmark.

Distributed Benchmark Improvements
----------------------------------

- Enhance pair-wise IB benchmark.
- Fix bug in IB benchmark.
- Support topology-aware IB benchmark.

Data Diagnosis and Analysis
---------------------------

- Add failure check function in data_diagnosis.py.
- Support JSON and JSONL in Diagnosis.
- Add support to store values of metrics in data diagnosis.
- Support exit code for `sb result diagnosis`.
- Format int type and unify empty value to N/A in diagnosis output
  files.
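
The JSON/JSONL and empty-value items above can be illustrated with a small sketch. It assumes results arrive as text in one of the two formats and that "empty" means `None` or an empty string; the real SuperBench file layout and formatting rules may differ, and `load_records` and `format_cell` are hypothetical helpers.

```python
import json


def load_records(text, fmt):
    """Parse result records from JSON or JSONL text (illustrative only)."""
    if fmt == "json":
        data = json.loads(text)
        # Accept either a single object or a list of objects.
        return data if isinstance(data, list) else [data]
    if fmt == "jsonl":
        # JSONL holds one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")


def format_cell(value):
    """Render a value for output: empty values as N/A, whole floats as ints."""
    if value is None or value == "":
        return "N/A"
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)
```

Normalizing values at output time, rather than when loading, keeps the parsed records unchanged for any later numeric comparison.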

v0.6.0-rc1

Pre-release v0.6.0-rc1

v0.5.0

Release SuperBench v0.5.0

SuperBench v0.5.0 Release Notes
===============================

Micro-benchmark Improvements
----------------------------

- Support NIC-only NCCL bandwidth benchmark on a single node in NCCL/RCCL
  bandwidth test.
- Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
- Support data checking in GPU copy bandwidth test.
- Update rccl-tests submodule to fix divide-by-zero error.
- Add GPU-Burn micro-benchmark.

Model-benchmark Improvements
----------------------------

- Sync results on root rank for e2e model benchmarks in distributed mode.
- Support customized `env` in local and torch.distributed mode.
- Add support for pytorch>=1.9.0.
- Keep BatchNorm in FP32 for PyTorch CNN models cast to FP16.
- Remove the type conversion time for FP16 samples.
- Support FAMBench.

Inference Benchmark Improvements
--------------------------------

- Revise the default setting for inference benchmark.
- Add percentile metrics for inference benchmarks.
- Support T4 and A10 in GEMM benchmark.
- Add configuration with inference benchmark.
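
As background for the percentile metrics item above, tail latency is commonly summarized with percentiles such as p50, p90, and p99. The sketch below uses the nearest-rank method, which is one of several definitions and is an assumption here; the release note does not say which method SuperBench uses, and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a list of measurements."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest 1-based rank covering pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-request latencies in milliseconds.
latencies_ms = [12.0, 10.0, 11.0, 50.0, 13.0, 10.5, 11.5, 12.5, 14.0, 10.2]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
```

Nearest-rank always returns an observed sample, so a single slow request shows up directly in the high percentiles instead of being interpolated away.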

Other Improvements
------------------

- Add command to support listing all optional parameters for benchmarks.
- Unify benchmark naming convention and support multiple tests with same
  benchmark and different parameters/options in one configuration file.
- Support timeout to detect the benchmark failure and stop the process
  automatically.
- Add ROCm 5.0 Dockerfile.
- Improve output interface.

Data Diagnosis and Analysis
---------------------------

- Support multi-benchmark check.
- Support result summary in Markdown, HTML, and Excel formats.
- Support data diagnosis in Markdown and HTML formats.
- Support result output for all nodes in data diagnosis.

v0.5.0-rc1

Pre-release v0.5.0-rc1

v0.4.0

Release SuperBench v0.4.0

SuperBench v0.4.0 Release Notes
===============================

SuperBench Framework
--------------------

__Monitor__

- Add monitor framework for NVIDIA GPU, CPU, memory and disk.

__Data Diagnosis and Analysis__

- Support baseline-based data diagnosis.
- Support basic analysis feature (boxplot figure, outlier detection,
  etc.).

Single-node Validation
----------------------

__Micro Benchmarks__

- CPU Memory Validation (tool: Intel Memory Latency Checker).
- GPU Copy Bandwidth (tool: built by MSRA).
- Add ORT Model on AMD GPU platform.
- Add inference backend TensorRT.
- Add inference backend ORT.

Multi-node Validation
---------------------

__Micro Benchmarks__

- IB networking validation.
- TCP validation (tool: TCPing).
- GPCNet validation (tool: GPCNet).

Other Improvements
------------------

1. Enhancements
   - Add pipeline for AMD Docker image.
   - Integrate system config info script with SuperBench.
   - Support FP32 mode without TF32.
   - Refine unit tests for micro-benchmarks.
   - Unify metric names for all benchmarks.

2. Documentation
   - Add benchmark list.
   - Add monitor document.
   - Add data diagnosis document.