Tags: microsoft/superbenchmark
Release SuperBench v0.10.0

SuperBench 0.10.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support monitoring for AMD GPUs.
- Support ROCm 5.7 and ROCm 6.0 dockerfile.
- Add MSCCL support for Nvidia GPU.
- Fix NUMA domains swap issue in NDv4 topology file.
- Add NDv5 topo file.
- Fix NCCL and NCCL-test to 2.18.3 for hang issue in CUDA 12.2.

Micro-benchmark Improvements
----------------------------

- Add HPL random generator to gemm-flops with ROCm.
- Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
- Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
- Update Docker image for H100 support.
- Update MLC version to 3.10 for CUDA/ROCm dockerfile.
- Bug fix for GPU Burn test.
- Support INT8 in cublaslt function.
- Add hipBLASLt function benchmark.
- Support cpu-gpu and gpu-cpu in ib-validation.
- Support graph mode in NCCL/RCCL benchmarks for latency metrics.
- Support cpp implementation in distributed inference benchmark.
- Add O2 option for gpu copy ROCm build.
- Support different hipblasLt data types in dist inference.
- Support in-place in NCCL/RCCL benchmark.
- Support data type option in NCCL/RCCL benchmark.
- Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
- Update hipblaslt GEMM metric unit to tflops.
- Support FP8 for hipblaslt benchmark.

Model Benchmark Improvements
----------------------------

- Change torch.distributed.launch to torchrun.
- Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.

Result Analysis
---------------

- Support baseline generation from multiple nodes.

Release SuperBench v0.8.0

SuperBench v0.8.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support SuperBench Executor running on Windows.
- Remove fixed rccl version in rocm5.1.x docker file.
- Upgrade networkx version to fix installation compatibility issue.
- Pin setuptools version to v65.7.0.
- Limit ansible_runner version for Python 3.6.
- Support cgroup V2 when reading system metrics in monitor.
- Fix analyzer bug in Python 3.8 due to pandas api change.
- Collect real-time GPU power in monitor.
- Remove unreachable condition when writing the host list in mpi mode.
- Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
- Fix wrong unit of cpu-memory-bw-latency in document.

Micro-benchmark Improvements
----------------------------

- Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
- Add HPL Benchmark for HPC Linpack Benchmark.
- Support flexible warmup and non-random data initialization in cublas-benchmark.
- Support error tolerance in micro-benchmark for CuDNN function.
- Add distributed inference benchmark.
- Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.

Model Benchmark Improvements
----------------------------

- Fix torch.dist init issue with multiple models.
- Support TE FP8 in BERT/GPT2 model.
- Add num_workers configurations in model benchmark.

Release SuperBench v0.7.0

SuperBench v0.7.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
- Support log flushing to the result file during runtime.
- Update version to include revision hash and date.
- Support "pattern" in mpi mode to run tasks in parallel.
- Support topo-aware, all-pair, and K-batch pattern in mpi mode.
- Fix Transformers version to avoid Tensorrt failure.
- Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
- Support "sb deploy" without pulling image.

Micro-benchmark Improvements
----------------------------

- Support list of custom config string in cudnn-functions and cublas-functions.
- Support correctness check in cublas-functions.
- Support GEMM-FLOPS for NVIDIA arch90 GPUs.
- Support cuBLASLt FP16 and FP8 GEMM.
- Add wait time option to resolve mem-bw unstable issue.
- Fix bug for incorrect datatype judgement in cublas-function source code.

Model Benchmark Improvements
----------------------------

- Support FP8 in BERT model training.

Distributed Benchmark Improvements
----------------------------------

- Support pair-wise pattern in IB validation benchmark.
- Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark.

Release SuperBench v0.6.0

SuperBench v0.6.0 Release Notes
===============================

SuperBench Improvement
----------------------

- Support running on host directly without Docker.
- Support running `sb` command inside docker image.
- Support ROCm 5.1.1.
- Support ROCm 5.1.3.
- Fix bugs in data diagnosis.
- Fix cmake and build issues.
- Support automatic configuration yaml selection on Azure VM.
- Refine error message when GPU is not detected.
- Add return code for Timeout.
- Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
- Support node_num=1 in mpi mode.
- Update Python setup for required packages.
- Enhance parameter parsing to allow spaces in value.
- Support NO_COLOR for SuperBench output.

Micro-benchmark Improvements
----------------------------

- Fix issues in ib loopback benchmark.
- Fix stability issue in ib loopback benchmark.

Distributed Benchmark Improvements
----------------------------------

- Enhance pair-wise IB benchmark.
- Bug Fix in IB benchmark.
- Support topology-aware IB benchmark.

Data Diagnosis and Analysis
---------------------------

- Add failure check function in data_diagnosis.py.
- Support JSON and JSONL in Diagnosis.
- Add support to store values of metrics in data diagnosis.
- Support exit code of sb result diagnosis.
- Format int type and unify empty value to N/A in diagnosis output files.

Release SuperBench v0.5.0

SuperBench v0.5.0 Release Notes
===============================

Micro-benchmark Improvements
----------------------------

- Support NIC only NCCL bandwidth benchmark on single node in NCCL/RCCL bandwidth test.
- Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
- Support data checking in GPU copy bandwidth test.
- Update rccl-tests submodule to fix divide by zero error.
- Add GPU-Burn micro-benchmark.

Model-benchmark Improvements
----------------------------

- Sync results on root rank for e2e model benchmarks in distributed mode.
- Support customized `env` in local and torch.distributed mode.
- Add support for pytorch>=1.9.0.
- Keep BatchNorm as fp32 for pytorch cnn models cast to fp16.
- Remove FP16 samples type converting time.
- Support FAMBench.

Inference Benchmark Improvements
--------------------------------

- Revise the default setting for inference benchmark.
- Add percentile metrics for inference benchmarks.
- Support T4 and A10 in GEMM benchmark.
- Add configuration with inference benchmark.

Other Improvements
------------------

- Add command to support listing all optional parameters for benchmarks.
- Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
- Support timeout to detect the benchmark failure and stop the process automatically.
- Add rocm5.0 dockerfile.
- Improve output interface.

Data Diagnosis and Analysis
---------------------------

- Support multi-benchmark check.
- Support result summary in md, html and excel formats.
- Support data diagnosis in md and html formats.
- Support result output for all nodes in data diagnosis.

Release SuperBench v0.4.0

SuperBench v0.4.0 Release Notes
===============================

SuperBench Framework
--------------------

__Monitor__

- Add monitor framework for NVIDIA GPU, CPU, memory and disk.

__Data Diagnosis and Analysis__

- Support baseline-based data diagnosis.
- Support basic analysis feature (boxplot figure, outlier detection, etc.).

Single-node Validation
----------------------

__Micro Benchmarks__

- CPU Memory Validation (tool: Intel Memory Latency Checker).
- GPU Copy Bandwidth (tool: built by MSRA).
- Add ORT Model on AMD GPU platform.
- Add inference backend TensorRT.
- Add inference backend ORT.

Multi-node Validation
---------------------

__Micro Benchmarks__

- IB Networking validation.
- TCP validation (tool: TCPing).
- GPCNet Validation (tool: GPCNet).

Other Improvement
-----------------

1. Enhancement
   - Add pipeline for AMD docker.
   - Integrate system config info script with SuperBench.
   - Support FP32 mode without TF32.
   - Refine unit test for microbenchmark.
   - Unify metric names for all benchmarks.

2. Document
   - Add benchmark list.
   - Add monitor document.
   - Add data diagnosis document.