We used NVIDIA H100 GPU systems on the JLSE testbeds at ALCF.
```bash
module load cuda/12.3.0
source ~/.init_conda_x86.sh
conda create -n vllm python=3.10
conda activate vllm
pip install vllm
```
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks/
python benchmark_latency.py --batch-size=32 --tensor-parallel-size=1 --input-len=32 --output-len=32 --model="meta-llama/Llama-2-7b-hf" --dtype="float16" --trust-remote-code
```
- Use the `benchmark_throughput.py` script provided here to collect throughput measurements in the respective csv file; a sample invocation is sketched below.
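A possible way to launch it, assuming the script accepts the same model and length flags as the latency benchmark above (flag names such as `--num-prompts` are assumptions and may differ between vLLM releases, so check `python benchmark_throughput.py --help`):

```bash
# Hypothetical throughput invocation; mirrors the latency command above.
# --num-prompts (assumed flag) controls how many requests are submitted.
python benchmark_throughput.py \
    --model="meta-llama/Llama-2-7b-hf" \
    --input-len=32 --output-len=32 \
    --num-prompts=32 \
    --dtype="float16" --trust-remote-code
```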
- Use the `benchmark_power.py` script provided here to collect power measurements in the respective csv file. You will need the `power_utils.py` file, which does the power metric collection, in the same directory as `benchmark_power.py`.
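`power_utils.py` itself is not reproduced here; as a rough illustration of the same idea, GPU board power can be polled with `nvidia-smi` in the background while the benchmark runs:

```bash
# Illustration only: the repo's power_utils.py may sample differently.
# Poll GPU power draw every 500 ms into a csv file in the background.
nvidia-smi --query-gpu=timestamp,power.draw --format=csv --loop-ms=500 > power.csv &
SMI_PID=$!
python benchmark_power.py   # the measured workload (script from this repo)
kill $SMI_PID
```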
- Use the provided shell script `run-throughput.sh` in this directory to run `benchmark_throughput.py` for various configurations of input/output lengths and batch sizes (a sketch of such a sweep is shown after the command).

```bash
source run-throughput-bench.sh
```
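The sweep script itself is not shown here; a minimal sketch of such a driver, with illustrative value ranges and the assumed `--num-prompts` flag from above, might look like:

```bash
# Sketch in the spirit of run-throughput-bench.sh; the actual script may differ.
# Sweep batch sizes and input/output lengths, one benchmark run per combination.
for bs in 1 8 32; do
  for ilen in 32 128 512; do
    for olen in 32 128 512; do
      python benchmark_throughput.py \
          --model="meta-llama/Llama-2-7b-hf" --dtype="float16" --trust-remote-code \
          --input-len=${ilen} --output-len=${olen} --num-prompts=${bs}
    done
  done
done
```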
- Use the provided shell script `run-power.sh` in this directory to run `benchmark_power.py` for various configurations of input/output lengths and batch sizes; its sweep structure mirrors the throughput sketch above.

```bash
source run-power-bench.sh
```