Questions about the hardware specs for fiddler #7

fangyu29 · 2024-05-09T02:54:50Z

Copying the roughly 300 MB of weights for one expert of mixtral-8x7b from CPU to GPU takes 50 ms, which implies a PCIe bandwidth of only 0.3 GB / 50 ms = 6 GB/s. That is much slower than the PCIe bandwidth reported for the L4 GPU (PCIe Gen4 x16, 64 GB/s) at https://www.nvidia.com/en-us/data-center/l4/ . Is there an explanation for this? Thanks.
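For reference, the arithmetic above can be written out as a small script. This is only a sketch: the function name is made up for illustration, and the per-direction figure is my own estimate from the PCIe Gen4 signalling rate (16 GT/s per lane, 128b/130b encoding), on the assumption that the 64 GB/s quoted on the product page counts both directions.

```python
def effective_bandwidth_gb_s(size_gb: float, time_ms: float) -> float:
    """Effective transfer bandwidth in GB/s for size_gb moved in time_ms."""
    return size_gb / (time_ms / 1e3)


# Figures from the question: ~0.3 GB copied in 50 ms.
observed = effective_bandwidth_gb_s(0.3, 50.0)

# Assumption: PCIe Gen4 runs at 16 GT/s per lane with 128b/130b encoding,
# so a x16 link carries about 31.5 GB/s in ONE direction before protocol
# overhead. A 64 GB/s spec would then be the sum of both directions.
GT_PER_S_PER_LANE = 16.0
LANES = 16
per_direction_gb_s = GT_PER_S_PER_LANE * LANES * (128 / 130) / 8

print(f"observed:           {observed:.1f} GB/s")
print(f"one-way link limit: {per_direction_gb_s:.1f} GB/s")
print(f"fraction of limit:  {observed / per_direction_gb_s:.0%}")
```

Under that assumption, the fair comparison point for a host-to-device copy is about 31.5 GB/s rather than 64 GB/s, but 6 GB/s is still well short of it.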

fangyu29 · 2024-05-09T04:26:46Z

I suspect the cause is that the load_state_dict call in fiddler/benchmarks/microbench.py (line 27 in 227715b) is slow:

expert_placeholder.load_state_dict(

I have implemented another version of the CPU-to-GPU weight copy using `Tensor.copy_` and `pin_memory` in https://github.com/dingfangyu/fiddler/blob/a6a09cca2c0e95dbcdd39a6e9296e890fd56d4cd/benchmarks/microbench.py#L39
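The general shape of a pinned-memory copy benchmark is roughly the following. This is a minimal sketch, not the code at the link above: `time_h2d_copy` and its parameters are hypothetical names, it copies a flat byte buffer rather than real expert weights, and it falls back to a plain CPU-to-CPU copy when no CUDA device is present so that it still runs.

```python
import time

import torch


def time_h2d_copy(num_bytes: int, iters: int = 10) -> float:
    """Return the mean time in ms to copy num_bytes from host to device."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    # Source buffer on the host. Pinning (page-locking) it lets the GPU
    # DMA engine read it directly instead of going through a staging copy.
    src = torch.empty(num_bytes, dtype=torch.uint8)
    if use_cuda:
        src = src.pin_memory()

    # Preallocate the destination once so allocation is not timed.
    dst = torch.empty(num_bytes, dtype=torch.uint8, device=device)

    # Warm-up copy, excluded from the measurement.
    dst.copy_(src, non_blocking=True)
    if use_cuda:
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    if use_cuda:
        # Wait for all queued copies to finish before stopping the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3
```

Calling `time_h2d_copy(352_321_536)` would time a buffer about the size of one bf16 expert, assuming three 4096 x 14336 weight matrices. The two points that matter are the pinned source buffer and the preallocated destination; going through `load_state_dict` may add per-tensor overhead on top of the raw transfer.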

On my 4090 GPU, the profiling results are as follows:

1) Weight copy, CPU -> GPU
mean: 13.30 ms, std: 0.01 ms

5) Execution, GPU batch=1
mean: 0.59 ms, std: 0.14 ms

6) Execution, CPU batch=1
mean: 11.41 ms, std: 0.57 ms

The resulting CPU-to-GPU bandwidth is 0.328 GB / 0.0133 s = 24.66 GB/s.
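As a cross-check on the 0.328 figure, the size of one expert can be derived from the model dimensions. The sketch below assumes the benchmark copies one bf16 Mixtral-8x7B expert made of three 4096 x 14336 weight matrices; those dimensions are my assumption about what the benchmark moves, not something read from its output.

```python
# Assumed shape of one Mixtral-8x7B expert (not taken from the benchmark).
HIDDEN_SIZE = 4096
FFN_SIZE = 14336
NUM_MATRICES = 3       # gate, up and down projections
BYTES_PER_PARAM = 2    # bf16

expert_bytes = NUM_MATRICES * HIDDEN_SIZE * FFN_SIZE * BYTES_PER_PARAM
size_gib = expert_bytes / 2**30   # binary gigabytes
size_gb = expert_bytes / 1e9      # decimal gigabytes

copy_time_s = 13.30e-3            # measured mean from the results above

print(f"expert size: {expert_bytes} bytes = {size_gib:.3f} GiB = {size_gb:.3f} GB")
print(f"bandwidth:   {size_gib / copy_time_s:.2f} GiB/s")
print(f"bandwidth:   {size_gb / copy_time_s:.2f} GB/s")
```

If those dimensions are right, 0.328 is a size in GiB, so the measured rate is about 24.7 GiB/s, or about 26.5 GB/s in the decimal units that PCIe specs are usually quoted in.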
