-
Hey there, I had some quick questions about the FP8 integration: what kind of memory/performance improvements should we expect compared to BF16? I know FP8 has two formats, E4M3 and E5M2; is there additional overhead for switching between the two? Thanks!
Replies: 3 comments
-
The peak FLOPs for FP8 operations is 2x that of BF16, and for (especially larger) compute-bound operations like GEMMs you should see a speedup close to that. Full training of a language model has many more operations (like the optimizer and GPU-to-GPU communication), as well as some overhead from casting between sections that need BF16 and FP8, which limits the end-to-end speedup. On H100 with NeMo, for models with >1B parameters, we saw a 30-40% end-to-end speedup. We are working on more optimizations to reduce the time spent in those pieces that are not present when running in BF16 (see e.g. #118, which hides some of the FP8-specific communication) in order to increase that relative speedup.

As for your second question: you would use the two formats in different places in your network, and I do not think you would ever want to cast between the two (which would require a pass through memory). If your question is about the GEMM itself (as in, is there a performance difference when performing a GEMM with E4M3 vs. E5M2), then no, there is no performance impact from using one format or the other.
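To make the "different places" point concrete, here is a minimal sketch of how the formats are usually selected through the recipe in Transformer Engine's PyTorch API rather than by any manual casting. Treat it as illustrative: it assumes the `DelayedScaling` recipe with `Format.HYBRID`, which uses E4M3 for forward-pass tensors and E5M2 for backward-pass gradients, and the layer sizes are arbitrary.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID: E4M3 for forward-pass tensors (weights/activations),
# E5M2 for backward-pass tensors (gradients need more dynamic range).
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
inp = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# The FP8 casting happens inside the autocast context; the user never
# converts between E4M3 and E5M2 explicitly.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

out.sum().backward()
```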
-
Thank you @ptrendx - what about memory usage? Should we expect a significant gain in GPU RAM that would allow us to run twice the batch size in one forward pass?
-
The activations that are in FP8 are 2x smaller, yes, although how much that impacts the end-to-end memory usage depends on the network. In the large language model case it is typically the weights and the optimizer state that take most of the memory. As for the weights, the answer is more complicated and really depends on the usage.
TLDR is that it is complicated and highly use-case-dependent, so the safer bet is not to rely on gains on the memory front for now.
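To make the "it depends" concrete, here is a rough back-of-envelope sketch. It is purely illustrative: the 10 GB activation figure and the "half of the activation tensors are in FP8" split are made-up assumptions, and the per-parameter byte counts assume BF16 weights/grads with FP32 Adam state.

```python
# Back-of-envelope memory terms for mixed-precision Adam training.
# Illustrative numbers only; real activation memory depends heavily on
# the architecture, sequence length, and which tensors are kept in FP8.

GB = 1e9

def memory_terms_gb(n_params, activation_bytes):
    return {
        "bf16 weights + grads": 4 * n_params / GB,     # 2 B + 2 B per param
        "fp32 master + Adam m,v": 12 * n_params / GB,  # 4 B + 4 B + 4 B per param
        "activations": activation_bytes / GB,
    }

# Suppose a 1.3B-parameter model keeps ~10 GB of activations in BF16,
# and half of those tensors can be stored in FP8 instead:
bf16_acts = 10 * GB
fp8_acts = 0.5 * bf16_acts + 0.5 * bf16_acts / 2  # only the FP8 half shrinks 2x

print(memory_terms_gb(1.3e9, bf16_acts))  # weights+grads 5.2, optimizer 15.6, acts 10.0
print(memory_terms_gb(1.3e9, fp8_acts))   # acts drop to 7.5; the ~20.8 GB of state is unchanged
```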