-
Hey there, I had some quick questions about the FP8 integration: what kind of memory/performance improvements should we expect compared to BF16? I know FP8 has two formats, E4M3 and E5M2; is there additional overhead for switching between the two? Thanks!
Replies: 3 comments
-
The peak FLOPs for FP8 operations is 2x that of BF16, and for (especially larger) compute-bound operations like GEMMs you should see a speedup close to that. Full training of a language model has many more operations (like the optimizer and GPU-to-GPU communication), as well as some overhead from casting between sections that need BF16 and FP8, which limits the end-to-end speedup. On H100 with NeMo, for models with >1B parameters, we saw a 30-40% end-to-end speedup. We are working on more optimizations to reduce the time spent in those pieces that are not present when running in BF16 (see e.g. #118, which hides some of the FP8-specific communication) in order to increase that relative speedup.

As for your second question: you would use the two formats in different places in your network, and I do not think you would ever want to cast between the two (which would require a pass through memory). If your question is about the GEMM itself (as in, is there a performance difference when performing a GEMM with E4M3 vs. E5M2), then no, there is no performance impact from using one format or the other.
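To make the "different places" point concrete, here is a minimal sketch of how the formats are usually selected through the recipe in Transformer Engine's PyTorch API rather than by any manual casting. Treat it as illustrative: it assumes the `DelayedScaling` recipe with `Format.HYBRID`, which uses E4M3 for forward-pass tensors and E5M2 for backward-pass gradients, and the layer sizes are arbitrary.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID: E4M3 for forward-pass tensors (weights/activations),
# E5M2 for backward-pass tensors (gradients need more dynamic range).
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
inp = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# The FP8 casting happens inside the autocast context; the user never
# converts between E4M3 and E5M2 explicitly.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

out.sum().backward()
```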
-
Thank you @ptrendx - what about memory usage? Should we expect a significant gain in GPU RAM that would allow us to run twice the batch size in one forward pass?
-
The activations that are in FP8 are 2x smaller, yes, although how much that impacts the end-to-end memory usage depends on the network. In the large language model case it is typically the weights and the optimizer state that take most of the memory. As for the weights, the answer is more complicated and really depends on the usage.
TLDR is that it is complicated and highly use-case-dependent, so the safer bet is not to rely on gains on the memory front for now.
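To make the "it depends" concrete, here is a rough back-of-envelope sketch. It is purely illustrative: the 10 GB activation figure and the "half of the activation tensors are in FP8" split are made-up assumptions, and the per-parameter byte counts assume BF16 weights/grads with FP32 Adam state.

```python
# Back-of-envelope memory terms for mixed-precision Adam training.
# Illustrative numbers only; real activation memory depends heavily on
# the architecture, sequence length, and which tensors are kept in FP8.

GB = 1e9

def memory_terms_gb(n_params, activation_bytes):
    return {
        "bf16 weights + grads": 4 * n_params / GB,     # 2 B + 2 B per param
        "fp32 master + Adam m,v": 12 * n_params / GB,  # 4 B + 4 B + 4 B per param
        "activations": activation_bytes / GB,
    }

# Suppose a 1.3B-parameter model keeps ~10 GB of activations in BF16,
# and half of those tensors can be stored in FP8 instead:
bf16_acts = 10 * GB
fp8_acts = 0.5 * bf16_acts + 0.5 * bf16_acts / 2  # only the FP8 half shrinks 2x

print(memory_terms_gb(1.3e9, bf16_acts))  # weights+grads 5.2, optimizer 15.6, acts 10.0
print(memory_terms_gb(1.3e9, fp8_acts))   # acts drop to 7.5; the ~20.8 GB of state is unchanged
```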