Support direct quantization for FP8 matmul #3922

wenscarl · 2024-05-14T01:31:11Z

Historically, FP8 matmul quantization followed the pattern of fake quantization, which involved a sequence of operations: quantization -> dequantization -> dot. Here, (de)quantization refers to type casting and the application of scaling factors. The XLA GemmWriter pass was designed to transform this pattern into a custom cublasLt call.

This PR proposes a departure from the historical approach by adopting direct quantization, which is quantization -> dot -> dequantization. This adjustment aligns better with mainstream quantization implementations for other data types. However, the success of this PR hinges on another PR in JAX (PR-21211) because of the mixed fp8 type matmul.
cc @lukaszlew @kaixih

kaixih

Overall, the placement of quant and dequant is a bit confusing and the q and dq ops seems to be included in our custom dot_general function. I am trying to summarize the rationale here:
Basically, q means a pure quantize without amax logics but xxx_q includes both quantization and amax math.

# Our original design:
x(in_qdq), k(in_qdq)->y
dy(out_qdq), x(in_qdq)->dk
dy(out_qdq), k(in_qdq)->dx

# New direct design:
x(in_q), k(in_q)->y(dq)
dy(out_q), x(in_q)->dk(dq??)
dy(out_q), k(in_q)->dx(dq??)

?? indicates the problem about where to place these dq ops. In the original design, we don't need to worry about where the dk and dx are defined, because we don't apply any qdq there. However, in the new design, we need to find them and apply the dq ops explicitly and because we are using jvp mode (forward autograd mode), where we express the grad like:

dy = dx@k + x@dk

So, it seems we have to include the dq ops inside the dot_general function.

So, if that is the case, should we move all the q and dq into the dot_general function, esp in the jvp:

in_q(x)
in_q(y)
y = x@y
dq(y)
dq(dx)
dq(dk)
dy = dx@k + x@dk
in_q(dy)

Also, by doing this, we don't need vjp on the in_q or out_q, since logics are already expressed into the custom jvp dot_general function.

kaixih · 2024-05-14T16:50:53Z

flax/linen/fp8_ops.py

@@ -142,7 +141,7 @@ def qdq_and_return(x, q_dtype, scale, amax_history, compute_dtype):
  amax_from_history = jnp.max(amax_history, axis=0)
  new_scale = compute_scale(amax_from_history, scale, dtype_max)

-  qx = quantize_dequantize(x, q_dtype, new_scale, compute_dtype)


Can we also remove the quantize_dequantize? I think it is no longer used.

It's used in the test file. Removed.

kaixih · 2024-05-14T16:52:15Z

flax/linen/fp8_ops.py

-  q_g, new_scale, new_history = qdq_and_return(
-    g, jnp.float8_e5m2, scale, amax_history, compute_dtype
+  q_g, new_scale, new_history = q_and_return(
+    g, jnp.float8_e5m2, scale, amax_history, compute_dtype #elfie investigate


Is the comment here still relevant? Or can it be more specific as a TODO note?

kaixih · 2024-05-14T16:53:16Z

flax/linen/fp8_ops.py

-      'The function dot_general_with_precision will set the '
-      'precision/preferred_element_type and disregard any provided '
-      'values.'
+    if precision != None or preferred_element_type != None:


I think you accidentally changed the indent here.

kaixih · 2024-05-14T17:44:28Z

flax/linen/fp8_ops.py

+        )
+
+    lhs = quantize(lhs, jnp.float8_e4m3fn, lhs_scale, preferred_element_type)
+    rhs = quantize(rhs, jnp.float8_e4m3fn, rhs_scale, preferred_element_type)


It seems in the forward pass, we directly call the quantize over the lhs and rhs. But do we need the amax computation?

kaixih · 2024-05-14T17:49:01Z

flax/linen/fp8_ops.py

-      self.output_grad_scale.value,
-      self.output_grad_amax_history.value,
-    )
+    y_q = dot_general_with_precision(x, k, dimension_numbers,


I feel it would be better to write the code like:

qx = in_quant(x, ...) # which also includes the amax math qk = in_quant(k, ...) y = dot_general_and_dequant(qx, qk) y = grad_quant(y) # let's call it grad_q since it is to apply quantize over gradients

flax/linen/fp8_ops.py

kaixih · 2024-05-23T18:21:15Z

I think the new design is much clearer of the idea of direct quantization. By the way, do you think we should create a new Fp8DotGeneral op for it and keep the existing fake quant Op untouched? And then we gradually change downstream uses to migrate to the new op?

wenscarl · 2024-05-23T21:05:15Z

I think the new design is much clearer of the idea of direct quantization. By the way, do you think we should create a new Fp8DotGeneral op for it and keep the existing fake quant Op untouched? And then we gradually change downstream uses to migrate to the new op?

Praxis doesn't use Fp8DotGeneralOp directly but dot_general_with_precision. Given that most PAXML models access fp8 from praxis. So instead, I think it makes more sense to keep both fake and direct quant Op there.

flax/linen/fp8_ops.py

tests/linen/linen_test.py

flax/linen/fp8_ops.py

tests/linen/linen_test.py

wenscarl · 2024-06-24T15:52:47Z

A gentle reminder to @lukaszlew

lukaszlew · 2024-06-25T17:55:53Z

Sorry, I don't have cycles to review this PR. I'm focusing on AQT.

review-notebook-app · 2024-07-11T18:11:01Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

zhangqiaorjc · 2024-08-07T17:22:39Z

Could we get @levskaya to help review since Lukasz is busy with something else these days?

wenscarl · 2024-08-14T18:54:56Z

@levskaya could you take a look?

levskaya

Apologies for the long delay - I was visiting rural family who don't have an internet connection when I was pinged here.

levskaya · 2024-08-20T23:35:11Z

flax/linen/fp8_ops.py

+
+  q_g = quantize(g, jnp.float8_e5m2, new_out_grad_scale, preferred_element_type)
+
+  grad_lhs = _dot_general_transpose_lhs(


The JAX team really doesn't like us depending on their internal implementations. Could we inline this function logic here to make this free-standing?

Right, that 's also our main concern back then. Do you mean we should reimplement the logic of the two _xxx functions here?

yeah, they're fairly small functions and you don't need all the generality of them - it's just that JAX may need to change things in the future and we don't want to add external dependencies on their internals.

levskaya · 2024-08-20T23:35:24Z

flax/linen/fp8_ops.py

+    grad_lhs, preferred_element_type, new_rhs_scale * new_out_grad_scale
+  )
+
+  grad_rhs = _dot_general_transpose_rhs(


same comment as above

levskaya · 2024-08-20T23:36:14Z

flax/linen/fp8_ops.py

@@ -25,6 +25,7 @@
 from jax import numpy as jnp
 from jax._src import core
 from jax._src import dtypes
+from jax._src.lax.lax import _dot_general_transpose_lhs, _dot_general_transpose_rhs


better to inline - see below.

levskaya · 2024-08-22T06:18:40Z

Sorry we block if trailing spaces are left in the file, there's some after the line class Fp8Test(parameterized.TestCase): in the tests (you can see the failed precommit) - could you fix and we can run the tests. I think things look ok.

flax/linen/fp8_ops.py

kaixih · 2024-08-29T23:42:33Z

@levskaya I think all the issues have been resolved by @wenscarl . Can you help review and merge?

kaixih · 2024-08-30T17:50:51Z

@levskaya Just resolved some formatting issues. I think all the tests should pass now. Can you help review and merge?

levskaya

one comment below

flax/linen/fp8_ops.py

levskaya · 2024-09-04T17:59:31Z

@wenscarl @kaixih - hey also, I invited you as collaborators to this repo, apologies that I hadn't done that earlier, it should immediately trigger presubmit tests once you join.

levskaya · 2024-09-04T18:09:35Z

@wenscarl - I'm seeing a failed test?

E       UserWarning: The function dot_general_with_precision will set the precision/preferred_element_type and disregard any provided values.

wenscarl · 2024-09-04T19:08:00Z

@levskaya Thanks for reviewing. All checks are passed.

wenscarl changed the title ~~Support direct quantization for FP8 matmul~~ [draft]Support direct quantization for FP8 matmul May 14, 2024

kaixih reviewed May 14, 2024

View reviewed changes

wenscarl requested a review from kaixih May 15, 2024 03:00

wenscarl mentioned this pull request May 22, 2024

Support fp8 direct quantization google/praxis#69

Merged

wenscarl force-pushed the direct_quant branch 2 times, most recently from 3a4a72d to 1097287 Compare May 22, 2024 23:20

kaixih reviewed May 23, 2024

View reviewed changes

flax/linen/fp8_ops.py Outdated Show resolved Hide resolved

wenscarl requested a review from kaixih May 24, 2024 21:40

kaixih reviewed May 29, 2024

View reviewed changes

wenscarl requested a review from kaixih May 30, 2024 02:47

wenscarl force-pushed the direct_quant branch 2 times, most recently from 5e372dd to 2d3e9f5 Compare May 30, 2024 02:52

wenscarl changed the title ~~[draft]Support direct quantization for FP8 matmul~~ Support direct quantization for FP8 matmul May 30, 2024

kaixih reviewed May 30, 2024

View reviewed changes

flax/linen/fp8_ops.py Outdated Show resolved Hide resolved

flax/linen/fp8_ops.py Show resolved Hide resolved

flax/linen/fp8_ops.py Outdated Show resolved Hide resolved

tests/linen/linen_test.py Outdated Show resolved Hide resolved

wenscarl requested a review from kaixih May 30, 2024 18:57

wenscarl force-pushed the direct_quant branch from 8af353d to 374e000 Compare July 11, 2024 18:10

wenscarl force-pushed the direct_quant branch 2 times, most recently from 7e3ee81 to a25f7fc Compare July 11, 2024 19:21

levskaya reviewed Aug 20, 2024

View reviewed changes

wenscarl requested a review from levskaya August 21, 2024 18:03

wenscarl force-pushed the direct_quant branch from 6ad6ea6 to 6e4fc6d Compare August 21, 2024 18:05

kaixih reviewed Aug 22, 2024

View reviewed changes

flax/linen/fp8_ops.py Outdated Show resolved Hide resolved

flax/linen/fp8_ops.py Outdated Show resolved Hide resolved

flax/linen/fp8_ops.py Outdated Show resolved Hide resolved

wenscarl requested a review from kaixih August 22, 2024 18:37

kaixih approved these changes Aug 30, 2024

View reviewed changes

wenscarl force-pushed the direct_quant branch from afa5f6e to 2b7ff5a Compare August 30, 2024 18:03

wenscarl requested a review from kaixih September 3, 2024 19:22

Direct quantization for FP8 Dense Layer.

6f5cee1

wenscarl force-pushed the direct_quant branch from 93b70ed to 6f5cee1 Compare September 3, 2024 20:50

kaixih approved these changes Sep 3, 2024

View reviewed changes

levskaya approved these changes Sep 4, 2024

View reviewed changes

flax/linen/fp8_ops.py Outdated Show resolved Hide resolved

levskaya added the pull ready label Sep 4, 2024

Clean imports

4f5a722

wenscarl added 2 commits September 4, 2024 11:39

Add type: ignore

b4dc094

Remove unnecessary warnings.

e9d19ac

wenscarl requested a review from levskaya September 4, 2024 19:08

copybara-service bot merged commit c44b916 into google:main Sep 4, 2024
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support direct quantization for FP8 matmul #3922

Support direct quantization for FP8 matmul #3922

wenscarl commented May 14, 2024

kaixih left a comment •

edited

Loading

kaixih May 14, 2024

wenscarl May 14, 2024

kaixih May 14, 2024

kaixih May 14, 2024

kaixih May 14, 2024

kaixih May 14, 2024

kaixih commented May 23, 2024

wenscarl commented May 23, 2024

wenscarl commented Jun 24, 2024

lukaszlew commented Jun 25, 2024

review-notebook-app bot commented Jul 11, 2024

zhangqiaorjc commented Aug 7, 2024

wenscarl commented Aug 14, 2024

levskaya left a comment

levskaya Aug 20, 2024

kaixih Aug 20, 2024

levskaya Aug 21, 2024

levskaya Aug 20, 2024

levskaya Aug 20, 2024

levskaya commented Aug 22, 2024

kaixih commented Aug 29, 2024

kaixih commented Aug 30, 2024

levskaya left a comment

levskaya commented Sep 4, 2024

levskaya commented Sep 4, 2024

wenscarl commented Sep 4, 2024


		q_g = quantize(g, jnp.float8_e5m2, new_out_grad_scale, preferred_element_type)

		grad_lhs = _dot_general_transpose_lhs(

Support direct quantization for FP8 matmul #3922

Support direct quantization for FP8 matmul #3922

Conversation

wenscarl commented May 14, 2024

kaixih left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kaixih commented May 23, 2024

wenscarl commented May 23, 2024

wenscarl commented Jun 24, 2024

lukaszlew commented Jun 25, 2024

review-notebook-app bot commented Jul 11, 2024

zhangqiaorjc commented Aug 7, 2024

wenscarl commented Aug 14, 2024

levskaya left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

levskaya commented Aug 22, 2024

kaixih commented Aug 29, 2024

kaixih commented Aug 30, 2024

levskaya left a comment

Choose a reason for hiding this comment

levskaya commented Sep 4, 2024

levskaya commented Sep 4, 2024

wenscarl commented Sep 4, 2024

kaixih left a comment •

edited

Loading