Deterministic execution produces different results for single vs. batch input on both CPU and GPU #26795

Open
jashvira opened this issue Feb 27, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments

jashvira commented Feb 27, 2025

Description

When applying a simple MLP with Flax, the same input vector (passed as a batch of one vs. as the first row of a larger batch) does not produce bitwise-identical outputs, whether running on CPU or on GPU with XLA_FLAGS=--xla_gpu_deterministic_ops=true. The differences are small (around 1e-7), but they break strict equality checks in my application.

Colab notebook to reproduce.
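
Roughly, the comparison has this shape (a minimal sketch, not the notebook's exact code; the layer sizes, seeds, and shapes are arbitrary):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class MLP(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Dense(32)(x)
        x = nn.relu(x)
        return nn.Dense(1)(x)

model = MLP()
x_batch = jax.random.normal(jax.random.PRNGKey(0), (16, 8))
params = model.init(jax.random.PRNGKey(1), x_batch)

out_single = model.apply(params, x_batch[:1])   # first row passed as a batch of one
out_batch = model.apply(params, x_batch)[:1]    # first row of the full batch

# Expected to be bitwise identical, but differs at the ~1e-7 level.
print(jnp.array_equal(out_single, out_batch))
print(jnp.max(jnp.abs(out_single - out_batch)))
```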

System info (python version, jaxlib version, accelerator, etc.)

Colab CPU:

jax:    0.4.33
jaxlib: 0.4.33
numpy:  1.26.4
python: 3.11.11 (main, Dec  4 2024, 08:55:07) [GCC 11.4.0]
jax.devices (1 total, 1 local): [CpuDevice(id=0)]
process_count: 1
platform: uname_result(system='Linux', node='79425bdf5d7f', release='6.1.85+', version='#1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024', machine='x86_64')

Colab GPU:

jax:    0.4.33
jaxlib: 0.4.33
numpy:  1.26.4
python: 3.11.11 (main, Dec  4 2024, 08:55:07) [GCC 11.4.0]
jax.devices (1 total, 1 local): [CudaDevice(id=0)]
process_count: 1
platform: uname_result(system='Linux', node='7f58aca50f90', release='6.1.85+', version='#1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024', machine='x86_64')

jashvira added the bug label on Feb 27, 2025

justinjfu (Collaborator) commented Feb 27, 2025

I think there are two separate issues being discussed here: (a) deterministic results on GPU, and (b) identical results for unbatched vs. batched computation.

For (a), can you confirm that the GPU results are indeed deterministic? While the batched results differ from the unbatched results, running the batched computation twice should yield the same answer (one that differs from the unbatched result).
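
A quick check would look something like this sketch (shapes and the jitted function are illustrative, not the Colab reproduction):

```python
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(0), (8, 4))
x = jax.random.normal(jax.random.PRNGKey(1), (16, 8))

f = jax.jit(lambda a: a @ W)

# (a) Determinism: repeating the same batched call should be bitwise stable.
print(jnp.array_equal(f(x), f(x)))          # expected True

# (b) Unbatched vs. batched: the same row can still differ slightly.
print(jnp.array_equal(f(x[:1]), f(x)[:1]))  # may be False
```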

For (b), it's possible that different array shapes change the numerics slightly (e.g. the underlying kernel could pick different hyperparameters that affect the result). This is true even for a matrix multiplication done as a single call (A @ B) versus blocked (concat([A1 @ B, A2 @ B])). There isn't really an easy fix here, so I would recommend replacing strict equality checks on floating-point results with tolerance-based checks.
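
As an illustration (made-up shapes, not the exact case from the notebook), the blocked product can drift from the single call at the last-bit level, and a tolerance-based comparison absorbs that:

```python
import jax
import jax.numpy as jnp

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(key_a, (8, 4))
B = jax.random.normal(key_b, (4, 4))

full = A @ B
blocked = jnp.concatenate([A[:4] @ B, A[4:] @ B])

print(jnp.array_equal(full, blocked))          # may be False
print(jnp.allclose(full, blocked, atol=1e-6))  # robust to differences at the ~1e-7 level
```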

jashvira (Author) commented

(a) Confirmed: the GPU results are deterministic across repeated runs; see the test in the Colab.
(b) The last cell in the Colab roughly tests this, and yes, differently sized arrays cause the problem.

This is frustrating because my application requires reproducible, high-precision results. Without them, it fails entirely. I specifically chose JAX because I believed it supported strict determinism for such numerical computations.

Could the choices the kernel makes be stabilised across input lengths? If that is not possible, what alternatives exist for enforcing strict reproducibility in JAX?
