Description
It seems that, depending on the XLA_DISABLE_FUNCTIONALIZATION flag and whether ZeRO-1 is enabled, the loss either fails to converge or we OOM.
System info
aws-neuronx-runtime-discovery==2.9
libneuronxla==2.0.2335
neuronx-cc==2.14.213.0+013d129b
neuronx-distributed==0.8.0
torch==2.1.2
torch-neuronx==2.1.2.2.2.0
torch-xla==2.1.3
torchvision==0.16.2
I ran the same training job with all 4 combinations of XLA_DISABLE_FUNCTIONALIZATION = 0 | 1 and ZeRO-1 enabled / disabled (a minimal sketch of how each combination is toggled follows the results below):
XLA_DISABLE_FUNCTIONALIZATION=0 and ZeRO-1
In this case the loss diverges.

Note: since I am using Optimum Neuron, I am not sure whether this comes from my integration of the ZeroRedundancyOptimizer or whether it is an actual bug on your end and/or in torch_xla.
XLA_DISABLE_FUNCTIONALIZATION=1 and ZeRO-1
In this case the loss diverges to inf.

XLA_DISABLE_FUNCTIONALIZATION=0 and regular optimizer
In this case we OOM.
XLA_DISABLE_FUNCTIONALIZATION=1 and regular optimizer
The loss converges.
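
For reference, here is a minimal sketch of how each combination is toggled, assuming torch_xla's ZeroRedundancyOptimizer for the ZeRO-1 runs. The model, learning rate, and the USE_ZERO1 switch are placeholders, not the actual Optimum Neuron training code:

```python
import os

# Set the flag before torch_xla initializes (assumption: it is read once
# at runtime init, so exporting it later has no effect).
os.environ["XLA_DISABLE_FUNCTIONALIZATION"] = "1"  # "0" for the other runs

import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.zero_redundancy_optimizer import ZeroRedundancyOptimizer

device = xm.xla_device()
model = torch.nn.Linear(1024, 1024).to(device)  # placeholder for the real model

USE_ZERO1 = True  # placeholder switch, toggled per run

if USE_ZERO1:
    # ZeRO-1: optimizer states are sharded across the data-parallel ranks;
    # the wrapper reduce-scatters gradients and all-gathers updated params.
    optimizer = ZeroRedundancyOptimizer(model.parameters(), torch.optim.AdamW, lr=1e-4)
else:
    # Regular, unsharded optimizer.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```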
