
tf.data.experimental.sample_from_datasets non-deterministic in multi-gpu. #53846

Closed
yanniskar opened this issue Jan 21, 2022 · 19 comments
Labels
comp:dist-strat (Distribution Strategy related issues), TF 2.4 (for issues related to TF 2.4), type:bug (Bug)

Comments

@yanniskar

See NVIDIA/framework-reproducibility#39

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Does not apply
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 2.4.1
  • Python version: 3.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.2
  • GPU model and memory:
@yanniskar
Author

@duncanriach

@duncanriach
Contributor

@reedwm, bringing your attention to this potential issue for the 2.8 release.

@yanniskar, I see that you're using version 2.4.1; are you able to see if this issue exists in the 2.8.0-rc0 pre-release (e.g. by using the tensorflow/tensorflow:2.8.0rc0-gpu docker image or pip install tensorflow==2.8.0rc0)? This issue may have been fixed since 2.4.
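For reference, a minimal sanity check (assuming one of the install paths above) to confirm which version the environment is actually picking up before re-running your reproduction:

# Quick check that the pre-release is the TensorFlow version actually in use.
import tensorflow as tf

print(tf.__version__)  # expected to report the 2.8.0 release candidate, e.g. "2.8.0-rc0"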

@sushreebarsa sushreebarsa added comp:data tf.data related issues TF 2.4 for issues related to TF 2.4 stat:awaiting response Status - Awaiting response from author labels Jan 21, 2022
@reedwm
Member

reedwm commented Jan 21, 2022

Agree with @duncanriach that you should try on 2.8.0rc0. I'm not sure why this would be nondeterministic on TF 2.4, but it's possible this was fixed since then.

@sushreebarsa sushreebarsa removed the stat:awaiting response Status - Awaiting response from author label Jan 22, 2022
@sushreebarsa
Contributor

@yanniskar Could you please try with TF v2.8.0rc0 and let us know the outcome? Please refer to the above comments as well. Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Jan 22, 2022
@yanniskar
Author

Thanks for the response, @sushreebarsa @duncanriach. I need to stay on the TensorFlow version my org uses, so I cannot upgrade to 2.8 to test this; my org does not support cloud training on that version. If it is hard for you to verify this in 2.8, I can look into a workaround for testing my code on 2.8, though given other work priorities it might take me some time to get to it.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jan 29, 2022
@duncanriach
Contributor

duncanriach commented Jan 29, 2022

@yanniskar, if you cannot confirm whether this issue is present in the latest version of TensorFlow, then we cannot either: we do not have access to the code you're running, so we cannot run it on the latest version ourselves.

Another way to move this issue forward is for you to attempt to provide a self-contained reproducer, a simple piece of code that you can run and also share with us. That way, we can look at exactly what you're looking at, observe it on version 2.4.1, test if it's still present on the latest version, and then be able to potentially debug it. There are too many variables in these systems to be able to debug something that we cannot examine.

The following reproducer code demonstrates the kind of minimal example that could be provided to reproduce the observed multi-device (distributed) nondeterminism. This example distributes the dataset to both the GPU and CPU, with the two-element batch being split into one element per device. For this kind of problem (probably related to dataset distribution between devices), I doubt it matters what kind of devices are used.

The intention should be to recreate the basic configuration as accurately and minimally as possible. For example, it would probably be important to capture the distribution strategy used and when and how the dataset(s) are distributed (such as before or after applying sample_from_datasets).

With determinism, the devices should print the same sequence of values on each run, as they currently do in this example.

The following code can be run and modified in a copy of this colab notebook.

import tensorflow as tf

# Two source datasets of two-element rows; sample_from_datasets interleaves
# them with a fixed seed, so the sampling order should be reproducible.
dataset1 = tf.data.Dataset.from_tensor_slices([[10, 11], [12, 13], [14, 15], [16, 17]])
dataset2 = tf.data.Dataset.from_tensor_slices([[21, 22], [23, 24], [25, 26], [27, 28]])
sample_dataset = tf.data.experimental.sample_from_datasets(
  [dataset1, dataset2], weights=[0.5, 0.5], seed=43)

# Distribute across two devices; each two-element row is split one element per replica.
my_strategy = tf.distribute.MirroredStrategy(["GPU:0", "CPU:0"])
with my_strategy.scope():
  @tf.function
  def distribute_train_epoch(dataset):
    for x in dataset:
      my_strategy.run(print, args=(x,))

  dist_dataset = my_strategy.experimental_distribute_dataset(sample_dataset)

# With deterministic behavior, both passes should print the same sequence of values.
for _ in range(2):
  print("------------------")
  distribute_train_epoch(dist_dataset)

Output:

------------------
[10]
[11]
[21]
[22]
[23]
[24]
[12]
[13]
[25]
[26]
[14]
[15]
[27]
[28]
[16]
[17]
------------------
[10]
[11]
[21]
[22]
[23]
[24]
[12]
[13]
[25]
[26]
[14]
[15]
[27]
[28]
[16]
[17]

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Jan 29, 2022
@duncanriach
Contributor

@yanniskar, are you using XLA (e.g. @tf.function(jit_compile=True))?
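For clarity, explicit XLA use would look roughly like the following (a minimal, hypothetical sketch; train_step is just an illustrative name):

import tensorflow as tf

# Hypothetical illustration: jit_compile=True asks TensorFlow to compile the
# function with XLA instead of executing it op-by-op.
@tf.function(jit_compile=True)
def train_step(x):
  return x * 2.0

print(train_step(tf.constant([1.0, 2.0])))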

@kabilan6

kabilan6 commented Feb 3, 2022

When the dataset contains anywhere from 1,000 to millions of elements, it is non-deterministic. We noticed a similar problem with TF 2.7.

@duncanriach
Contributor

duncanriach commented Feb 3, 2022

Hi @kabilan6,

Thanks for that. For clarity, please confirm or refute the following five points:

  1. You have a model that trains deterministically on a single GPU.
  2. When you use more than one GPU (including only two GPUs), you get nondeterminism.
  3. You're using tf.data.experimental.sample_from_datasets.
  4. When you remove only tf.data.experimental.sample_from_datasets, the nondeterminism goes away.
  5. The newest version of TensorFlow that you have reproduced this issue on is 2.7.

Please answer the following question:

Are you using XLA (e.g. @tf.function(jit_compile=True))?

@kabilan6

kabilan6 commented Feb 3, 2022

Hello, this determinism issue occurs across the board and is not specific to tf.data.experimental.sample_from_datasets. Please refer to a similar issue I created:
#54259.

Please find my responses below:

  1. You have a model that trains deterministically on a single GPU. --> I'm not sure about training, but I think it's not related to training.
  2. When you use more than one GPU (including only two GPUs), you get nondeterminism. --> I noticed nondeterminism with a single GPU as well as multi-GPU (my example used the batch function).
  3. You're using tf.data.experimental.sample_from_datasets. --> No, I used experimental_distribute_dataset.
  4. When you remove only tf.data.experimental.sample_from_datasets, the nondeterminism goes away. --> I still noticed the issue; I think it's tied to tf.data.batch.
  5. The newest version of TensorFlow that you have reproduced this issue on is 2.7. --> Yes, for my use case I noticed it with version 2.7.

@duncanriach
Contributor

Okay, @kabilan6. From looking at your answers, I'm almost certain that you're dealing with a different issue because (1) your issue occurs with a single GPU and (2) you're not using sample_from_datasets. Thanks for opening #54259.

@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Feb 10, 2022
@yanniskar
Author

yanniskar commented Feb 11, 2022

@duncanriach sorry for going radio silent for two weeks. Work has really picked up lately so I have not had time to come back to this. Here are my responses to your comments in order:

  1. I will try to reproduce the issue using a self-contained reproducer like you suggested. I will also do this using the latest TensorFlow version. Once I do that, I will report my findings back in this thread.
  2. As far as I know, I am not using XLA. For more context, I am using a simple Keras model.fit training loop with MirroredStrategy for this problem (rough sketch below).

How does that plan sound? Let me know if you want more info on the XLA matter.
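To give a rough picture of the setup (a simplified, hypothetical sketch; the real model, datasets, and weights are internal), it is essentially sample_from_datasets feeding a plain Keras model.fit loop under MirroredStrategy:

import tensorflow as tf

# Simplified, hypothetical stand-in for the real pipeline: two datasets mixed
# with sample_from_datasets, trained via model.fit under MirroredStrategy.
dataset_a = tf.data.Dataset.from_tensor_slices((tf.zeros([8, 4]), tf.zeros([8, 1])))
dataset_b = tf.data.Dataset.from_tensor_slices((tf.ones([8, 4]), tf.ones([8, 1])))
mixed = tf.data.experimental.sample_from_datasets(
  [dataset_a, dataset_b], weights=[0.5, 0.5], seed=42).batch(4)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.compile(optimizer="sgd", loss="mse")

model.fit(mixed, epochs=2)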

@google-ml-butler google-ml-butler bot removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author labels Feb 11, 2022
@duncanriach
Contributor

Sounds great. Thanks, @yanniskar.

@plentx

plentx commented Feb 11, 2022 via email

@sushreebarsa sushreebarsa removed the comp:data tf.data related issues label Feb 22, 2022
@sushreebarsa sushreebarsa added the comp:dist-strat Distribution Strategy related issues label Feb 22, 2022
@gadagashwini
Contributor

@yanniskar, did you try @duncanriach's suggestion?
Please let us know if this is still an issue. Thanks!

@gadagashwini gadagashwini added the stat:awaiting response Status - Awaiting response from author label Feb 24, 2022
@yanniskar
Author

@yanniskar, did you try @duncanriach's suggestion? Please let us know if this is still an issue. Thanks!

No luck, unfortunately. Work has really picked up, so I have not had time to investigate this, given competing priorities and the fact that this is not a blocking issue for development. Feel free to close this issue, and I will circle back once I find some time (probably on my next PTO) to run the investigation I proposed. I don't think it is fair to consider this an active issue given that I have not verified it on the latest version of TensorFlow.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Feb 28, 2022
@gadagashwini
Contributor

Thanks for confirming, @yanniskar.
If you face the same issue on the latest version, please feel free to reopen. Thanks!

@google-ml-butler
