tf.data.experimental.sample_from_datasets non-deterministic in multi-gpu. #53846
Comments
@reedwm, bringing your attention to this potential issue for the 2.8 release. @yanniskar, I see that you're using version 2.4.1; are you able to see if this issue exists in the 2.8.0-rc0 pre-release?
Agree with @duncanriach that you should try on 2.8.0rc0. I'm not sure why this would be nondeterministic on TF 2.4, but it's possible this was fixed since then.
@yanniskar Could you please try with TF v2.8.0rc0 and let us know the outcome? Please refer to the above comments as well. Thanks!
Thanks for the response @sushreebarsa @duncanriach. I need to use the TensorFlow version my org is on. Thus, I cannot upgrade to 2.8 to test this, as my org does not support cloud training on that TensorFlow version. If it is hard for you to verify this in 2.8, I can look into a workaround for testing my code in 2.8. Given other work priorities, it might take me some time before I am able to do this though.
@yanniskar, if you cannot confirm that this issue is not present in the latest version of TensorFlow, then we cannot either, because we do not have access to the code you're running in order to run it on the latest version ourselves. Another way to move this issue forward is for you to attempt to provide a self-contained reproducer, a simple piece of code that you can run and also share with us. That way, we can look at exactly what you're looking at, observe it on version 2.4.1, test if it's still present on the latest version, and then be able to potentially debug it. There are too many variables in these systems to be able to debug something that we cannot examine.

The following reproducer code demonstrates the kind of minimal example that could be provided to reproduce the observed multi-device (distributed) nondeterminism. This example distributes the dataset to both the GPU and the CPU, with the two-element batch being split into one element per device. For this kind of problem (probably related to dataset distribution between devices), I doubt it matters what kind of devices are used. The intention should be to recreate the basic configuration as accurately and minimally as possible. For example, it would probably be important to capture the distribution strategy used and when and how the dataset(s) are distributed (such as before or after applying sample_from_datasets). With determinism, the devices should print the same sequence of values on each run, as they currently do in this example. The following code can be run and modified in a copy of this Colab notebook.

import tensorflow as tf

dataset1 = tf.data.Dataset.from_tensor_slices([[10, 11], [12, 13], [14, 15], [16, 17]])
dataset2 = tf.data.Dataset.from_tensor_slices([[21, 22], [23, 24], [25, 26], [27, 28]])
sample_dataset = tf.data.experimental.sample_from_datasets(
[dataset1, dataset2], weights=[0.5, 0.5], seed=43)
my_strategy = tf.distribute.MirroredStrategy(["GPU:0", "CPU:0"])
with my_strategy.scope():
  @tf.function
  def distribute_train_epoch(dataset):
    for x in dataset:
      my_strategy.run(print, args=(x,))

  dist_dataset = my_strategy.experimental_distribute_dataset(sample_dataset)

  for _ in range(2):
    print("------------------")
    distribute_train_epoch(dist_dataset)

Output:
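To make the comparison less dependent on reading console output, one possible extension of the reproducer above (a sketch reusing my_strategy and dist_dataset from it; the helper name collect_epoch is my own, not from the thread) is to collect the per-replica values and compare two epochs programmatically:

import tensorflow as tf

def collect_epoch(strategy, dist_dataset):
    # Iterate one epoch and return the values seen on each replica as plain lists.
    per_step_values = []
    for x in dist_dataset:
        # experimental_local_results unpacks the PerReplica value into one tensor per device.
        per_step_values.append(
            [t.numpy().tolist() for t in strategy.experimental_local_results(x)])
    return per_step_values

epoch_a = collect_epoch(my_strategy, dist_dataset)
epoch_b = collect_epoch(my_strategy, dist_dataset)
print("Identical across epochs:", epoch_a == epoch_b)

If sampling and distribution are deterministic, the two epochs should compare equal, mirroring what the printed output shows.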
@yanniskar, are you using XLA?
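For reference, a quick way to check whether XLA is in play (a sketch, not from the thread; the exact knobs vary by TF version, and in TF 2.4 the tf.function keyword is experimental_compile rather than jit_compile):

import os
import tensorflow as tf

# Global auto-clustering setting (empty/off unless it was turned on explicitly).
print("tf.config.optimizer.get_jit():", tf.config.optimizer.get_jit())

# Auto-clustering can also be enabled via the environment,
# e.g. TF_XLA_FLAGS="--tf_xla_auto_jit=2".
print("TF_XLA_FLAGS:", os.environ.get("TF_XLA_FLAGS"))

# Explicit XLA compilation of an individual function looks like this:
@tf.function(experimental_compile=True)
def double(x):
    return x * 2

print(double(tf.constant(3)))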
When the dataset contains on the order of a thousand to millions of elements, it's non-deterministic. We noticed a similar problem with the TF 2.7 version.
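One hedged way to probe that scale claim is to grow the minimal reproducer to large synthetic datasets; the sizes and the first_k helper below are assumptions for illustration, not taken from the report:

import tensorflow as tf

N = 1_000_000  # large enough to be in the reported thousands-to-millions regime

dataset1 = tf.data.Dataset.range(0, N)
dataset2 = tf.data.Dataset.range(N, 2 * N)
sample_dataset = tf.data.experimental.sample_from_datasets(
    [dataset1, dataset2], weights=[0.5, 0.5], seed=43)

def first_k(ds, k=1000):
    # Materialize the first k sampled elements so two iterations can be compared.
    return [int(x) for x in ds.take(k)]

print("Same sequence on re-iteration:", first_k(sample_dataset) == first_k(sample_dataset))

If the sampling itself is deterministic with a fixed seed, the two iterations should match; running the distributed version of the same check would then isolate whether the nondeterminism comes from distribution rather than from sampling.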
Hi @kabilan6, Thanks for that. For clarity, please confirm or refute the following four points:
Please answer the following question: are you using XLA?
Hello, this determinism issue exists across the board and is not specific to tf.data.experimental.sample_from_datasets. Please refer to a similar issue which I created. Please find my response below. "You have a model that trains deterministically on a single GPU." - I am not sure about training, but I guess it's not related to training.
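For context, a typical single-GPU determinism setup on the TF versions discussed here (2.4-2.7) fixes the seeds and requests deterministic op implementations; a rough sketch, not taken from the thread (newer releases expose tf.config.experimental.enable_op_determinism for the same purpose):

import os
import random
import numpy as np

# Request deterministic GPU kernel implementations; set this before TensorFlow
# initializes the relevant ops.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import tensorflow as tf

# Fix all relevant seeds before building the input pipeline and the model.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

With this in place, two identical single-GPU runs should produce matching results, which is the baseline against which the multi-device behavior in this issue can be compared.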
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
@duncanriach sorry for going radio silent for two weeks. Work has really picked up lately so I have not had time to come back to this. Here are my responses to your comments in order:
How does that plan sound? Lmk if you want more info on the XLA matter.
Sounds great. Thanks, @yanniskar.
Hello, this is 羊峻霄. I have received your email. Thank you.
@yanniskar, did you try @duncanriach's workaround?
No luck unfortunately. Work has really picked up, so I have not had time to investigate this due to competing priorities and this not being a blocking issue for development. Feel free to close this issue and I will circle back once I find some time (probably on my next PTO) to run the investigation I proposed. I don't think it is fair to consider this an active issue given that I have not verified it on the latest version of TensorFlow.
Thanks for confirming, @yanniskar.
See NVIDIA/framework-reproducibility#39
System information