tf.data.experimental.sample_from_datasets non-deterministic in multi-gpu. #39
Comments
Hi @yanniskar, thank you for isolating this issue, and for providing a reproducer. Could you please open an issue against the stock TensorFlow repo for this, referencing this current issue? Once you have opened that, I will close this issue.
Also @yanniskar, for my records of customers who value this work, could I know a bit more about you? Could you please give me your name and/or your affiliation?
Done: tensorflow/tensorflow#53846. To answer your other question, here is my LinkedIn: https://www.linkedin.com/in/yannis-karakozis-746488116/ Thanks for the help on this one :)
Thank you, and my pleasure. BTW, from TensorFlow version 2.7 onwards, you no longer need to manually serialize (
Problem Overview
I train my model on the same dataset in two different setups: A) single-GPU, B) multi-GPU. The former leads to deterministic results; the latter leads to non-deterministic results. The moment I replace the tf.data.experimental.sample_from_datasets API call with a direct call to tf.data.Dataset, B also becomes deterministic.
Environment
Python: 3.7
CUDA: 11.2
TensorFlow: 2.4.1
Code
Relevant API: https://www.tensorflow.org/versions/r2.4/api_docs/python/tf/data/experimental/sample_from_datasets
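For readers unfamiliar with this API: it takes a list of datasets plus optional weights and seed arguments, and interleaves elements by randomly choosing which dataset to draw from next. The snippet below is only a pure-Python analogue of that idea, paired with the kind of run-twice check one could use to test for determinism. It is not the reporter's code and not TensorFlow's implementation; the names sample_from_sources and first_elements, the toy ranges, and the seed value are all hypothetical.

```python
import itertools
import random


def sample_from_sources(sources, weights, seed):
    """Hypothetical stand-in: pick a source by weight, yield its next element."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    iterators = [iter(s) for s in sources]
    indices = list(range(len(iterators)))
    while True:
        idx = rng.choices(indices, weights=weights, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            return  # stop as soon as the chosen source is exhausted


def first_elements(seed, count=20):
    """Rebuild the toy pipeline from scratch and collect its first elements."""
    sources = [range(0, 100), range(1000, 1100)]  # toy data, assumed for illustration
    stream = sample_from_sources(sources, weights=[0.5, 0.5], seed=seed)
    return list(itertools.islice(stream, count))


# Determinism check: two independent runs with the same seed should match.
run_a = first_elements(seed=42)
run_b = first_elements(seed=42)
print("deterministic:", run_a == run_b)
```

In a real multi-GPU job the same comparison would presumably be made per worker, logging the first few elements each worker sees across two runs with identical seeds.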
I cannot provide the full code I use because it is proprietary, but here is the data loading portion. If more information is needed to root-cause this, let me know, and I will see what I can do to provide it. FYI, the main code sets all the seeds correctly and disables Horovod fusion as suggested by the repo README.
Thanks a lot for the great work on making TensorFlow deterministic. It, along with the documentation provided, has been incredibly useful in my day-to-day work.