
Support for tf.distribute.Strategy #87

Open
cweill opened this issue Mar 25, 2019 · 7 comments
Assignees: cweill
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@cweill
Contributor

cweill commented Mar 25, 2019

AdaNet doesn't currently support tf.distribute.Strategy. The current way to configure distributed training is to use a tf.estimator.RunConfig with the TF_CONFIG environment variable set appropriately to identify the different workers, as sketched below.

Refs #76
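For reference, a minimal sketch of that current setup, assuming placeholder host:port addresses and a placeholder model directory; each process in the cluster sets its own TF_CONFIG before constructing the Estimator:

    import json
    import os

    import tensorflow as tf

    # Placeholder cluster layout; real addresses depend on your deployment.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "chief": ["localhost:2222"],
            "worker": ["localhost:2223", "localhost:2224"],
            "ps": ["localhost:2225"],
        },
        # Each process declares its own role; this one is worker 0.
        "task": {"type": "worker", "index": 0},
    })

    # RunConfig picks up the cluster layout from TF_CONFIG.
    config = tf.estimator.RunConfig(model_dir="/tmp/adanet_model")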

@cweill cweill added the enhancement New feature or request label Mar 25, 2019
@cweill cweill self-assigned this Mar 25, 2019
@cweill cweill pinned this issue Apr 8, 2019
@cweill cweill added the help wanted Extra attention is needed label Apr 8, 2019
@cweill cweill unpinned this issue May 3, 2019
@chamorajg
Contributor

chamorajg commented May 22, 2019

I am new to this codebase. I want to contribute to a few open-source AI and ML projects to gain experience. Can I work on this issue? Can you suggest what needs to be done? I went through the code and there are some TODOs in the comments in placement.py; with permission and some guidance, could I work on those?

@chamorajg
Contributor

@cweill Can you give me a rough idea of how to make AdaNet support tf.distribute.Strategy? I have good experience with TensorFlow, but the codebase is quite large to search through, so a pointer would help me get a quick start.

@cweill
Contributor Author

cweill commented Jun 3, 2019

@chandramoulirajagopalan: The best way to get started is to first extend estimator_distributed_test_runner.py to test your implementation. You can then pass the tf.distribute.Strategy you want to test to the tf.estimator.RunConfig when constructing the AdaNet Estimator, as sketched below. If it works, great! If it doesn't, feel free to post your results here, and we'll work through it together.
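In case it helps, a rough sketch (untested, with a placeholder head, subnetwork generator, and input_fn) of what passing a strategy through RunConfig into the AdaNet Estimator might look like:

    import adanet
    import tensorflow as tf

    # Strategy under test; MirroredStrategy is just one example.
    strategy = tf.distribute.MirroredStrategy()

    # Hand the strategy to the Estimator via RunConfig.
    config = tf.estimator.RunConfig(
        train_distribute=strategy,
        model_dir="/tmp/adanet_distributed")

    estimator = adanet.Estimator(
        head=my_head,                       # placeholder: any tf.estimator head
        subnetwork_generator=my_generator,  # placeholder: an adanet.subnetwork.Generator
        max_iteration_steps=100,
        config=config)

    estimator.train(input_fn=_input_fn, max_steps=300)  # _input_fn is a placeholder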

@chamorajg
Contributor

chamorajg commented Jun 3, 2019

@cweill Yes, I will work on that file first to test my implementation on the Estimator, similar to issue #54, where tf.distribute.MirroredStrategy was used.

@cweill
Contributor Author

cweill commented Jun 3, 2019

@chandramoulirajagopalan: Just a heads up: I believe tf.distribute.MirroredStrategy is designed for multi-GPU training, so it may be difficult to test. But if you get it to run inside estimator_distributed_test_runner.py, then great work. Let us know if you have any questions.
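One possible way to exercise it on a CPU-only machine, an assumption rather than something the AdaNet tests do today, is to pin the strategy to an explicit device list:

    import tensorflow as tf

    # With only "/cpu:0" this degenerates to single-device training, but it
    # still exercises the Estimator + distribution-strategy code path on a
    # machine without GPUs.
    strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])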

@chamorajg
Contributor

FAIL: test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps (adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest)
test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps(estimator='estimator_with_distributed_mirrored_strategy', placement_strategy='replication', num_workers=5, num_ps=3)
----------------------------------------------------------------------
   Traceback (most recent call last):
    /home/chandramouli/.local/lib/python3.6/site-packages/absl/testing/parameterized.py line 262 in bound_param_test
      test_method(self, **testcase_params)
    adanet/core/estimator_distributed_test.py line 325 in test_distributed_training
      timeout_secs=500)
    adanet/core/estimator_distributed_test.py line 169 in _wait_for_processes
      self.assertEqual(0, ret_code)
   AssertionError: 0 != 1
   -------------------- >> begin captured logging << --------------------
   absl: INFO: Spawning chief_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_3 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning ps_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning ps_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning ps_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning evaluator_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: worker_0 finished
   absl: INFO: stderr for worker_0 (last 15000 chars): WARNING: Logging before flag parsing goes to stderr.
   W0608 00:13:42.569955 140367193364288 report_accessor.py:36] Failed to import report_pb2. ReportMaterializer will not work.
   I0608 00:13:42.935579 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
   2019-06-08 00:13:42.936921: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
   2019-06-08 00:13:42.983584: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1999890000 Hz
   2019-06-08 00:13:42.983983: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x20dff50 executing computations on platform Host. Devices:
   2019-06-08 00:13:42.984038: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
   I0608 00:13:43.000412 140367193364288 cross_device_ops.py:975] Device is available but not used by distribute strategy: /device:XLA_CPU:0
   W0608 00:13:43.001587 140367193364288 cross_device_ops.py:983] Not all devices in `tf.distribute.Strategy` are visible to TensorFlow.
   I0608 00:13:43.001820 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
   I0608 00:13:43.002067 140367193364288 run_config.py:532] Initializing RunConfig with distribution strategies.
   I0608 00:13:43.002384 140367193364288 estimator_training.py:176] RunConfig initialized for Distribute Coordinator with INDEPENDENT_WORKER mode
   W0608 00:13:43.003975 140367193364288 estimator.py:1760] Using temporary folder as model directory: /tmp/tmpv70w4krt
   I0608 00:13:43.004830 140367193364288 estimator.py:201] Using config: {'_model_dir': '/tmp/tmpv70w4krt', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
   graph_options {
     rewrite_options {
       meta_optimizer_iterations: ONE
     }
   }
   , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fa98a429748>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa98a4299b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': 'independent_worker'}
   Traceback (most recent call last):
     File "adanet/core/estimator_distributed_test_runner.py", line 350, in <module>
       app.run(main)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
       _run_main(main, args)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
       sys.exit(main(argv))
     File "adanet/core/estimator_distributed_test_runner.py", line 346, in main
       train_and_evaluate_estimator()
     File "adanet/core/estimator_distributed_test_runner.py", line 318, in train_and_evaluate_estimator
       classifier.train(input_fn=_input_fn)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
       loss = self._train_model(input_fn, hooks, saving_listeners)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
       return self._train_model_distributed(input_fn, hooks, saving_listeners)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1180, in _train_model_distributed
       hooks)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 302, in estimator_train
       if 'evaluator' in cluster_spec:
   TypeError: argument of type 'ClusterSpec' is not iterable
   
   
   --------------------- >> end captured logging << ---------------------

@cweill
Contributor Author

cweill commented Jun 7, 2019

Good work getting that running inside the runner. I'm surprised that the error is coming from so deep inside TensorFlow Estimator. If you create a PR, I can take a look there.
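For what it's worth, the failing line in the traceback, if 'evaluator' in cluster_spec:, does a membership check on a tf.train.ClusterSpec, which doesn't support the in operator in this TensorFlow version (hence "argument of type 'ClusterSpec' is not iterable"). A small illustration of the difference, not a proposed fix for AdaNet:

    import tensorflow as tf

    cluster_spec = tf.train.ClusterSpec({
        "chief": ["localhost:2222"],
        "worker": ["localhost:2223"],
    })

    # "evaluator" in cluster_spec              # raises TypeError in this TF version
    print("evaluator" in cluster_spec.jobs)       # works: jobs is a list of job names
    print("evaluator" in cluster_spec.as_dict())  # works: dict keyed by job name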
