TonY Configurations
Jonathan Hung edited this page Sep 27, 2018 · 26 revisions
Name | Default | Meaning |
---|---|---|
tony.other.namenodes | | Namenode URIs to get delegation tokens from. |
tony.yarn.queue | default | Default queue to submit to YARN. |
tony.application.name | TensorFlowApplication | Name of your YARN application. |
tony.application.node-label | | YARN partition which this application should run in. |
tony.application.single-node | false | Whether this is single node training or not. |
tony.application.enable-preprocess | false | Whether the AM should invoke the user's Python script or not. |
tony.application.timeout | 0 | Max runtime of the application before killing it, in milliseconds. |
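
As an illustration, the application-level settings above could be overridden in the same `<configuration>` XML format used elsewhere on this page. The queue name `ml-training`, the application name `MnistTraining`, and the one-hour timeout are hypothetical values chosen for this sketch, not TonY defaults.

```xml
<configuration>
  <!-- Hypothetical queue name; use a queue that exists on your cluster. -->
  <property>
    <name>tony.yarn.queue</name>
    <value>ml-training</value>
  </property>
  <!-- Hypothetical application name, shown instead of the default TensorFlowApplication. -->
  <property>
    <name>tony.application.name</name>
    <value>MnistTraining</value>
  </property>
  <!-- The timeout is given in milliseconds: 60 min * 60 s * 1000 ms = 3600000 ms (one hour). -->
  <property>
    <name>tony.application.timeout</name>
    <value>3600000</value>
  </property>
</configuration>
```

Properties that are left out keep the defaults listed in the table.
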
Name | Default | Meaning |
---|---|---|
tony.task.executor.jvm.opts | -Xmx1536m | JVM opts for each TaskExecutor. |
tony.task.registration-timeout-sec | 300 | Time, in seconds, that the AM waits for tasks to register before resubmitting the unregistered ones (or failing if no retries are configured). |
tony.task.registration-retry-count | 0 | How many times the AM resubmits unregistered tasks after the registration timeout elapses. |
tony.task.heartbeat-interval | 1000 | Interval, in milliseconds, at which TaskExecutors send heartbeats to the AM. |
tony.task.max-missed-heartbeats | 25 | How many missed heartbeats before declaring a TaskExecutor dead. |
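
The heartbeat settings interact: with the defaults, a TaskExecutor would be declared dead after roughly 25 missed heartbeats × 1000 ms = 25 seconds of silence, assuming missed heartbeats are counted once per heartbeat interval (this page does not spell out the exact accounting). The sketch below uses hypothetical values that stretch this window to about 60 seconds and allow two registration retries.

```xml
<configuration>
  <!-- Hypothetical: heartbeat every 2000 ms instead of the default 1000 ms. -->
  <property>
    <name>tony.task.heartbeat-interval</name>
    <value>2000</value>
  </property>
  <!-- Hypothetical: 30 missed heartbeats * 2000 ms = 60000 ms (about 60 s) before a TaskExecutor is declared dead. -->
  <property>
    <name>tony.task.max-missed-heartbeats</name>
    <value>30</value>
  </property>
  <!-- Hypothetical: wait 600 s for tasks to register. -->
  <property>
    <name>tony.task.registration-timeout-sec</name>
    <value>600</value>
  </property>
  <!-- Hypothetical: resubmit unregistered tasks up to 2 times; the default 0 means no resubmission. -->
  <property>
    <name>tony.task.registration-retry-count</name>
    <value>2</value>
  </property>
</configuration>
```
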
Name | Default | Meaning |
---|---|---|
tony.am.retry-count | 0 | How many times a failed AM should retry. |
tony.am.memory | 2g | AM memory size, requested as a string (e.g. '2g' or '2048m'). |
tony.am.vcores | 1 | Number of AM vcores to use. |
tony.am.gpus | 0 | Number of AM GPUs to use. (In general, should only be applicable in single node mode.) |
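
Following the note that AM GPUs are generally only applicable in single node mode, a single-node setup might combine `tony.application.single-node` with the AM resource settings as sketched below. The specific sizes are hypothetical and only illustrate the value formats.

```xml
<configuration>
  <!-- Run as single node training (default is false). -->
  <property>
    <name>tony.application.single-node</name>
    <value>true</value>
  </property>
  <!-- Hypothetical AM sizing; memory is a string such as '4g' or '4096m'. -->
  <property>
    <name>tony.am.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.am.vcores</name>
    <value>2</value>
  </property>
  <!-- Request one GPU for the AM; the default is 0. -->
  <property>
    <name>tony.am.gpus</name>
    <value>1</value>
  </property>
</configuration>
```
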
Name | Default | Meaning |
---|---|---|
tony.X.instances | 1 | Number of tasks for TensorFlow job "X". Defaults to 1 if X is ps or worker, and to 0 otherwise. |
tony.X.memory | 2g | Memory size per task in TensorFlow job "X", requested as a string (e.g. '2g' or '2048m'). |
tony.X.vcores | 1 | Number of vcores per task in TensorFlow job "X". |
tony.X.gpus | 0 | Number of GPUs per task in TensorFlow job "X". |
TonY determines which TensorFlow job types to allocate based on configurations of the form "tony.X.instances". For each job "X", it also looks for the corresponding resource requests ("tony.X.memory", "tony.X.vcores", and "tony.X.gpus").
For example, you can configure a ps, worker, and chief job via:

    <configuration>
      <property>
        <name>tony.worker.instances</name>
        <value>4</value>
      </property>
      <property>
        <name>tony.worker.memory</name>
        <value>4g</value>
      </property>
      <property>
        <name>tony.worker.gpus</name>
        <value>1</value>
      </property>
      <property>
        <name>tony.ps.memory</name>
        <value>3g</value>
      </property>
      <property>
        <name>tony.chief.instances</name>
        <value>1</value>
      </property>
      <property>
        <name>tony.chief.memory</name>
        <value>6g</value>
      </property>
      <property>
        <name>tony.chief.gpus</name>
        <value>1</value>
      </property>
    </configuration>

Note that by default TonY configures one ps and one worker, and no other TensorFlow jobs. In this example, four workers are allocated because "tony.worker.instances" is explicitly set to 4, and one ps is allocated because "tony.ps.instances" is omitted. TonY also allocates one chief task because "tony.chief.instances" is set to 1, and this task gets 6 GB of memory and 1 GPU.
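
Because every job type other than ps and worker defaults to 0 instances, any additional job type has to be requested explicitly. The sketch below assumes a hypothetical job type named `evaluator`; the name is for illustration only, since this page names only ps, worker, and chief.

```xml
<configuration>
  <!-- Hypothetical job type "evaluator": it is only allocated because instances is set explicitly. -->
  <property>
    <name>tony.evaluator.instances</name>
    <value>1</value>
  </property>
  <!-- Resource requests for the same job type follow the tony.X.* pattern. -->
  <property>
    <name>tony.evaluator.memory</name>
    <value>2g</value>
  </property>
  <property>
    <name>tony.evaluator.vcores</name>
    <value>2</value>
  </property>
</configuration>
```

Here "tony.evaluator.gpus" is omitted, so per the table it would fall back to the default of 0 GPUs per task.
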
Name | Default | Meaning |
---|---|---|
tony.application.security.enabled | true | Whether this application is running in a Kerberized grid. When set to true, TonY fetches tokens from the cluster as well as tokens used between the client and the AM. |
tony.application.hdfs-conf-path | | Path to HDFS configuration, to be passed as an environment variable to the Python training scripts. |
tony.application.yarn-conf-path | | Path to YARN configuration, to be passed as an environment variable to the Python training scripts. |
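
As a rough sketch, a cluster that is not Kerberized might disable security and point the training scripts at its Hadoop configuration as shown below. Both paths are placeholders invented for this example, and this page does not state whether each should name a file or a directory, so adjust them to your installation.

```xml
<configuration>
  <!-- Default is true; false here assumes a cluster without Kerberos. -->
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
  </property>
  <!-- Placeholder path (assumption): replace with your HDFS configuration location. -->
  <property>
    <name>tony.application.hdfs-conf-path</name>
    <value>/path/to/hdfs/conf</value>
  </property>
  <!-- Placeholder path (assumption): replace with your YARN configuration location. -->
  <property>
    <name>tony.application.yarn-conf-path</name>
    <value>/path/to/yarn/conf</value>
  </property>
</configuration>
```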