TonY Configurations

Application Properties:

Name	Default	Meaning
tony.other.namenodes		Namenode URIs to get delegation tokens from.
tony.yarn.queue	default	YARN queue to which the application is submitted.
tony.application.name	TensorFlowApplication	Name of your YARN application.
tony.application.node-label		YARN partition which this application should run in.
tony.application.single-node	false	Whether this is single node training or not.
tony.application.enable-preprocess	false	Whether the AM should invoke the user's python script or not.
tony.application.timeout	0	Max runtime of the application before killing it, in milliseconds.
tony.application.untracked.jobtypes	ps	Comma-separated list of task types that TonY will not track and wait for to finish. Once all other task types have finished, the TonY AM will exit.
tony.containers.resources	n/a	Comma-separated list of resources to localize to all containers. An entry with no scheme (such as hdfs:// or s3://) is treated as a local file. Append #archive to an entry to have the file unzipped automatically when it is localized to the containers; the resulting folder has the same name as the file. For example, /user/khu/abc.zip#archive is treated as a local file and is unarchived in the containers, so you can expect an abc.zip/ folder in each container's working directory.
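
The rules for resource entries described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not TonY's actual parsing code; the function names `describe_resource` and `describe_resources` are hypothetical, and the sketch assumes that a scheme is recognisable by the presence of `://`.

```python
ARCHIVE_SUFFIX = "#archive"


def describe_resource(entry):
    """Classify one entry of a comma-separated resources value."""
    entry = entry.strip()
    is_archive = entry.endswith(ARCHIVE_SUFFIX)
    # Drop the annotation to recover the plain path or URI.
    path = entry[: -len(ARCHIVE_SUFFIX)] if is_archive else entry
    return {
        "path": path,
        # Assumption: an entry without "://" has no scheme and is a local file.
        "local": "://" not in path,
        "unarchive": is_archive,
        # The unzipped folder keeps the file name, e.g. abc.zip/.
        "folder": path.rstrip("/").rsplit("/", 1)[-1] if is_archive else None,
    }


def describe_resources(value):
    """Split a resources value on commas and classify every entry."""
    return [describe_resource(e) for e in value.split(",") if e.strip()]


if __name__ == "__main__":
    value = "/user/khu/abc.zip#archive,hdfs://namenode:9000/user/tony/hello.py"
    for item in describe_resources(value):
        print(item)
```

The first entry is reported as a local archive that unpacks into `abc.zip`, the second as a remote file that is localized as-is.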

Task Properties:

Name	Default	Meaning
tony.task.executor.jvm.opts	-Xmx1536m	JVM opts for each TaskExecutor.
tony.task.heartbeat-interval	1000	Interval, in milliseconds, at which TaskExecutors send heartbeats to the AM.
tony.task.max-missed-heartbeats	25	Number of missed heartbeats after which a TaskExecutor is declared dead.
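
Taken together, the two heartbeat settings bound how long a silent TaskExecutor goes unnoticed. The snippet below is a rough estimate, assuming the detection window is simply the interval multiplied by the number of tolerated misses; the exact timing inside the AM may differ by up to one interval. The function name is hypothetical.

```python
def heartbeat_timeout_ms(interval_ms=1000, max_missed=25):
    """Approximate time before a silent TaskExecutor is declared dead."""
    return interval_ms * max_missed


if __name__ == "__main__":
    # With the defaults from the table: 1000 ms * 25 = 25000 ms.
    print(heartbeat_timeout_ms())
```

With the defaults this is roughly 25 seconds; lowering either value makes failure detection faster at the cost of more false positives on a busy cluster.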

AM Configuration

Name	Default	Meaning
tony.am.retry-count	0	How many times a failed AM should retry. On retry, all tasks (workers, ps, etc.) will be relaunched.
tony.am.memory	2g	AM memory size, requested as a string (e.g. '2g' or '2048m').
tony.am.vcores	1	Number of AM vcores to use.
tony.am.gpus	0	Number of AM GPUs to use. (Generally applicable only in single-node mode.)
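
Memory values such as '2g' and '2048m' describe the same amount. The following sketch shows one way to normalise them to megabytes before comparing settings. It assumes only the `m` and `g` suffixes shown in the table, with 1g equal to 1024m; TonY's own parser may accept other forms, and `memory_to_mb` is a hypothetical helper.

```python
import re

# Assumption: binary multiples, so 1g == 1024m.
_UNITS_IN_MB = {"m": 1, "g": 1024}


def memory_to_mb(value):
    """Convert a memory string such as '2g' or '2048m' to megabytes."""
    match = re.fullmatch(r"(\d+)([mg])", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory string: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS_IN_MB[unit]


if __name__ == "__main__":
    print(memory_to_mb("2g"), memory_to_mb("2048m"))
```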

Task Configuration

Name	Default	Meaning
tony.X.instances	1	Number of tasks for TensorFlow job "X". Defaults to 1 if X is ps or worker, and to 0 otherwise.
tony.X.memory	2g	Memory size per task in TensorFlow job "X", requested as a string (e.g. '2g' or '2048m').
tony.X.vcores	1	Number of vcores per task in TensorFlow job "X".
tony.X.gpus	0	Number of GPUs per task in TensorFlow job "X".
tony.X.resources	n/a	Comma-separated list of resources to localize to all containers running the "X" job type. An entry with no scheme (such as hdfs:// or s3://) is treated as a local file. Append #archive to an entry to have the file unzipped automatically when it is localized to the containers; the resulting folder has the same name as the file. For example, /user/khu/abc.zip#archive is treated as a local file and is unarchived in the containers, so you can expect an abc.zip/ folder in each container's working directory.

TonY determines which TensorFlow job types to allocate from configurations of the form "tony.X.instances". For each job "X", it also looks up the matching resource requests (tony.X.memory, tony.X.vcores, tony.X.gpus, and tony.X.resources).

For example, you can configure ps, worker, and chief jobs via:

<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.worker.resources</name>
    <value>hdfs://namenode:9000/user/tony/hello.py</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
  <property>
    <name>tony.chief.instances</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.chief.memory</name>
    <value>6g</value>
  </property>
  <property>
    <name>tony.chief.gpus</name>
    <value>1</value>
  </property>
</configuration>

By default, TonY configures one ps and one worker and no other TensorFlow jobs. In this example, four workers are allocated because "tony.worker.instances" is set explicitly, and one ps is allocated because "tony.ps.instances" is omitted. TonY also configures one chief task because "tony.chief.instances" is set to 1; this task is allocated 6 GB of memory and 1 GPU.
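
The way instance counts are resolved for this example can be sketched in Python. This is an illustration of the rules described here, not TonY's implementation: the helper names `load_properties` and `resolve_instances` are hypothetical, and the XML is a trimmed copy of the configuration above.

```python
import re
import xml.etree.ElementTree as ET

# Keys of the form tony.X.instances, where X is the job type.
INSTANCES_KEY = re.compile(r"^tony\.([^.]+)\.instances$")
# Documented defaults: one ps and one worker, no other job types.
DEFAULT_INSTANCES = {"ps": 1, "worker": 1}

EXAMPLE_XML = """
<configuration>
  <property><name>tony.worker.instances</name><value>4</value></property>
  <property><name>tony.worker.memory</name><value>4g</value></property>
  <property><name>tony.ps.memory</name><value>3g</value></property>
  <property><name>tony.chief.instances</name><value>1</value></property>
  <property><name>tony.chief.memory</name><value>6g</value></property>
  <property><name>tony.chief.gpus</name><value>1</value></property>
</configuration>
"""


def load_properties(xml_text):
    """Read name/value pairs from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def resolve_instances(props):
    """Apply the defaults, then override them with explicit tony.X.instances keys."""
    counts = dict(DEFAULT_INSTANCES)
    for key, value in props.items():
        match = INSTANCES_KEY.match(key)
        if match:
            counts[match.group(1)] = int(value)
    # Job types with zero instances are not allocated.
    return {job: n for job, n in counts.items() if n > 0}


resolved = resolve_instances(load_properties(EXAMPLE_XML))

if __name__ == "__main__":
    print(resolved)
```

For the example configuration this yields four workers, one ps (from the default), and one chief.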

Others

Name	Default	Meaning
tony.application.security.enabled	true	Whether this application is running in a Kerberized grid. Setting this to `true` will fetch tokens from the cluster as well as between the client and AM.
tony.application.hdfs-conf-path		Path to HDFS configuration, to be passed as an environment variable to the python training scripts.
tony.application.yarn-conf-path		Path to YARN configuration, to be passed as an environment variable to the python training scripts.
