
TonY Configurations

Phat Dai Tran edited this page Sep 21, 2018 · 26 revisions

Application Properties:

| Name | Default | Meaning |
| --- | --- | --- |
| tony.other.namenodes | | Namenode URIs to get delegation tokens from. |
| tony.yarn.queue | default | YARN queue to submit the application to. |
| tony.application.name | TensorFlowApplication | Name of your YARN application. |
| tony.application.node-label | | YARN partition in which this application should run. |
| tony.application.single-node | false | Whether this is single-node training. |
| tony.application.enable-preprocess | false | Whether the AM should invoke the user's Python script. |
| tony.application.timeout | 0 | Maximum runtime of the application, in milliseconds, before it is killed. |
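
As an illustration of how these properties might be set, the sketch below writes a few of them into a Hadoop-style configuration XML file. The file name `tony.xml`, the helper `write_tony_conf`, and the example values are assumptions made for this sketch; this page does not specify how the properties are supplied.

```python
import xml.etree.ElementTree as ET


def write_tony_conf(props, path):
    """Write a dict of property names/values as Hadoop-style <configuration> XML.

    Hypothetical helper for illustration; not part of TonY.
    """
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


# Example values only; the keys come from the Application Properties table.
app_props = {
    "tony.yarn.queue": "default",
    "tony.application.name": "TensorFlowApplication",
    "tony.application.single-node": "false",
    "tony.application.timeout": 3600000,  # one hour, in milliseconds
}

if __name__ == "__main__":
    write_tony_conf(app_props, "tony.xml")  # assumed file name
```

Keeping the values in a plain dict makes it easy to override a single property per job before writing the file.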

Task Properties:

| Name | Default | Meaning |
| --- | --- | --- |
| tony.task.executor.jvm.opts | -Xmx1536m | JVM options for each TaskExecutor. |
| tony.task.registration-timeout-sec | 300 | Timeout, in seconds, after which the AM resubmits unregistered tasks (or fails if no retries are configured). |
| tony.task.registration-retry-count | 0 | How many times unregistered tasks are resubmitted after the timeout interval. |
| tony.task.heartbeat-interval | 1000 | Interval, in milliseconds, at which TaskExecutors send heartbeats to the AM. |
| tony.task.max-missed-heartbeats | 25 | How many heartbeats may be missed before a TaskExecutor is declared dead. |
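
The heartbeat and registration settings combine into effective time windows. The following sketch works through that arithmetic under two assumptions not stated on this page: that a TaskExecutor is declared dead after roughly interval times missed-heartbeat count, and that each resubmission waits for the same registration timeout again. The function names are hypothetical.

```python
def dead_after_ms(heartbeat_interval_ms=1000, max_missed_heartbeats=25):
    """Approximate silence, in ms, before a TaskExecutor is declared dead.

    Assumes the window is simply interval * allowed missed heartbeats.
    """
    return heartbeat_interval_ms * max_missed_heartbeats


def max_registration_wait_sec(timeout_sec=300, retry_count=0):
    """Upper bound, in seconds, on waiting for tasks to register.

    Assumes the initial attempt and every retry each wait one full timeout.
    """
    return timeout_sec * (retry_count + 1)


if __name__ == "__main__":
    print(dead_after_ms())                          # 25000 with the defaults
    print(max_registration_wait_sec())              # 300 with the defaults
    print(max_registration_wait_sec(retry_count=2)) # 900
```

With the defaults, a silent TaskExecutor would be given about 25 seconds under these assumptions.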

AM Configuration

| Name | Default | Meaning |
| --- | --- | --- |
| tony.am.retry-count | 0 | How many times a failed AM should retry. |
| tony.am.memory | 2g | AM memory size, requested as a string (e.g. '2g' or '2048m'). |
| tony.am.vcores | 1 | Number of AM vcores to use. |
| tony.am.gpus | 0 | Number of AM GPUs to use. (In general, should only be applicable in single node mode.) |

Executor Configuration

| Name | Default | Meaning |
| --- | --- | --- |
| tony.ps.memory | 2g | Parameter server memory size, requested as a string (e.g. '2g' or '2048m'). |
| tony.ps.vcores | 1 | Number of vcores per parameter server. |
| tony.ps.instances | 1 | Number of parameter servers to request. |
| tony.worker.timeout | 0 | Timeout, in milliseconds, for the user's Python processes before they are forcibly killed. |
| tony.worker.memory | 2g | Worker memory size, requested as a string (e.g. '2g' or '2048m'). |
| tony.worker.vcores | 1 | Number of vcores per worker. |
| tony.worker.gpus | 0 | Number of GPUs per worker. |
| tony.worker.instances | 1 | Number of workers to request. |
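
To see what a given executor configuration adds up to, the sketch below parses the memory strings and totals the resources requested across parameter servers and workers. The helpers `parse_memory_mb` and `total_executor_request` are hypothetical, only the 'm' and 'g' suffixes shown on this page are handled, and 1g is assumed to equal 1024m. The figures are the raw sums; a scheduler may round individual container requests differently.

```python
import re

# Assumed conversion: '2g' and '2048m' describe the same amount.
_UNITS_MB = {"m": 1, "g": 1024}


def parse_memory_mb(value):
    """Convert a memory string such as '2g' or '2048m' to megabytes."""
    match = re.fullmatch(r"(\d+)([mg])", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory string: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS_MB[unit]


def total_executor_request(conf):
    """Sum memory (MB), vcores and GPUs over all parameter servers and workers."""
    ps_n = int(conf["tony.ps.instances"])
    worker_n = int(conf["tony.worker.instances"])
    memory_mb = (ps_n * parse_memory_mb(conf["tony.ps.memory"])
                 + worker_n * parse_memory_mb(conf["tony.worker.memory"]))
    vcores = (ps_n * int(conf["tony.ps.vcores"])
              + worker_n * int(conf["tony.worker.vcores"]))
    gpus = worker_n * int(conf["tony.worker.gpus"])
    return {"memory_mb": memory_mb, "vcores": vcores, "gpus": gpus}


# Default values copied from the Executor Configuration table.
defaults = {
    "tony.ps.memory": "2g", "tony.ps.vcores": "1", "tony.ps.instances": "1",
    "tony.worker.memory": "2g", "tony.worker.vcores": "1",
    "tony.worker.gpus": "0", "tony.worker.instances": "1",
}

if __name__ == "__main__":
    print(total_executor_request(defaults))
    # {'memory_mb': 4096, 'vcores': 2, 'gpus': 0}
```

The AM's own request (tony.am.memory, tony.am.vcores, tony.am.gpus) is not included in this total.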

Others

| Name | Default | Meaning |
| --- | --- | --- |
| tony.application.security.enabled | true | Whether this application is running in a Kerberized grid. Setting this to true will fetch tokens from the cluster as well as between the client and AM. |
| tony.application.hdfs-conf-path | | Path to HDFS configuration, to be passed as an environment variable to the python training scripts. |
| tony.application.yarn-conf-path | | Path to YARN configuration, to be passed as an environment variable to the python training scripts. |