Frequently Asked Questions
In general, InputMode.TENSORFLOW is only using Spark to schedule the TensorFlow nodes onto the Spark executors. Afterwards, it's essentially just a pure distributed TensorFlow cluster. This will give you the best performance, since the TensorFlow nodes will read data directly from disk, using TensorFlow native APIs.
InputMode.SPARK not only schedules the TensorFlow nodes onto the executors, but it also sets up a queue for transferring data from Spark RDDs to the TensorFlow nodes. Data from RDD partitions are pushed into the queue, while the TensorFlow nodes pull data from this queue. Due to this extra hop, the data I/O performance can be significantly slower than InputMode.TENSORFLOW.
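The extra hop can be pictured with a small, self-contained sketch. This is not the TensorFlowOnSpark implementation: it uses a plain Python `queue.Queue` and a thread as stand-ins for the Spark feeding task and the TensorFlow node, and the names `feed_partition` and `next_batch` and the `None` sentinel are assumptions made for illustration only.

```python
import queue
import threading

def feed_partition(records, q):
    """Producer side: push every record of one simulated RDD partition into the queue."""
    for record in records:
        q.put(record)
    q.put(None)  # end-of-data sentinel (a convention of this sketch only)

def next_batch(q, batch_size):
    """Consumer side: pull up to batch_size records, stopping early at the sentinel."""
    batch = []
    while len(batch) < batch_size:
        item = q.get()
        if item is None:
            q.put(None)  # leave the sentinel in place for later calls
            break
        batch.append(item)
    return batch

q = queue.Queue(maxsize=4)  # bounded, so a slow consumer throttles the producer
partition = list(range(10))
producer = threading.Thread(target=feed_partition, args=(partition, q))
producer.start()

batches = []
while True:
    batch = next_batch(q, batch_size=4)
    if not batch:
        break
    batches.append(batch)
producer.join()
print(batches)
```

Every record crosses the queue before the consumer sees it, which is the additional cost that reading directly from disk avoids.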
In general, we recommend using InputMode.TENSORFLOW with native tf.data APIs for best performance and only using InputMode.SPARK if you have existing upstream data pipelines written in Spark. Ultimately, the choice depends on your specific use-case. For example, if you have an existing Spark data pipeline that is processing large amounts of data, and the data format is not easily consumable by TensorFlow, you can either A) re-process the data and persist it in a form that is compatible with tf.data and use InputMode.TENSORFLOW, or B) use InputMode.SPARK to feed the data to TensorFlow. Option A comes with the expense of storing another copy of the data, which can get expensive at larger scales. Option B essentially trades storage for additional compute. Additionally, if you are experimenting with new datasets and feature engineering, it can be easier to just process the datasets dynamically in Spark code instead of re-processing and persisting the data onto disk (in TensorFlow compatible format) each time you adjust a feature.
This was a design choice as the most obvious, easiest-to-reason-about mapping of a Spark resource to a TF node. The most visible example is that each executor's logs will only contain the logs of a single TF node, which is much easier to debug vs. interspersed logs. Additionally, most Spark/YARN configurations (e.g. memory) can be almost directly mapped to the TF node. Finally, in Hadoop/YARN environments (which was our primary focus), you can still have multiple executors assigned to the same physical host if there are enough available resources.
Note: for YARN, spark.executor.cores is usually defaulted to 1, so we just set spark.task.cpus=1 as well. If your environment has a different setting for spark.executor.cores, you can just set spark.task.cpus equal to that number to achieve the "one task per executor" goal.
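As a minimal sketch of the note above, the two settings can be generated together so they never drift apart. The helper names `one_task_per_executor_conf` and `as_submit_args` are hypothetical and not part of Spark or TFoS; only the two property names and the `--conf key=value` form of spark-submit are taken as given.

```python
def one_task_per_executor_conf(executor_cores=1):
    """Return Spark settings that make each task claim all cores of its executor."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be >= 1")
    return {
        "spark.executor.cores": str(executor_cores),
        "spark.task.cpus": str(executor_cores),  # equal values => one task per executor
    }

def as_submit_args(conf):
    """Render the settings as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

print(" ".join(as_submit_args(one_task_per_executor_conf(4))))
```

With `executor_cores=4`, a single task reserves all four cores, so Spark cannot schedule a second task on the same executor.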
Different algorithms are scalable in different ways, depending on whether they're limited by compute, memory, or I/O. In cases where an algorithm can run comfortably on a single node (completing in a few seconds/minutes), adding the overhead of launching nodes in a cluster, coordinating work between nodes, and communicating gradients/weights across the cluster will generally be slower. For example, the distributed MNIST example is provided to demonstrate the process of porting a simple, easy-to-understand, distributed TF application over to TFoS. It is not intended to run faster than a single-node TF instance, where you can actually load the entire training set in memory and finish 1000 steps in a few seconds.
Similarly, the other examples also just demonstrate the conversion of different TF apps, and again they do not demonstrate re-writing the code to be more scalable. Note that these examples were originally designed with different scalability models, as follows:
- cifar10 - single-node, multi-gpu
- mnist - multi-node, single-gpu-per-node
- imagenet - multi-node, single-gpu-per-node
- slim - single-node, multi-gpu OR multi-node, single-gpu-per-node
Finally, if an application takes a "painfully long" time to complete on a single node, then it is a good candidate for scaling. However, you will still need to understand which resource is limited and how to modify your code in order to scale for that resource in a distributed cluster.
Because we assume that one executor hosts only one TF node, you will encounter this error if your cluster is not configured accordingly. Specifically, this is seen when an RDD feeding task lands on an executor that doesn't contain a TF worker node, e.g. it lands on the PS node or even an idle executor.
There are several possible causes for this symptom:
- In Hadoop/YARN environments, you are likely missing the path to the libhdfs.so in your spark.executorEnv.LD_LIBRARY_PATH. TensorFlow 1.1+ requires this to be set.
- If you are using different settings than the original example, you may need to increase the amount of data being fed to the cluster with the --epochs argument, and you may need to increase the --steps argument accordingly.
- In some environments, the hostname returned for the cluster_spec is not routable, so the TF nodes cannot connect to each other to form the cluster. There was a PR (#109, merged 20170728) to switch to using IP addresses internally, so you may need to update your version of TFoS.
For InputMode.SPARK, how are epochs, steps, batch_size, and num_workers correlated and how should they be set?
In general, an epoch is one pass through all of the records in a dataset. For TensorFlow, one step is one pass of the execution graph, e.g. sess.run(), which is usually done on one batch of the dataset. So, the number of steps in one epoch is equal to the number of records divided by the batch size. And for distributed TensorFlow, these steps will be "sharded" across your workers. So, putting this all together, you'd have:
steps = epochs * num_records_per_epoch / batch_size
steps_per_worker = steps / num_workers
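The two formulas above can be checked with a short calculation. This is only a sketch: the function name `plan_steps` is hypothetical, the sample numbers are illustrative, and integer (floor) division is an assumption for the case where the values do not divide evenly.

```python
def plan_steps(epochs, num_records_per_epoch, batch_size, num_workers):
    """Apply the two formulas above, using floor division for uneven cases."""
    steps = epochs * num_records_per_epoch // batch_size
    steps_per_worker = steps // num_workers
    return steps, steps_per_worker

# Illustrative values only: 3 epochs over 60,000 records, batches of 100, 4 workers.
steps, steps_per_worker = plan_steps(
    epochs=3, num_records_per_epoch=60000, batch_size=100, num_workers=4
)
print(steps, steps_per_worker)
```

For these inputs the total is 1800 steps, or 450 steps per worker.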
For InputMode.SPARK, we use Spark's UnionRDD to simulate epochs when feeding a dataset as an RDD. However, TensorFlow applications can often finish before consuming all of the data, e.g. stopping at a maximum number of steps or when a metric reaches some value. To handle these cases, TensorFlowOnSpark provides a TFDataFeed.terminate() API to stop the Spark data feeding. Unfortunately, Spark must still fully consume and process all of the remaining partitions in order to consider that stage successful, so TensorFlowOnSpark just consumes the remaining Spark partitions without actually sending the data to the TensorFlow process. Obviously, this can be expensive and time-consuming, so it is highly recommended that you closely match your epochs, batch_size, steps, and workers to avoid unnecessary overhead.
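The cost of stopping early can be illustrated with a simplified simulation. It does not call TensorFlowOnSpark: the function `feed`, the `max_records` cutoff, and the use of list repetition as a stand-in for the union of per-epoch RDDs are all assumptions of this sketch.

```python
def feed(partitions, max_records):
    """Simulate feeding: forward records until the consumer stops, then only drain."""
    sent, drained = 0, 0
    terminated = False  # stands in for the state after the consumer stops the feed
    for partition in partitions:
        for _record in partition:
            if terminated:
                drained += 1  # still read from the partition, but never forwarded
                continue
            sent += 1
            if sent >= max_records:
                terminated = True
    return sent, drained

# One epoch = 4 partitions of 25 records; 3 epochs are simulated by repetition.
dataset = [list(range(25)) for _ in range(4)]
partitions = dataset * 3

sent, drained = feed(partitions, max_records=120)
print(sent, drained)
```

Here the consumer stops after 120 of 300 records, so 180 records are read only to be discarded, which is the overhead that well-matched settings avoid.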
For example, using the formulas above, if you want more epochs, you should increase your steps accordingly, or if you increase the batch_size, you should decrease steps, etc.
Only one RDD can be passed into the TensorFlow processes at a time. The data from the Spark RDDs are fed into a queue on each executor, from which the local TensorFlow process will pull data. If you have multiple input RDDs with the same cardinality (e.g. features + labels), you can simply zip the two RDDs together and then extract the components on the TensorFlow side. If you have two distinct RDDs (e.g. train vs. test), you can either run a separate evaluation job that just loads the latest checkpoint on a periodic basis and evaluates on the test RDD, or you can persist the validation RDD to disk and use a tf.data.Dataset to load that data during the evaluation phase.
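The zip-then-extract pattern described above might look like the following sketch. Plain Python lists stand in for the RDDs so the example runs without Spark, and `split_batch` is a hypothetical helper. In PySpark, the pairing step would be the RDD `zip` method, which expects both RDDs to have the same number of partitions and the same number of elements in each partition.

```python
# Plain lists stand in for two RDDs with the same cardinality.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
labels = [0, 1, 0]

# Stand-in for zipping the two RDDs: one (features, label) pair per record.
zipped = list(zip(features, labels))

def split_batch(batch):
    """Separate a batch of (features, label) pairs back into two parallel lists."""
    xs = [pair[0] for pair in batch]
    ys = [pair[1] for pair in batch]
    return xs, ys

# On the TensorFlow side, each pulled batch is split before being fed to the model.
xs, ys = split_batch(zipped[:2])
print(xs, ys)
```

Because each record carries its own label, the pairing survives however the records are batched.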