Frequently Asked Questions
This was a design choice, as the most obvious, easiest-to-reason-about mapping of a Spark resource (an executor) to a TF node. The most visible benefit is that each executor's logs contain the logs of only a single TF node, which is much easier to debug than interspersed logs. Additionally, most Spark/YARN configurations (e.g. memory) map almost directly to the TF node. Finally, in Hadoop/YARN environments (which were our primary focus), you can still have multiple executors assigned to the same physical host if there are enough available resources.
Different algorithms are scalable in different ways, depending on whether they're limited by compute, memory, or I/O. In cases where an algorithm can run comfortably on a single node (completing in a few seconds/minutes), adding the overhead of launching nodes in a cluster, coordinating work between nodes, and communicating gradients/weights across the cluster will generally be slower. For example, the distributed MNIST example is provided mostly to demonstrate the process of porting an (easy-to-understand) distributed TF application to TFoS. It is not intended to run faster than a single-node TF instance, where you can load the entire training set in memory and finish 1000 steps in a few seconds.
Also note that the provided examples are intended to demonstrate the process of converting different types of TF code to TFoS, and not to demonstrate re-writing the code to be more scalable. Currently, these examples have different scalability models, as follows:
- cifar10 - single-node, multi-gpu
- mnist - multi-node, single-gpu-per-node
- imagenet - multi-node, single-gpu-per-node
- slim - single-node, multi-gpu OR multi-node, single-gpu-per-node
Finally, if an application takes a "painfully long" time to complete on a single node, then it is a good candidate for scaling. However, you will still need to understand which resource is the limiting factor and how to change your code to scale for that resource.
Because we assume that one executor hosts only one TF node, you will encounter this error if your cluster is not configured accordingly. Specifically, it occurs when an RDD feeding task lands on an executor that doesn't host a TF worker node, e.g. it lands on a PS node or even an idle executor.
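In other words, the number of Spark executors should match the TF cluster size. The sketch below is a minimal illustration of that sizing, modeled loosely on the bundled examples; the names `main_fun`, `my_tfos_app`, and `data_rdd` are hypothetical placeholders for your own application.

```python
# A minimal sketch: launch the Spark job with exactly as many executors as
# TF nodes (workers + PS), so there are no idle executors for RDD feeding
# tasks to land on.
from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Per-executor TF code goes here; ctx.job_name is "worker" or "ps".
    pass

num_executors = 4   # total TF nodes; should match --num-executors / spark.executor.instances
num_ps = 1          # parameter servers, allocated from the same set of executors

sc = SparkContext(conf=SparkConf().setAppName("my_tfos_app"))
data_rdd = sc.parallelize(range(1000))   # placeholder for the training data RDD

cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                        tensorboard=False, input_mode=TFCluster.InputMode.SPARK)
cluster.train(data_rdd)   # feeds RDD partitions to the TF worker nodes
cluster.shutdown()
sc.stop()
```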
There are two main causes for this symptom:
- You are missing the path to `libhdfs.so` in your `spark.executorEnv.LD_LIBRARY_PATH`. TensorFlow 1.1+ requires this to be set, and unfortunately, it just hangs if it cannot find this library.
- If you are using more workers or a higher `--steps` setting, you may need to increase the amount of data being fed to the cluster with the `--epochs` argument.
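For the first cause, one way to set the executor environment is via the `spark.executorEnv.LD_LIBRARY_PATH` property. The sketch below assumes a typical Hadoop native-library location; the actual directory containing `libhdfs.so` varies by distribution, so verify the path on your cluster.

```python
# A minimal sketch, assuming libhdfs.so lives under a typical Hadoop
# native-library directory; substitute the real location from your cluster.
import os
from pyspark import SparkConf, SparkContext

lib_hdfs = "/usr/lib/hadoop/lib/native"   # assumed path; verify on your cluster

conf = (SparkConf()
        .setAppName("my_tfos_app")
        # TensorFlow 1.1+ hangs at startup if the executors cannot load
        # libhdfs.so, so expose its directory via LD_LIBRARY_PATH on every executor.
        .set("spark.executorEnv.LD_LIBRARY_PATH",
             lib_hdfs + ":" + os.environ.get("LD_LIBRARY_PATH", "")))

sc = SparkContext(conf=conf)
```

The same setting can also be passed on the command line with `--conf spark.executorEnv.LD_LIBRARY_PATH=...` when launching via `spark-submit`.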