From 28c9aec0092e73dadf3591e8bbee5033bc20ad18 Mon Sep 17 00:00:00 2001
From: Kishan Bhoopalam
Date: Fri, 19 Jan 2024 12:59:14 -0600
Subject: [PATCH] Update distributed-jobs.rst to address ulimit issue

Distributed jobs will result in `OSError: [Errno 24] Too many open files`
unless ulimit is increased. I put a note in the instructions in case others
run into this issue.
---
 docs/source/running-jobs/distributed-jobs.rst | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst
index 0f9ce453561..1e1096e1b3a 100644
--- a/docs/source/running-jobs/distributed-jobs.rst
+++ b/docs/source/running-jobs/distributed-jobs.rst
@@ -46,6 +46,19 @@ In the above,
 - :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 4 V100s;
 - The highlighted lines in the ``run`` section show common environment variables that are useful for launching distributed training, explained below.
 
+.. note::
+
+    If you encounter the error ``[Errno 24] Too many open files``, your process has exceeded the maximum number of open file descriptors allowed by the system. This often occurs in high-load scenarios where your application opens more files simultaneously than the system's limit permits.
+
+    To resolve this issue, increase the open-file limit with ``ulimit`` in your shell:
+
+    ::
+
+        ulimit -n 65535
+
+    This command sets the open-file limit to 65535 for the current shell session only. For a permanent change, you may need to adjust the system's configuration files or consult the documentation for your operating system.
+
+
 Environment variables
 -----------------------------------------
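
As a complement to the ``ulimit -n 65535`` command in the note, here is a minimal sketch of checking and raising the same limit from inside a Python process, using the standard-library ``resource`` module. It is not part of this patch or of any SkyPilot API. The helper name ``raise_open_file_limit`` is hypothetical, the sketch assumes a POSIX system (``resource`` is unavailable on Windows), and the change affects only the calling process and its children.

```python
import resource


def raise_open_file_limit(target=65535):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    Hypothetical helper for illustration. Returns the (soft, hard) pair
    in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot raise its soft limit above the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Leave the limit alone if it is already at least as high as requested.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some platforms may reject values above a kernel-imposed cap;
        # in that case the previous limit is kept.
        pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_open_file_limit()
    print(f"open-file limit: soft={soft}, hard={hard}")
```

Capping the request at the hard limit keeps the call from failing outright on machines where 65535 exceeds what an unprivileged user may set. In that situation the hard limit itself has to be raised through the system's configuration, as the note says.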