Remote Development and Debugging

Albert Zeyer edited this page Dec 19, 2024 · 5 revisions

JetBrains Gateway can be used to run PyCharm remotely, similar to how VSCode does remote development. See also PyCharm Configuration. It reuses the local SSH settings, and it now also works with any SSH ProxyCommand (e.g. for further SSH tunneling). That's how I run PyCharm remotely on my local i6 desktop or on the RWTH ITC cluster login node.

PyCharm has a Python debug server which works over TCP. Open the debug options, add a new configuration, and select "Python Debug Server". The dialog then explains what to do next. You specify the port on which the server listens. The client, i.e. the process running the Python code you want to debug, then connects to the server via TCP.
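Since the client has to open a TCP connection to the debug server, it can help to first check from the client machine (e.g. a compute node) whether the server host and port are reachable at all. The following is a minimal sketch using only the Python standard library. The function name `can_reach_debug_server` is made up for this example and is not part of PyCharm or pydevd.

```python
import socket


def can_reach_debug_server(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, `can_reach_debug_server("login18-1", 31337)` run on the compute node should return `True` once the debug server is listening. Note that this probe itself opens a short-lived connection, which the debug server might register as a client that connects and disconnects right away, so it is probably best used only for troubleshooting.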

  • On the client, where you want to debug, you need to run: pip install pydevd-pycharm~=232.10227.11 (the version should match your PyCharm build, so it may differ for other PyCharm versions). Alternatively, set sys.path such that Python finds the pydevd_pycharm module (it should be somewhere inside the PyCharm directory).
  • Then add this code at the point where you want to start debugging: import pydevd_pycharm; pydevd_pycharm.settrace($SERVER, port=$SERVER_PORT). Other options of the settrace function: trace_only_current_thread, suspend, stdoutToServer, stderrToServer.
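The two steps above can be wrapped into a small helper so that the debugger is only attached when explicitly requested. This is a sketch under some assumptions: the environment variable names `PYCHARM_DEBUG_HOST` and `PYCHARM_DEBUG_PORT` and both function names are hypothetical, chosen only for this example. The `settrace` options are the ones listed above.

```python
import os


def settrace_kwargs_from_env(env=None):
    """Build arguments for pydevd_pycharm.settrace from environment variables.

    PYCHARM_DEBUG_HOST and PYCHARM_DEBUG_PORT are hypothetical names used
    only in this sketch. Returns None if either of them is not set.
    """
    env = os.environ if env is None else env
    host = env.get("PYCHARM_DEBUG_HOST")
    port = env.get("PYCHARM_DEBUG_PORT")
    if not host or not port:
        return None
    return {
        "host": host,
        "port": int(port),
        "suspend": False,  # do not pause right at the settrace call
        "stdoutToServer": True,  # forward stdout to the PyCharm console
        "stderrToServer": True,  # forward stderr to the PyCharm console
        "trace_only_current_thread": False,
    }


def maybe_attach_debugger(env=None):
    """Connect to the PyCharm debug server if configured. Return True if attached."""
    kwargs = settrace_kwargs_from_env(env)
    if kwargs is None:
        return False
    # Imported lazily so that the package is only needed when debugging.
    import pydevd_pycharm

    host = kwargs.pop("host")
    pydevd_pycharm.settrace(host, **kwargs)
    return True
```

With this, calling `maybe_attach_debugger()` does nothing in a normal run, and setting both variables before starting the job enables debugging without touching the code.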

As an example, I currently do this on the RWTH ITC cluster: I run PyCharm on the login node via Gateway and start some multi-GPU training. To debug the rank 1 instance, I have this in my RETURNN config:

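def startup_callback(*, config, **kwargs):
    import returnn.torch.distributed
    ctx = returnn.torch.distributed.get_ctx(config=config)
    # Only the rank 1 process connects to the debug server on the login node.
    if ctx and ctx.rank() == 1:
        import pydevd_pycharm
        # suspend=False: do not pause right here, just connect.
        pydevd_pycharm.settrace("login18-1", port=31337, suspend=False)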

Then it jumps into the debugger and I can debug as usual, just as if the process were running locally.

(Note that distributed PyTorch (rendezvous, c10d, etc.) has multiple timeouts, which might become a problem while a process is paused in the debugger, although this has not been a problem for me so far.)

(I had another stupid problem before: I started the training in an interactive session with srun ... --pty bash -i, but I forgot to set the memory, and then my procs died. Procs can also die because of timeouts or other reasons, and only later did I find via dmesg that the cgroup OOM killer had killed them. This caused very weird behavior, because they crashed at mostly similar locations, but not exactly the same ones.)
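To spot such OOM kills faster, one can filter the kernel log for the relevant messages. The following is a rough sketch: the function name `find_oom_lines` is made up for this example, and the patterns are an assumption about typical kernel messages, whose exact wording varies between kernel versions.

```python
import re
from typing import List

# Assumed patterns: phrases like "out of memory" or "oom-kill" / "oom_kill"
# in the kernel log. The exact wording depends on the kernel version.
_OOM_PATTERN = re.compile(r"out of memory|oom[-_ ]kill", re.IGNORECASE)


def find_oom_lines(kernel_log_text: str) -> List[str]:
    """Return all lines of the given kernel log text that look OOM-related."""
    return [
        line
        for line in kernel_log_text.splitlines()
        if _OOM_PATTERN.search(line)
    ]


# Possible usage (reading dmesg may require extra privileges on some systems):
#   import subprocess
#   log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   for line in find_oom_lines(log):
#       print(line)
```

This has to run on the node where the procs were running, since the kernel log is per machine.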

(Note: PyCharm 2024.3 has a bug where the connection just hangs, see PY-77357, which also includes a workaround.)