
[Bug]: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use #8204

Closed · 1 task done · youkaichao opened this issue Sep 5, 2024 · 8 comments · Fixed by #8491
Labels: bug (Something isn't working)

@youkaichao (Member)

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

This is a bug we encounter a lot in our CI, e.g. https://buildkite.com/vllm/ci-aws/builds/8098#0191bf43-446d-411d-80c7-3ba10bc392e8/192-1557

I have been tracking this for months and have added more logging information to help with debugging.

From the logging information:

[2024-09-05T00:38:34Z] INFO: Started server process [60858]
[2024-09-05T00:38:34Z] INFO: Waiting for application startup.
[2024-09-05T00:38:34Z] INFO: Application startup complete.
[2024-09-05T00:38:34Z] ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 44319): [errno 98] address already in use
[2024-09-05T00:38:34Z] INFO: Waiting for application shutdown.
[2024-09-05T00:38:34Z] INFO: Application shutdown complete.
[2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] port 44319 is used by process psutil.Process(pid=60914, name='pt_main_thread', status='sleeping', started='17:37:05') launched with command:
[2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=18) --multiprocessing-fork

We can see that the server process has PID 60858, and that port 44319 is used by process 60914. Scrolling up a bit, we find:

[2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/b6851f4d-4d78-46b8-baba-ae179b0088c2 for RPC Path.
[2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:176] Started engine process with PID 60914

It becomes clear that process 60914 is the engine process.
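For context, here is a minimal sketch of the kind of lookup that can produce a debug line like the one above, using psutil to find which process currently holds a port. This is only an illustration, not vLLM's actual launcher.py code, and `find_process_using_port` is a hypothetical helper name:

```python
from typing import Optional

import psutil


def find_process_using_port(port: int) -> Optional[psutil.Process]:
    """Return the process holding a TCP socket on `port`, if any.

    Note: listing other processes' connections may require elevated
    privileges on some systems.
    """
    for conn in psutil.net_connections(kind="tcp"):
        if conn.laddr and conn.laddr.port == port and conn.pid is not None:
            return psutil.Process(conn.pid)
    return None


if __name__ == "__main__":
    proc = find_process_using_port(44319)
    if proc is not None:
        print(f"port 44319 is used by {proc} launched with command:")
        print(" ".join(proc.cmdline()))
```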

I think the problem here is that we only bind the port after the engine is ready. During engine setup, the engine may use some ports for Ray or for distributed communication, and one of them can happen to be the port the API server intends to bind later.

There are two possible solutions:

  1. The API server binds the port immediately after start, and returns an unready status when a client queries the /health endpoint.
  2. The API server binds the port immediately (via `socket.socket(socket.AF_INET, socket.SOCK_STREAM).bind(("", uvicorn_kwargs["port"]))`), and after the engine is up, it releases the port and binds it again to serve requests.

I think option 1 might be better. Option 2 would suffer from the fact that clients will get 404 Not Found before the engine is up, because this is just a raw socket without any response.
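A rough sketch of what option 1 could look like, assuming a FastAPI/uvicorn frontend; this is not vLLM's actual code, and `start_engine` and `engine_ready` are hypothetical placeholders:

```python
import asyncio
from contextlib import asynccontextmanager

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse


async def start_engine(ready: asyncio.Event) -> None:
    # Placeholder for the real engine startup, which can take a while and
    # may itself open ports (e.g. for Ray or distributed communication).
    await asyncio.sleep(5)
    ready.set()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Kick off engine startup in the background and return immediately,
    # so uvicorn binds the listening socket right away instead of only
    # after the engine is ready.
    app.state.engine_ready = asyncio.Event()
    task = asyncio.create_task(start_engine(app.state.engine_ready))
    yield
    task.cancel()


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health(request: Request):
    # Report 503 until the engine is up; clients can poll this endpoint.
    if not request.app.state.engine_ready.is_set():
        return JSONResponse({"status": "starting"}, status_code=503)
    return {"status": "ok"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```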

cc @robertgshaw2-neuralmagic @njhill @joerunde

Also cc @richardliaw @rkooo567: how do we turn on verbose Ray logging, so that we can verify whether the port is indeed used by Ray?
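For reference, two knobs that I believe exist for more verbose Ray logging; please double-check against the Ray docs for the Ray version used in CI:

```python
import logging

import ray

# Raise the Python-side (driver/worker) log level when initializing Ray.
ray.init(logging_level=logging.DEBUG)

# For the C++ backend (raylet) logs, setting the RAY_BACKEND_LOG_LEVEL
# environment variable before starting Ray is reported to help, e.g.:
#   RAY_BACKEND_LOG_LEVEL=debug python <entrypoint>
```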

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@joerunde (Collaborator) commented Sep 5, 2024

(1) might require some more inversion of the startup code to boot the server first and have it wait for an engine to come up, whereas currently the engine is booted first and then passed to the server initializer. But I agree, that's a good solution: after determining that a port is available, it should be bound immediately so that subsequent calls to get_open_port() don't return the same port.

> Option 2 would suffer from the fact that clients will get 404 Not Found before the engine is up, because this is just a raw socket without any response.

I don't think that's true, though: if there's nothing actually responding to connections on the socket, then all clients would just drop with a connection error after their connection timeout, right? If so, that might be an acceptable quick-and-dirty solution.
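A rough sketch of the "bind immediately and keep the socket" idea, assuming a FastAPI/uvicorn frontend (not the actual vLLM code, and `bind_free_port` is a hypothetical helper): keep the socket bound from the moment the port is chosen, and hand it to uvicorn, which can serve on a pre-bound socket.

```python
import asyncio
import socket

import uvicorn
from fastapi import FastAPI

app = FastAPI()


def bind_free_port(host: str = "0.0.0.0") -> socket.socket:
    """Ask the OS for a free port and keep the socket open, so no other
    process (engine worker, Ray) can grab the port in the meantime."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, 0))
    return sock


async def serve(sock: socket.socket) -> None:
    # uvicorn can serve on an already-bound socket, so the port is never
    # released between "choose a port" and "start serving".
    config = uvicorn.Config(app, log_level="info")
    await uvicorn.Server(config).serve(sockets=[sock])


if __name__ == "__main__":
    sock = bind_free_port()
    print("API server will listen on port", sock.getsockname()[1])
    # ... engine startup would happen here; the port stays reserved ...
    asyncio.run(serve(sock))
```

With this approach the race is avoided regardless of whether the engine startup uses Ray or some other distributed setup, since the port is never left unbound.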

@youkaichao (Member, Author)

> I don't think that's true, though: if there's nothing actually responding to connections on the socket, then all clients would just drop with a connection error after their connection timeout, right? If so, that might be an acceptable quick-and-dirty solution.

I'm not sure, but I think it is worth a try!

@youkaichao (Member, Author)

Another example where the ports seem to be held by Ray:

https://buildkite.com/vllm/ci-aws/builds/8239#0191ce16-2554-4377-b5a1-d66e87987a7f

[2024-09-07T20:40:25Z] DEBUG 09-07 13:40:25 launcher.py:64] port 53777 is used by process psutil.Process(pid=14200, name='ray::RayWorkerWrapper', status='sleeping', started='13:39:15') launched with command:
[2024-09-07T20:40:25Z] DEBUG 09-07 13:40:25 launcher.py:64] ray::RayWorkerWrapper

@rkooo567 (Collaborator)

> Another example where the ports seem to be held by Ray:

Quick question @youkaichao: I understand the port is taken by Ray, but it seems like the OpenAI server tries to bind it anyway. Do you know why? (I assume the server should have a way to avoid binding to a port that's already in use.) Or is the OpenAI server choosing a port ahead of time as well?

@youkaichao (Member, Author)

@rkooo567 we first choose a port for the API server, then the API server starts the engine, and only after that does it actually bind the port.

So when we start the engine, the API server is not yet listening on that port, and Ray may happen to use it.

@rkooo567 (Collaborator)

I see. I like solution 1 in this case; it seems like a clean solution.

@russellb (Collaborator) commented Nov 5, 2024

Please see #9737 and #10012 for follow-ups to this issue.
