
[Bug]: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use #8204

Closed · 1 task done · youkaichao opened this issue Sep 5, 2024 · 8 comments · Fixed by #8491
Labels: bug (Something isn't working)

@youkaichao (Member)

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

This is a bug we encounter a lot in our CI, e.g. https://buildkite.com/vllm/ci-aws/builds/8098#0191bf43-446d-411d-80c7-3ba10bc392e8/192-1557

I have been tracking this for months and have added more logging information to help with debugging.

From the logging information:

[2024-09-05T00:38:34Z] INFO: Started server process [60858]
[2024-09-05T00:38:34Z] INFO: Waiting for application startup.
[2024-09-05T00:38:34Z] INFO: Application startup complete.
[2024-09-05T00:38:34Z] ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 44319): [errno 98] address already in use
[2024-09-05T00:38:34Z] INFO: Waiting for application shutdown.
[2024-09-05T00:38:34Z] INFO: Application shutdown complete.
[2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] port 44319 is used by process psutil.Process(pid=60914, name='pt_main_thread', status='sleeping', started='17:37:05') launched with command:
[2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=18) --multiprocessing-fork

We can see that the server process has PID 60858, and that port 44319 is used by process 60914. Scrolling up a bit, we find:

[2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/b6851f4d-4d78-46b8-baba-ae179b0088c2 for RPC Path.
[2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:176] Started engine process with PID 60914

It becomes clear that process 60914 is the engine process.
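For context, here is a minimal sketch of the kind of lookup that can produce a debug line like the one above, using psutil to find which process currently holds a port. This is only an illustration, not vLLM's actual launcher.py code, and `find_process_using_port` is a hypothetical helper name:

```python
from typing import Optional

import psutil


def find_process_using_port(port: int) -> Optional[psutil.Process]:
    """Return the process holding a TCP socket on `port`, if any.

    Note: listing other processes' connections may require elevated
    privileges on some systems.
    """
    for conn in psutil.net_connections(kind="tcp"):
        if conn.laddr and conn.laddr.port == port and conn.pid is not None:
            return psutil.Process(conn.pid)
    return None


if __name__ == "__main__":
    proc = find_process_using_port(44319)
    if proc is not None:
        print(f"port 44319 is used by {proc} launched with command:")
        print(" ".join(proc.cmdline()))
```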

I think the problem here is that we only bind the port after the engine is ready. During engine setup, the engine may use some ports for Ray or for distributed communication, and one of them can happen to be the port the API server intends to bind later.

There are two possible solutions:

  1. The API server binds the port immediately after start, and returns an unready status when a client queries the /health endpoint.
  2. The API server binds the port immediately (via `socket.socket(socket.AF_INET, socket.SOCK_STREAM).bind(("", uvicorn_kwargs["port"]))`), and after the engine is up, it releases the port and binds it again to serve requests.

I think option 1 might be better. Option 2 would suffer from the fact that clients will get 404 Not Found before the engine is up, because this is just a raw socket without any response.
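A rough sketch of what option 1 could look like, assuming a FastAPI/uvicorn frontend; this is not vLLM's actual code, and `start_engine` and `engine_ready` are hypothetical placeholders:

```python
import asyncio
from contextlib import asynccontextmanager

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse


async def start_engine(ready: asyncio.Event) -> None:
    # Placeholder for the real engine startup, which can take a while and
    # may itself open ports (e.g. for Ray or distributed communication).
    await asyncio.sleep(5)
    ready.set()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Kick off engine startup in the background and return immediately,
    # so uvicorn binds the listening socket right away instead of only
    # after the engine is ready.
    app.state.engine_ready = asyncio.Event()
    task = asyncio.create_task(start_engine(app.state.engine_ready))
    yield
    task.cancel()


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health(request: Request):
    # Report 503 until the engine is up; clients can poll this endpoint.
    if not request.app.state.engine_ready.is_set():
        return JSONResponse({"status": "starting"}, status_code=503)
    return {"status": "ok"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```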

cc @robertgshaw2-neuralmagic @njhill @joerunde

Also cc @richardliaw @rkooo567: how do we turn on verbose Ray logging, so that we can verify whether the port is indeed used by Ray?
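For reference, two knobs that I believe exist for more verbose Ray logging; please double-check against the Ray docs for the Ray version used in CI:

```python
import logging

import ray

# Raise the Python-side (driver/worker) log level when initializing Ray.
ray.init(logging_level=logging.DEBUG)

# For the C++ backend (raylet) logs, setting the RAY_BACKEND_LOG_LEVEL
# environment variable before starting Ray is reported to help, e.g.:
#   RAY_BACKEND_LOG_LEVEL=debug python <entrypoint>
```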

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@joerunde (Collaborator) commented Sep 5, 2024

(1) might require some more inversion of the startup code to boot the server first and have it wait for an engine to come up, whereas currently the engine is booted first and then passed to the server initializer. But I agree, that's a good solution: after determining that a port is available, it should be bound immediately so that subsequent calls to get_open_port() don't return the same port.

> Option 2 would suffer from the fact that clients will get 404 Not Found before the engine is up, because this is just a raw socket without any response.

I don't think that's true, though: if there's nothing actually responding to connections on the socket, then all clients would just drop with a connection error after their connection timeout, right? If so, that might be an acceptable quick-and-dirty solution.
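A rough sketch of the "bind immediately and keep the socket" idea, assuming a FastAPI/uvicorn frontend (not the actual vLLM code, and `bind_free_port` is a hypothetical helper): keep the socket bound from the moment the port is chosen, and hand it to uvicorn, which can serve on a pre-bound socket.

```python
import asyncio
import socket

import uvicorn
from fastapi import FastAPI

app = FastAPI()


def bind_free_port(host: str = "0.0.0.0") -> socket.socket:
    """Ask the OS for a free port and keep the socket open, so no other
    process (engine worker, Ray) can grab the port in the meantime."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, 0))
    return sock


async def serve(sock: socket.socket) -> None:
    # uvicorn can serve on an already-bound socket, so the port is never
    # released between "choose a port" and "start serving".
    config = uvicorn.Config(app, log_level="info")
    await uvicorn.Server(config).serve(sockets=[sock])


if __name__ == "__main__":
    sock = bind_free_port()
    print("API server will listen on port", sock.getsockname()[1])
    # ... engine startup would happen here; the port stays reserved ...
    asyncio.run(serve(sock))
```

With this approach the race is avoided regardless of whether the engine startup uses Ray or some other distributed setup, since the port is never left unbound.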

@youkaichao (Member, Author)

> I don't think that's true, though: if there's nothing actually responding to connections on the socket, then all clients would just drop with a connection error after their connection timeout, right? If so, that might be an acceptable quick-and-dirty solution.

I'm not sure, but I think it is worth a try!

@youkaichao (Member, Author)

Another example where the ports seem to be held by Ray:

https://buildkite.com/vllm/ci-aws/builds/8239#0191ce16-2554-4377-b5a1-d66e87987a7f

[2024-09-07T20:40:25Z] DEBUG 09-07 13:40:25 launcher.py:64] port 53777 is used by process psutil.Process(pid=14200, name='ray::RayWorkerWrapper', status='sleeping', started='13:39:15') launched with command:
[2024-09-07T20:40:25Z] DEBUG 09-07 13:40:25 launcher.py:64] ray::RayWorkerWrapper

@rkooo567 (Collaborator)

> Another example where the ports seem to be held by Ray:

Quick question @youkaichao: I understand the port is taken by Ray, but it seems like the OpenAI server tries to bind it anyway. Do you know why? (I assume the server should have a way to avoid binding to a port that's already in use.) Or is the OpenAI server choosing a port ahead of time as well?

@youkaichao (Member, Author)

@rkooo567 we first choose a port for the API server, then the API server starts the engine, and only after that does it actually bind the port.

So when we start the engine, the API server is not yet listening on that port, and Ray may happen to use it.

@rkooo567 (Collaborator)

I see. I like solution 1 in this case; it seems like a clean solution.

@russellb (Collaborator) commented Nov 5, 2024

Please see #9737 and #10012 for follow-ups to this issue.
