[Bug]: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use #8204
Comments
(1) might require some inversion of the startup code to boot the server first and have it wait for an engine to come up, whereas currently the engine is booted first and then passed to the server initializer. But I agree, that's a good solution: after determining that a port is available, it should be bound immediately so that subsequent calls to
I don't think that's true, though. If nothing is actually responding to connections on the socket, all clients would just fail with a connection error after their connection timeout, right? If so, that might be an acceptable quick-and-dirty solution.
I'm not sure, but I think it is worth a try!
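One way to check what a client actually sees is a small experiment with plain sockets. The sketch below is only an illustration, not vLLM code: `connect_to_bound_but_idle_port` is a hypothetical helper that binds a TCP socket without ever calling `listen()` and then tries to connect to it. The exact error (refused immediately versus a timeout) is OS dependent, so the function just returns whatever exception the client got.

```python
import socket


def connect_to_bound_but_idle_port():
    """Bind a TCP socket without calling listen(), then try to connect to it.

    Returns the exception raised on the client side, or None if the
    connection unexpectedly succeeded.
    """
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))  # reserve some free port, but never listen() on it
    port = holder.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)  # hypothetical client-side timeout
    try:
        client.connect(("127.0.0.1", port))
        return None
    except OSError as exc:  # covers both a refused connection and a timeout
        return exc
    finally:
        client.close()
        holder.close()


if __name__ == "__main__":
    print(repr(connect_to_bound_but_idle_port()))
```

Either way the client gets a socket-level error rather than an HTTP response, since nothing on the raw socket speaks HTTP.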
another example that the ports seem to be held by https://buildkite.com/vllm/ci-aws/builds/8239#0191ce16-2554-4377-b5a1-d66e87987a7f
Quick question @youkaichao. I understand the port is taken by ray, but it seems like the OpenAI server tries to bind it. Do you know why? (I assume the server should have a way to avoid binding to a port that's already in use.) Or is the OpenAI server choosing a port ahead of time as well?
@rkooo567 we first choose a port for the api server, then the api server starts the engine, and only after that does it bind that port. So while the engine is starting, the api server is not listening on the port yet, and ray may happen to use that port.
I see. I like solution 1 in this case. Seems like a clean solution.
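The ordering described here can be reproduced with plain sockets. This is a minimal sketch under stated assumptions, not the actual startup code: `pick_free_port` and `simulate_race` are hypothetical names, and a second socket stands in for the engine (or ray) taking the port during startup.

```python
import socket


def pick_free_port() -> int:
    """Ask the OS for a currently free port, then release it immediately."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def simulate_race() -> bool:
    """Return True if the late bind fails because something else took the port."""
    port = pick_free_port()  # step 1: the port is chosen but not held

    # step 2: stand-in for the engine (or ray) grabbing the same port during startup
    squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    squatter.bind(("127.0.0.1", port))
    squatter.listen()

    # step 3: the api server finally tries to bind the port it chose earlier
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", port))
        return False
    except OSError:  # EADDRINUSE, reported as "[Errno 98]" on Linux
        return True
    finally:
        server.close()
        squatter.close()


if __name__ == "__main__":
    print("late bind failed:", simulate_race())
```

The gap between step 1 and step 3 is the window in which any other process can claim the port.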
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
This is a bug we encounter a lot in our CI, e.g. https://buildkite.com/vllm/ci-aws/builds/8098#0191bf43-446d-411d-80c7-3ba10bc392e8/192-1557
I have been tracking this for months, and have tried adding more logging information to help with debugging.
from the logging information:
we can see that the server process is pid 60858, and port 44319 is used by process 60914. Scrolling up a little bit, we can find:
it becomes clear that this is the engine process.
I think the problem here is that we only bind the port after the engine is ready. During engine setup, it might use some ports for ray or for distributed communication.
there are two possible solutions:
1. … `/healthy` endpoint
2. … (`socket.socket(socket.AF_INET, socket.SOCK_STREAM).bind(("", uvicorn_kwargs["port"]))`), and after the engine is up, it releases the port and binds again to serve requests

I think 1 might be better. 2 would suffer from the fact that clients will get 404 not found before the engine is up, because this is just a raw socket without any response.
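A variation on the early-bind idea, sketched below, keeps the bound socket and starts listening on it once the engine is ready, instead of releasing the port and binding again (which would reopen the window). This is an assumption-laden illustration, not the project's implementation: `reserve_port`, `start_engine_stub`, and `boot` are hypothetical names, and it assumes the HTTP server can be handed an already bound socket.

```python
import socket


def reserve_port(host: str, port: int) -> socket.socket:
    """Bind the serving socket up front so nothing else can claim the port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    return sock


def start_engine_stub(taken_port: int) -> bool:
    """Hypothetical stand-in for engine startup that tries to grab the same port.

    Returns True if the engine-side bind was rejected.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("127.0.0.1", taken_port))
            return False
        except OSError:
            return True


def boot() -> tuple:
    # Port 0 is used only to keep the sketch self-contained; a real server
    # would pass its configured port here.
    sock = reserve_port("127.0.0.1", 0)
    port = sock.getsockname()[1]

    engine_was_blocked = start_engine_stub(port)

    # The engine is "up": start accepting on the socket held since startup.
    sock.listen()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        client.settimeout(2.0)
        try:
            client.connect(("127.0.0.1", port))
            client_connected = True
        except OSError:
            client_connected = False
    sock.close()
    return engine_was_blocked, client_connected


if __name__ == "__main__":
    print(boot())
```

Because the socket is never released between reservation and serving, the engine cannot end up on the server's port.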
cc @robertgshaw2-neuralmagic @njhill @joerunde
also cc @richardliaw @rkooo567 how to turn on verbose ray logging, so that we can verify if the port is indeed used by ray.