[Frontend] Add readiness and liveness endpoints to OpenAI API server #7078

Closed
Changes from 3 commits
31 changes: 30 additions & 1 deletion vllm/entrypoints/openai/api_server.py
@@ -30,7 +30,9 @@
DetokenizeResponse,
EmbeddingRequest, ErrorResponse,
TokenizeRequest,
-TokenizeResponse)
+TokenizeResponse,
+LivenessResponse,
+ReadinessResponse)
# yapf: enable
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
@@ -89,6 +91,33 @@ async def health() -> Response:
    await openai_serving_chat.engine.check_health()
    return Response(status_code=200)

@router.get(
    "/liveness",
    response_model=LivenessResponse,
    name="liveness",
    tags=["technical"],
)
async def get_liveness() -> LivenessResponse:
    """Liveness probe for k8s"""
    liveness_msg = LivenessResponse(alive="ok")
    return liveness_msg


@router.get(
    "/readiness",
    response_model=ReadinessResponse,
    name="readiness",
    tags=["technical"],
)
async def get_readiness() -> ReadinessResponse:
    """Readiness probe for k8s"""
    model_weights = openai_serving_chat.engine.engine.model_executor.driver_worker.model_runner.model_memory_usage

    if model_weights > 0:
        return ReadinessResponse(ready="ok")
    else:
        return ReadinessResponse(ready="ko")
Member:

> @mfournioux Thanks for the great PR! I'm kind of new to Kubernetes, so I'm a little confused here, but it seems like the readiness probe is going to return a 200 OK response irrespective of whether the model is loaded or not, right? I was under the impression that K8s probes check the status code, not necessarily the response body. Should we add a test to see what it returns when the model is in fact not loaded?

> Regarding your question: in Kubernetes, when you configure your deployment, you can use a startup probe to determine when a container application has started. Liveness and readiness probes do not run until the startup probe succeeds, which keeps them from interfering with your application's startup. This is particularly useful for slow-starting containers (for model loading, for instance), since it avoids them getting killed before they are up and running.

I think the confusion stems from here. It seems that the readiness response incorrectly returns a 200 response (with the value "ko", which I'm not sure means anything to Kubernetes) even when the model hasn't finished loading yet.
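
For context, here is a hedged sketch of how a deployment might wire these endpoints up as probes using the official `kubernetes` Python client. The container name, image tag, port, and thresholds are illustrative assumptions, not values taken from this PR:

```python
from kubernetes import client

# Assumed vLLM server port; adjust to your deployment.
PORT = 8000

# Startup probe: give the container time to load model weights before the
# other probes begin (here up to 30 * 10s = 5 minutes).
startup_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/readiness", port=PORT),
    failure_threshold=30,
    period_seconds=10,
)

# Liveness probe: restart the container if the API server stops responding.
liveness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/liveness", port=PORT),
    period_seconds=10,
)

# Readiness probe: only route traffic once the model is actually loaded.
readiness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/readiness", port=PORT),
    period_seconds=10,
)

container = client.V1Container(
    name="vllm-openai",                # hypothetical container name
    image="vllm/vllm-openai:latest",   # hypothetical image tag
    startup_probe=startup_probe,
    liveness_probe=liveness_probe,
    readiness_probe=readiness_probe,
)
```

Because the startup probe gates the other two, a slow model load does not trigger liveness restarts, which is the behavior described above.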

Contributor Author:

Ok, I see, thanks for the clarification. The readiness probe should not return 200 when the server is not ready; I will correct it.

Member (@DarkLight1337, Aug 5, 2024):
I think if None is returned from the function, then 200 OK is still returned. You should return an error response (or whatever Kubernetes expects) explicitly.

Contributor Author:

Indeed, Kubernetes treats any status code greater than or equal to 200 and less than 400 as success; any other code (below 200, or at or above 400) indicates failure.
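
To illustrate that last point, a minimal sketch of how the handler could return an explicit non-2xx status while the model is still loading. It reuses the `router` and `openai_serving_chat` names from the diff above, and it is one possible shape, not the fix that was actually committed:

```python
from fastapi.responses import JSONResponse

@router.get("/readiness", name="readiness", tags=["technical"])
async def get_readiness() -> JSONResponse:
    """Readiness probe for k8s: 200 once weights are loaded, 503 otherwise."""
    model_weights = (openai_serving_chat.engine.engine.model_executor
                     .driver_worker.model_runner.model_memory_usage)
    if model_weights > 0:
        return JSONResponse(status_code=200, content={"ready": "ok"})
    # 503 Service Unavailable is >= 400, so Kubernetes treats it as failure.
    return JSONResponse(status_code=503, content={"ready": "ko"})
```

Note that returning a `JSONResponse` directly bypasses `response_model` validation in FastAPI, which is why the decorator here omits it.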



@router.post("/tokenize")
async def tokenize(request: TokenizeRequest):
28 changes: 28 additions & 0 deletions vllm/entrypoints/openai/protocol.py
@@ -720,3 +720,31 @@ class DetokenizeRequest(OpenAIBaseModel):

class DetokenizeResponse(OpenAIBaseModel):
    prompt: str

class LivenessResponse(OpenAIBaseModel):
    """Return object for liveness probe"""

    alive: str = Field(None, title="Alive message")
    model_config = {
        "json_schema_extra": {
            "examples": [{
                "alive": "ok"
            }]
        }
    }


class ReadinessResponse(OpenAIBaseModel):
    """Return object for readiness probe"""

    ready: str = Field(None, title="Ready message")
    model_config = {
        "json_schema_extra": {
            "examples": [{
                "ready": "ok"
            }]
        }
    }
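
As a quick sanity check of the new endpoints, a hedged smoke test against a locally running server. The host and port are assumptions (vLLM's usual default), and the expected bodies follow the example schemas above:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed server address

resp = requests.get(f"{BASE_URL}/liveness", timeout=5)
print(resp.status_code, resp.json())  # expected: 200 {'alive': 'ok'}

resp = requests.get(f"{BASE_URL}/readiness", timeout=5)
# Expected: 200 {'ready': 'ok'} once weights are loaded; per the review
# discussion, a not-yet-ready server should return a non-2xx status instead.
print(resp.status_code, resp.json())
```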