
[Feature]: Improving Retry Mechanism Consistency and Logging for Streamed Responses in LiteLLM Proxy #8648

Open
fengjiajie opened this issue Feb 19, 2025 · 0 comments
Labels
enhancement New feature or request

The Feature

I would greatly appreciate it if the following improvements could be considered:

  1. Improved Logging for Streamed Errors: For errors encountered in streaming mode, could the logging be made as user-friendly as it is in the non-streaming case? Displaying a clear error message and an indication of whether a retry will be attempted (e.g. "Retrying request with num_retries: X"), instead of a full Python stack trace, would significantly improve the debugging experience.

  2. Consistent Retry Behavior: If an LLM call fails with a retryable error (like a 429) before any data has been streamed to the client, would it be possible for LiteLLM to initiate a retry, just as it does for non-streaming requests? This would provide a more consistent and robust user experience.
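To make the second point concrete, here is a rough sketch of the behavior I have in mind, written from the caller's side rather than as a proposal for the router's internals (the helper name, backoff, and error handling are purely illustrative):

```python
# Rough sketch only: retry a streaming completion, but only while nothing has
# been yielded to the caller yet. Names and backoff are illustrative, not a
# proposal for how LiteLLM's router should implement this internally.
import asyncio

import litellm


async def stream_with_early_retries(num_retries: int = 3, **completion_kwargs):
    """Yield chunks from a streaming acompletion call, retrying the whole
    request only if it fails before the first chunk reaches the caller."""
    for attempt in range(num_retries + 1):
        sent_any_chunk = False
        try:
            response = await litellm.acompletion(stream=True, **completion_kwargs)
            async for chunk in response:
                sent_any_chunk = True
                yield chunk
            return
        except litellm.RateLimitError:
            if sent_any_chunk or attempt == num_retries:
                # Data already reached the caller (or retries are exhausted),
                # so surface the error instead of silently restarting the stream.
                raise
            # Nothing was streamed yet, so it is safe to retry the request,
            # just as the router already does for non-streaming calls.
            await asyncio.sleep(2 ** attempt)
```

The key property is that a 429 raised before the first chunk (as in my logs below) gets retried exactly like a non-streaming request, while an error in the middle of an already-started stream is still surfaced to the client.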

Thank you again for your time and consideration. I believe these changes would make LiteLLM even more resilient and easier to use, especially when working with models that have strict rate limits.

Motivation, pitch

Hello LiteLLM team,

First of all, thank you for developing and maintaining this useful library!

I'm currently using LiteLLM Proxy with the Gemini model (gemini/gemini-2.0-pro-exp-02-05). Due to the low rate limits and experimental nature of this model on Google's Vertex AI, I frequently encounter 429 errors. I've configured retries in LiteLLM, but I've observed inconsistent behavior in how retries are handled, specifically when dealing with streaming responses.
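For reference, my retry configuration looks roughly like the following (simplified; `num_retries: 3` matches the "Retrying request with num_retries: 3" line in the logs below, and the exact keys in my real config may differ slightly):

```yaml
model_list:
  - model_name: gemini-2.0-pro-exp
    litellm_params:
      model: gemini/gemini-2.0-pro-exp-02-05

router_settings:
  num_retries: 3   # honored for non-streaming requests, per the logs below
```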

Observed Behavior:

  • Successful Retry (Non-streaming): When a non-streaming request encounters a 429 error, LiteLLM correctly initiates retries, as shown in the logs:

    09:51:13 - LiteLLM Router:INFO: router.py:983 - litellm.acompletion(model=gemini/gemini-2.0-pro-exp-02-05) Exception litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
      "error": {
        "code": 429,
        "message": "Resource has been exhausted (e.g. check quota).",
        "status": "RESOURCE_EXHAUSTED"
      }
    }
    
    09:51:13 - LiteLLM Router:INFO: router.py:3151 - Retrying request with num_retries: 3
    
  • No Retry (Streaming): When a streaming request encounters a 429 error before any data has been sent to the client, the retry mechanism does not seem to be triggered. Instead, a lengthy Python stack trace is logged, making it difficult to quickly identify the issue:

    09:49:32 - LiteLLM Proxy:ERROR: proxy_server.py:3038 - litellm.proxy.proxy_server.async_data_generator(): Exception occured - litellm.APIConnectionError: APIConnectionError: OpenAIException - litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n  "error": {\n    "code": 429,\n    "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n    "status": "RESOURCE_EXHAUSTED"\n  }\n}\n'
    Traceback (most recent call last):
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1545, in __anext__
        async for chunk in self.completion_stream:
        ...<50 lines>...
            return processed_chunk
      File "/usr/lib/python3.13/site-packages/openai/_streaming.py", line 147, in __aiter__
        async for item in self._iterator:
            yield item
      File "/usr/lib/python3.13/site-packages/openai/_streaming.py", line 174, in __stream__
        raise APIError(
        ...<3 lines>...
        )
    openai.APIError: litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n  "error": {\n    "code": 429,\n    "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n    "status": "RESOURCE_EXHAUSTED"\n  }\n}\n'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3017, in async_data_generator
        async for chunk in response:
        ...<14 lines>...
                yield f"data: {str(e)}\n\n"
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1700, in __anext__
        raise exception_type(
              ~~~~~~~~~~~~~~^
            model=self.model,
            ^^^^^^^^^^^^^^^^^
        ...<3 lines>...
            extra_kwargs={},
            ^^^^^^^^^^^^^^^^
        )
        ^
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2206, in exception_type
        raise e  # it's already mapped
        ^^^^^^^
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 462, in exception_type
        raise APIConnectionError(
        ...<7 lines>...
        )
    litellm.exceptions.APIConnectionError: litellm.APIConnectionError: APIConnectionError: OpenAIException - litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n  "error": {\n    "code": 429,\n    "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n    "status": "RESOURCE_EXHAUSTED"\n  }\n}\n'
    

    It took me some time (partly due to my limited familiarity with Python) to realize that whether a retry happened at all depended on whether the request was streaming; a minimal sketch of how I hit both paths is below.
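The endpoint, key, and model name here are placeholders for my actual setup; both calls go through the LiteLLM Proxy:

```python
# Placeholder endpoint/key/model; both requests are sent to the LiteLLM Proxy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

# Non-streaming: when Vertex AI returns a 429, the router retries it
# (the "Retrying request with num_retries: 3" log above).
resp = client.chat.completions.create(
    model="gemini/gemini-2.0-pro-exp-02-05",
    messages=[{"role": "user", "content": "Hello"}],
)

# Streaming: the same 429, raised before any chunk arrives, surfaces as the
# APIConnectionError traceback above instead of being retried.
stream = client.chat.completions.create(
    model="gemini/gemini-2.0-pro-exp-02-05",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```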

Are you a ML Ops Team?

No

Twitter / LinkedIn details

No response

@fengjiajie fengjiajie added the enhancement New feature or request label Feb 19, 2025