
[Feature]: Improving Retry Mechanism Consistency and Logging for Streamed Responses in LiteLLM Proxy #8648

Open
fengjiajie opened this issue Feb 19, 2025 · 0 comments
Labels
enhancement New feature or request

The Feature

I would greatly appreciate it if the following improvements could be considered:

  1. Improved Logging for Streamed Errors: For errors encountered in streaming mode, could the logging be made as user-friendly as it is in the non-streaming case? Displaying a clear error message and an indication of whether a retry will be attempted (e.g. "Retrying request with num_retries: X"), instead of a full Python stack trace, would significantly improve the debugging experience.

  2. Consistent Retry Behavior: If an LLM call fails with a retryable error (like a 429) before any data has been streamed to the client, would it be possible for LiteLLM to initiate a retry, just as it does for non-streaming requests? This would provide a more consistent and robust user experience.
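To make the second point concrete, here is a rough sketch of the behavior I have in mind, written from the caller's side rather than as a proposal for the router's internals (the helper name, backoff, and error handling are purely illustrative):

```python
# Rough sketch only: retry a streaming completion, but only while nothing has
# been yielded to the caller yet. Names and backoff are illustrative, not a
# proposal for how LiteLLM's router should implement this internally.
import asyncio

import litellm


async def stream_with_early_retries(num_retries: int = 3, **completion_kwargs):
    """Yield chunks from a streaming acompletion call, retrying the whole
    request only if it fails before the first chunk reaches the caller."""
    for attempt in range(num_retries + 1):
        sent_any_chunk = False
        try:
            response = await litellm.acompletion(stream=True, **completion_kwargs)
            async for chunk in response:
                sent_any_chunk = True
                yield chunk
            return
        except litellm.RateLimitError:
            if sent_any_chunk or attempt == num_retries:
                # Data already reached the caller (or retries are exhausted),
                # so surface the error instead of silently restarting the stream.
                raise
            # Nothing was streamed yet, so it is safe to retry the request,
            # just as the router already does for non-streaming calls.
            await asyncio.sleep(2 ** attempt)
```

The key property is that a 429 raised before the first chunk (as in my logs below) gets retried exactly like a non-streaming request, while an error in the middle of an already-started stream is still surfaced to the client.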

Thank you again for your time and consideration. I believe these changes would make LiteLLM even more resilient and easier to use, especially when working with models that have strict rate limits.

Motivation, pitch

Hello LiteLLM team,

First of all, thank you for developing and maintaining this useful library!

I'm currently using LiteLLM Proxy with the Gemini model (gemini/gemini-2.0-pro-exp-02-05). Due to the low rate limits and experimental nature of this model on Google's Vertex AI, I frequently encounter 429 errors. I've configured retries in LiteLLM, but I've observed inconsistent behavior in how retries are handled, specifically when dealing with streaming responses.
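For reference, my retry configuration looks roughly like the following (simplified; `num_retries: 3` matches the "Retrying request with num_retries: 3" line in the logs below, and the exact keys in my real config may differ slightly):

```yaml
model_list:
  - model_name: gemini-2.0-pro-exp
    litellm_params:
      model: gemini/gemini-2.0-pro-exp-02-05

router_settings:
  num_retries: 3   # honored for non-streaming requests, per the logs below
```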

Observed Behavior:

  • Successful Retry (Non-streaming): When a non-streaming request encounters a 429 error, LiteLLM correctly initiates retries, as shown in the logs:

    09:51:13 - LiteLLM Router:INFO: router.py:983 - litellm.acompletion(model=gemini/gemini-2.0-pro-exp-02-05) Exception litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
      "error": {
        "code": 429,
        "message": "Resource has been exhausted (e.g. check quota).",
        "status": "RESOURCE_EXHAUSTED"
      }
    }
    
    09:51:13 - LiteLLM Router:INFO: router.py:3151 - Retrying request with num_retries: 3
    
  • No Retry (Streaming): When a streaming request encounters a 429 error before any data has been sent to the client, the retry mechanism does not seem to be triggered. Instead, a lengthy Python stack trace is logged, making it difficult to quickly identify the issue:

    09:49:32 - LiteLLM Proxy:ERROR: proxy_server.py:3038 - litellm.proxy.proxy_server.async_data_generator(): Exception occured - litellm.APIConnectionError: APIConnectionError: OpenAIException - litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n  "error": {\n    "code": 429,\n    "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n    "status": "RESOURCE_EXHAUSTED"\n  }\n}\n'
    Traceback (most recent call last):
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1545, in __anext__
        async for chunk in self.completion_stream:
        ...<50 lines>...
            return processed_chunk
      File "/usr/lib/python3.13/site-packages/openai/_streaming.py", line 147, in __aiter__
        async for item in self._iterator:
            yield item
      File "/usr/lib/python3.13/site-packages/openai/_streaming.py", line 174, in __stream__
        raise APIError(
        ...<3 lines>...
        )
    openai.APIError: litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n  "error": {\n    "code": 429,\n    "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n    "status": "RESOURCE_EXHAUSTED"\n  }\n}\n'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3017, in async_data_generator
        async for chunk in response:
        ...<14 lines>...
                yield f"data: {str(e)}\n\n"
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1700, in __anext__
        raise exception_type(
              ~~~~~~~~~~~~~~^
            model=self.model,
            ^^^^^^^^^^^^^^^^^
        ...<3 lines>...
            extra_kwargs={},
            ^^^^^^^^^^^^^^^^
        )
        ^
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2206, in exception_type
        raise e  # it's already mapped
        ^^^^^^^
      File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 462, in exception_type
        raise APIConnectionError(
        ...<7 lines>...
        )
    litellm.exceptions.APIConnectionError: litellm.APIConnectionError: APIConnectionError: OpenAIException - litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n  "error": {\n    "code": 429,\n    "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n    "status": "RESOURCE_EXHAUSTED"\n  }\n}\n'
    

    It took me some time (partly due to my limited familiarity with Python) to realize that whether a retry happened at all depended on whether the request was streaming; a minimal sketch of how I hit both paths is below.
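The endpoint, key, and model name here are placeholders for my actual setup; both calls go through the LiteLLM Proxy:

```python
# Placeholder endpoint/key/model; both requests are sent to the LiteLLM Proxy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

# Non-streaming: when Vertex AI returns a 429, the router retries it
# (the "Retrying request with num_retries: 3" log above).
resp = client.chat.completions.create(
    model="gemini/gemini-2.0-pro-exp-02-05",
    messages=[{"role": "user", "content": "Hello"}],
)

# Streaming: the same 429, raised before any chunk arrives, surfaces as the
# APIConnectionError traceback above instead of being retried.
stream = client.chat.completions.create(
    model="gemini/gemini-2.0-pro-exp-02-05",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```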

Are you a ML Ops Team?

No

Twitter / LinkedIn details

No response

@fengjiajie fengjiajie added the enhancement New feature or request label Feb 19, 2025