The Feature
I would greatly appreciate it if the following improvements could be considered:
Improved Logging for Streamed Errors: For errors encountered in streaming mode, could the logging be made more user-friendly, similar to the non-streaming case? Logging a clear error message and an indication of whether a retry will be attempted (like "Retrying request with num_retries: X"), instead of a full Python stack trace, would significantly improve the debugging experience.
Consistent Retry Behavior: If an LLM call fails with a retryable error (like a 429) before any data has been streamed to the client, would it be possible for LiteLLM to initiate a retry, just as it does for non-streaming requests? This would provide a more consistent and robust user experience.
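To illustrate the second point, here is a rough client-side sketch of the behavior I have in mind, written against the litellm Python SDK (litellm.acompletion, litellm.RateLimitError, litellm.APIConnectionError). The helper name, backoff values, and the broad exception handling are only illustrative; this is an approximation of what the proxy could do internally, not a proposed implementation:

import asyncio
import logging
import litellm

log = logging.getLogger(__name__)

async def stream_with_retry(messages, model="gemini/gemini-2.0-pro-exp-02-05",
                            num_retries=3, backoff_s=2.0):
    # Retry a streaming completion, but only while nothing has been forwarded to the caller.
    for attempt in range(num_retries + 1):
        sent_first_chunk = False
        try:
            response = await litellm.acompletion(model=model, messages=messages, stream=True)
            async for chunk in response:
                sent_first_chunk = True
                yield chunk
            return  # stream finished normally
        except (litellm.RateLimitError, litellm.APIConnectionError):
            # In practice the 429 can surface wrapped in an APIConnectionError,
            # as in the stack trace quoted below.
            if sent_first_chunk or attempt == num_retries:
                raise  # too late (or out of retry budget) to retry transparently
            log.warning("Rate limited before first chunk; retrying (attempt %d of %d)",
                        attempt + 1, num_retries)
            await asyncio.sleep(backoff_s * (attempt + 1))

Calling code would simply do async for chunk in stream_with_retry(messages): ... and never notice the retried attempt, because nothing has been sent to the client yet when the retry happens.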
Thank you again for your time and consideration. I believe these changes would make LiteLLM even more resilient and easier to use, especially when working with models that have strict rate limits.
Motivation, pitch
Hello LiteLLM team,
First of all, thank you for developing and maintaining this useful library!
I'm currently using LiteLLM Proxy with the Gemini model (gemini/gemini-2.0-pro-exp-02-05). Due to the low rate limits and experimental nature of this model on Google's Vertex AI, I frequently encounter 429 errors. I've configured retries in LiteLLM, but I've observed inconsistent behavior in how retries are handled, specifically when dealing with streaming responses.
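For context, the retry setting I'm referring to is the standard option in the proxy's config.yaml, roughly like the following (the model alias and the num_retries value are illustrative, not my exact configuration):

model_list:
  - model_name: gemini-2.0-pro-exp           # illustrative alias
    litellm_params:
      model: gemini/gemini-2.0-pro-exp-02-05

litellm_settings:
  num_retries: 3   # retry failed calls up to 3 times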
Observed Behavior:
Successful Retry (Non-streaming): When a non-streaming request encounters a 429 error, LiteLLM correctly initiates retries and logs a clear message along the lines of "Retrying request with num_retries: X".
No Retry (Streaming): When a streaming request encounters a 429 error before any data has been sent to the client, the retry mechanism does not seem to be triggered. Instead, a lengthy Python stack trace is logged, making it difficult to quickly identify the issue:
09:49:32 - LiteLLM Proxy:ERROR: proxy_server.py:3038 - litellm.proxy.proxy_server.async_data_generator(): Exception occured - litellm.APIConnectionError: APIConnectionError: OpenAIException - litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n "error": {\n "code": 429,\n "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n "status": "RESOURCE_EXHAUSTED"\n }\n}\n'
Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1545, in __anext__
    async for chunk in self.completion_stream:
    ...<50 lines>...
        return processed_chunk
  File "/usr/lib/python3.13/site-packages/openai/_streaming.py", line 147, in __aiter__
    async for item in self._iterator:
        yield item
  File "/usr/lib/python3.13/site-packages/openai/_streaming.py", line 174, in __stream__
    raise APIError(
    ...<3 lines>...
    )
openai.APIError: litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n "error": {\n "code": 429,\n "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n "status": "RESOURCE_EXHAUSTED"\n }\n}\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3017, in async_data_generator
    async for chunk in response:
    ...<14 lines>...
        yield f"data: {str(e)}\n\n"
  File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1700, in __anext__
    raise exception_type(
          ~~~~~~~~~~~~~~^
        model=self.model,
        ^^^^^^^^^^^^^^^^^
    ...<3 lines>...
        extra_kwargs={},
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2206, in exception_type
    raise e # it's already mapped
    ^^^^^^^
  File "/usr/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 462, in exception_type
    raise APIConnectionError(
    ...<7 lines>...
    )
litellm.exceptions.APIConnectionError: litellm.APIConnectionError: APIConnectionError: OpenAIException - litellm.RateLimitError: litellm.RateLimitError: VertexAIException - b'{\n "error": {\n "code": 429,\n "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",\n "status": "RESOURCE_EXHAUSTED"\n }\n}\n'
It took me some time (partially due to my limited familiarity with Python) to realize that the difference between successful and unsuccessful retries was related to whether the request was streaming or not.
Are you a ML Ops Team?
No
Twitter / LinkedIn details
No response