🎅 I WISH LITELLM HAD... #361

Open
krrishdholakia opened this issue Sep 13, 2023 · 211 comments

@krrishdholakia
Contributor

krrishdholakia commented Sep 13, 2023

This is a ticket to track a wishlist of items you wish LiteLLM had.

COMMENT BELOW 👇

With your request 🔥 - if we have any questions, we'll follow up in comments / via DMs

Respond with ❤️ to any request you would also like to see

P.S.: Come say hi 👋 on the Discord

@krrishdholakia krrishdholakia pinned this issue Sep 13, 2023
@krrishdholakia
Contributor Author

[LiteLLM Client] Add new models via UI

Thinking aloud, it seems intuitive that you'd be able to add new models / remap completion calls to different models via the UI. I'm unsure of the real underlying problem, though.

@krrishdholakia
Contributor Author

User / API Access Management

Different users have access to different models. It'd be helpful if there were a way to leverage the BudgetManager to gate access. E.g., GPT-4 is expensive; I don't want to expose it to my free users, but I do want my paid users to be able to use it.
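
A minimal sketch of the kind of per-tier gating being described, assuming an application-side tier-to-models map (the ALLOWED_MODELS dict and gated_completion helper are hypothetical, not LiteLLM APIs):

from litellm import completion

# Hypothetical application-side access map: which models each tier may call.
ALLOWED_MODELS = {
    "free": {"gpt-3.5-turbo"},
    "paid": {"gpt-3.5-turbo", "gpt-4"},
}

def gated_completion(user_tier: str, model: str, messages: list):
    """Only forward the request if the user's tier is allowed to call the model."""
    if model not in ALLOWED_MODELS.get(user_tier, set()):
        raise PermissionError(f"Tier '{user_tier}' may not call '{model}'")
    return completion(model=model, messages=messages)

# A free user would be blocked from GPT-4; a paid user is not.
response = gated_completion("paid", "gpt-4", [{"role": "user", "content": "Hello"}])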

@krrishdholakia
Contributor Author

krrishdholakia commented Sep 13, 2023

cc: @yujonglee @WilliamEspegren @zakhar-kogan @ishaan-jaff @PhucTranThanh feel free to add any requests / ideas here.

@ishaan-jaff
Contributor

ishaan-jaff commented Sep 13, 2023

[Spend Dashboard] View analytics for spend per llm and per user

  • This allows me to see which of my LLMs are most expensive and which users are using LiteLLM heavily

@ishaan-jaff
Contributor

Auto select the best LLM for a given task

If it's a simple task like responding to "hello", LiteLLM should auto-select a cheaper but faster LLM like j2-light.
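
A rough sketch of the heuristic being wished for, assuming prompt length as the "task difficulty" signal (the cutoff and model names are illustrative; this is not an existing LiteLLM feature):

from litellm import completion, token_counter

# Illustrative heuristic: short prompts go to a cheap, fast model,
# longer prompts go to a stronger model. The cutoff is arbitrary.
CHEAP_MODEL = "j2-light"
STRONG_MODEL = "gpt-4"

def auto_select_completion(messages: list, cutoff_tokens: int = 50):
    prompt_tokens = token_counter(model=STRONG_MODEL, messages=messages)
    model = CHEAP_MODEL if prompt_tokens < cutoff_tokens else STRONG_MODEL
    return completion(model=model, messages=messages)

response = auto_select_completion([{"role": "user", "content": "hello"}])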

@Pipboyguy

Integration with NLP Cloud

@krrishdholakia
Contributor Author

That's awesome @Pipboyguy - DMing you on LinkedIn to learn more!

@krrishdholakia krrishdholakia changed the title LiteLLM Wishlist 🎅 I WISH LITELLM ADDED... Sep 14, 2023
@krrishdholakia krrishdholakia changed the title 🎅 I WISH LITELLM ADDED... 🎅 I WISH LITELLM HAD... Sep 14, 2023
@krrishdholakia
Contributor Author

krrishdholakia commented Sep 14, 2023

@ishaan-jaff check out this truncate param in the Cohere API.

This looks super interesting - similar to your token trimmer: if the prompt exceeds the context window, trim it in a particular manner.

[Screenshot: Cohere's truncate parameter documentation]

I would maybe only run trimming on user/assistant messages and not touch the system prompt (this works for RAG scenarios as well).
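
A minimal sketch of that idea, assuming litellm.token_counter for measuring the prompt and an arbitrary token budget (the trim_keep_system helper is illustrative, not LiteLLM's built-in trimmer):

from litellm import token_counter

def trim_keep_system(messages: list, model: str, max_tokens: int = 4000) -> list:
    """Drop the oldest user/assistant messages until the prompt fits;
    never touch system messages (useful for RAG-style system prompts)."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    convo = [m for m in messages if m["role"] != "system"]

    while convo and token_counter(model=model, messages=system_msgs + convo) > max_tokens:
        convo.pop(0)  # drop the oldest user/assistant turn

    return system_msgs + convo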

@haseeb-heaven
Contributor

Option to use the Inference API so we can use any model from Hugging Face 🤗

@krrishdholakia
Contributor Author

krrishdholakia commented Sep 17, 2023

@haseeb-heaven you can already do this -

completion_url = f"https://api-inference.huggingface.co/models/{model}"

from litellm import completion 
response = completion(model="huggingface/gpt2", messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response) 

@haseeb-heaven
Contributor

> @haseeb-heaven you can already do this -
>
> completion_url = f"https://api-inference.huggingface.co/models/{model}"
>
> from litellm import completion
> response = completion(model="huggingface/gpt2", messages=[{"role": "user", "content": "Hey, how's it going?"}])
> print(response)

Wow, great, thanks! It's working. Nice feature.

@smig23

smig23 commented Sep 18, 2023

Support for inferencing using models hosted on Petals swarms (https://github.com/bigscience-workshop/petals), both public and private.

@ishaan-jaff
Contributor

@smig23 what are you trying to use Petals for? We found it to be quite unstable, and it would not consistently pass our tests.

@shauryr
Contributor

shauryr commented Sep 18, 2023

A fine-tuning wrapper for OpenAI, Hugging Face, etc.

@krrishdholakia
Contributor Author

@shauryr I created an issue to track this - feel free to add any missing details here.

@smig23

smig23 commented Sep 18, 2023

> @smig23 what are you trying to use Petals for? We found it to be quite unstable, and it would not consistently pass our tests.

Specifically, for my aims, I'm running a private swarm as an experiment, with a view to implementing it within a private organization that has idle but distributed GPU resources. The initial target would be inferencing, and if LiteLLM were able to be the abstraction layer, it would allow the flexibility to go in another direction with hosting in the future.

@ranjancse26

I wish LiteLLM had direct support for fine-tuning models. Based on the links below, I understand that in order to fine-tune, one needs a specific understanding of the LLM provider and then has to follow their instructions or library for fine-tuning the model. Why couldn't LiteLLM do all of that abstraction and handle the fine-tuning aspects as well?

https://docs.litellm.ai/docs/tutorials/finetuned_chat_gpt
https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

@ranjancse26

I wish LiteLLM had support for open-source embeddings like sentence-transformers, hkunlp/instructor-large, etc.

Based on the documentation below, it seems there's currently only support for the OpenAI embeddings.

https://docs.litellm.ai/docs/embedding/supported_embedding
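
A sketch of the kind of unified call being wished for, modeled on litellm.embedding's OpenAI-style interface (the "huggingface/sentence-transformers/..." model string is hypothetical wishlist syntax, not a documented provider route):

from litellm import embedding

# Hypothetical wishlist syntax: route an open-source sentence-transformers
# model through the same embedding() call used for OpenAI models today.
response = embedding(
    model="huggingface/sentence-transformers/all-MiniLM-L6-v2",
    input=["good morning from litellm"],
)
print(response)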

@ranjancse26

I wish LiteLLM had an integration with the Cerebrium platform. Please check the link below for the prebuilt models.

https://docs.cerebrium.ai/cerebrium/prebuilt-models

@ishaan-jaff
Contributor

@ranjancse26 what models on Cerebrium do you want to use with LiteLLM?

@ranjancse26

@ishaan-jaff Cerebrium has a lot of pre-built models. The focus should be on consuming the open-source models first, e.g. Llama 2, GPT4All, Falcon, FlanT5, etc. I am mentioning this as a first step. However, it's also a good idea to have LiteLLM take care of the internal communication with custom-built models, based on the API that Cerebrium exposes.

@ishaan-jaff
Contributor

@smig23 We've added support for Petals to LiteLLM: https://docs.litellm.ai/docs/providers/petals

@ranjancse26

I wish LiteLLM had built-in support for the majority of provider operations rather than targeting text generation alone. Take Cohere as an example: the endpoint below allows users to have conversations with a Large Language Model (LLM) from Cohere.

https://docs.cohere.com/reference/post_chat

@ranjancse26

I wish LiteLLM had extensive support and examples for users developing apps with the RAG pattern. Following the standard best practices is practically mandatory, and we would all like to have that support.

@ranjancse26

I wish LiteLLM had use-case-driven examples for beginners. Keeping day-to-day use cases in mind, it would be a good idea to come up with a great set of samples covering the following aspects:

  • Text classification
  • Text summarization
  • Text translation
  • Text generation
  • Code generation

@ranjancse26

I wish LiteLLM supported the various well-known or popular vector DBs. Here are a couple of them to begin with:

  • Pinecone
  • Qdrant
  • Weaviate
  • Milvus
  • DuckDB
  • SQLite

@ranjancse26

ranjancse26 commented Sep 21, 2023

I wish LiteLLM had built-in support for web scraping or fetching real-time data using a known provider like SerpApi. It would be helpful for users building custom AI models or integrating with LLMs for retrieval-augmented generation.

https://serpapi.com/blog/llms-vs-serpapi/#serpapi-google-local-results-parser
https://colab.research.google.com/drive/1Q9VvVzjZJja7_y2Ls8qBkE_NApbLiqly?usp=sharing

@chukfinley

Support for Awanllm

@pbasov

pbasov commented Sep 12, 2024

SSO in the open-source version. It's a core feature.
https://sso.tax/

@gerasdf

gerasdf commented Sep 13, 2024

You are fast! :-) I found that it already has this, thanks! :-)

For anybody looking into how to use images, here's an example (more on OpenAI's documentation). I've tried it with OpenAI and Anthropic, and it works.

from litellm import completion

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://framerusercontent.com/images/DbFHR1t71NpmaSxIqej1LKWzD4.png?scale-down-to=512",
            },
        },
    ],
}]

response = completion(model="claude-3-haiku-20240307", messages=messages)
print(response.choices[0].message.content)

This image appears to be a diagram or schematic related to something called "LiteLLM". It shows a central blue circle or shape representing "LiteLLM", with various other shapes and logos branching out from it, including the letters "AAA", "co:here", and "PaLM 2". The diagram seems to illustrate connections or relationships between LiteLLM and these other entities, though I cannot determine the exact nature of those connections based solely on the visual information provided.

Already fulfilled wish:

I wish LiteLLM had support for completions with image INPUT, as OpenAI and Anthropic support

https://docs.anthropic.com/en/docs/build-with-claude/vision
https://platform.openai.com/docs/guides/vision

I'm only using LangChain for the adaptation layer... the lack of this feature kills me. I'm willing to implement it, but I'd need some sort of guidance (as to what is expected in terms of functionality).

@isriam

isriam commented Sep 17, 2024

Any plans to support model lists from, say, Ollama or OpenAI?

import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

models = client.models.list()
print(models)
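
For what it's worth, a LiteLLM proxy can already be queried the same way through its OpenAI-compatible endpoint; a sketch assuming a proxy running locally on port 4000 with a virtual key (the URL and key are placeholders):

from openai import OpenAI

# Point the OpenAI client at an assumed local LiteLLM proxy; its OpenAI-compatible
# /models route returns the model list configured on the proxy.
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-my-litellm-key")

for model in client.models.list():
    print(model.id)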

@krrishdholakia
Contributor Author

@yigitkonur

For speech-to-text, the fastest inference is fal.ai's Whisper. It would be great to see fal.ai supported in LiteLLM.

@yigitkonur

I love LiteLLM, but I have some criticism of the docs.

The docs are clear, but they could be improved. Have you considered AI-powered documentation?

For example, Mintlify offers interesting AI features. You could explore similar options to enhance the docs. An LLM-based question-answering system would help users find info more easily, especially for complex features. Any thoughts on upgrading to a more interactive, user-friendly platform?

I built a CustomGPT for this purpose, but I'm sure you can build something better than that: https://chatgpt.com/g/g-fDpe7KD7E-litellm

@royherma

royherma commented Oct 1, 2024

Can you add support for Opik (https://github.com/comet-ml/opik)? I switched to it from Phoenix and like it much, much more. It's fully open source but actually has a real UI (not in a notebook). I'm manually instrumenting my code now, but it would be amazing if LiteLLM had a native integration.

@dsblank
Contributor

dsblank commented Oct 1, 2024

There is a PR for supporting Opik: #5680

@denisergashbaev

I would like to have a throttler that blocks requests to an LLM deployment if the rate limit is hit. This should be managed by global state (Redis or something) so that if multiple people are using the same deployment via their own copy of the program, their requests are throttled to respect the rate limit. By throttling I mean queueing: forcing requests to wait before they are sent.

I am aware of the discussion in #4510. However, I would want to see not just a load balancer (with a rate-limit-based strategy) but a real queueing mechanism for when no deployment has free capacity.
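
For illustration, a minimal single-process sketch of that queueing behavior using an asyncio token bucket around litellm.acompletion (a shared implementation would keep the bucket state in Redis; the RateLimitQueue class and the 60 RPM figure are assumptions, not LiteLLM features):

import asyncio
import time

from litellm import acompletion

class RateLimitQueue:
    """Queue calls so that at most `rpm` requests are dispatched per minute."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.sent_timestamps: list[float] = []
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self.lock:
                now = time.monotonic()
                # Keep only requests sent within the last 60 seconds.
                self.sent_timestamps = [t for t in self.sent_timestamps if now - t < 60]
                if len(self.sent_timestamps) < self.rpm:
                    self.sent_timestamps.append(now)
                    return
                wait = 60 - (now - self.sent_timestamps[0])
            await asyncio.sleep(wait)  # wait in the queue instead of failing

queue = RateLimitQueue(rpm=60)

async def throttled_completion(**kwargs):
    await queue.acquire()
    return await acompletion(**kwargs)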

@krrishdholakia
Contributor Author

@denisergashbaev this exists - https://docs.litellm.ai/docs/scheduler

@denisergashbaev

> @denisergashbaev this exists - https://docs.litellm.ai/docs/scheduler

Hello Krish. I looked at the document. I actually do not want prioritization; I want queueing if the rate limit is reached.

@krrishdholakia
Contributor Author

krrishdholakia commented Oct 1, 2024

Hey @denisergashbaev, the way it's implemented, it does the queuing too - it'll keep polling to check if a deployment is healthy unless the request times out / exceeds max retries -

## ADDS REQUEST TO QUEUE ##

How could our docs have been clearer here?

@rodion-m
Copy link

rodion-m commented Oct 2, 2024

Auto-parsing of OpenRouter's models and pricing from https://openrouter.ai/models and auto-updating of the file https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json

It should be easy.

@yigitkonur

Do you have any plans to support the Vercel AI SDK's stream protocol? It is very useful for most companies, and using the OpenAI streaming approach limits users for tool usage, generative UI, and a lot more.

https://sdk.vercel.ai/docs/ai-sdk-ui/stream-protocol#data-stream-protocol

@GildeshAbhay

Better integration with Langfuse's prompt management.
I am not able to use "langfuse_prompt" with LiteLLM.

@krrishdholakia
Contributor Author

@yigitkonur streaming with the Vercel SDK works with their OpenAI integration currently

@GildeshAbhay replied on the issue you created - sample code for how you'd want this to work would be helpful

@WissamAntoun

Support for the reranker API of Hugging Face's Text Embeddings Inference

@wesleyearlstander

I wish LiteLLM had module federation. With the fast-approaching era of real-time AI, loading only the necessary provider packages will be crucial to keeping system latency low.

@databill86

Feature Request: Request Throttling/Queueing for Rate Limit Management

Related to @denisergashbaev's comments here and here, which perfectly describe this need.

Desired Functionality

+1 to the request for a global throttling mechanism with queuing. To expand on @denisergashbaev's description:

  • When a deployment is approaching its rate limit (RPM/TPM)
  • Instead of failing or routing to another deployment
  • The requests should be queued and processed in order
  • Each request waits until it can be safely sent without exceeding the rate limit
  • Using a global state (e.g., Redis) to coordinate across multiple instances

Current Solutions vs Desired Behavior

Current: Request Prioritization

The current priority queue implementation (docs) focuses on prioritizing between requests but does not prevent rate limit errors. If there's only one deployment, requests will still fail when hitting rate limits rather than being queued.

Current: Usage-based Routing

The current routing strategy (docs) helps distribute load across multiple deployments but doesn't solve the fundamental issue of managing rate limits through queuing.

Example Use Case

from litellm import Router

router = Router(
    model_list=[{
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "openai-tier1",
            "rpm": 60,  # 60 requests per minute
        },
    }]
)

Desired behavior: If 100 requests come in within a minute

  • First 60 requests process normally
  • Next 40 requests are queued
  • Queued requests automatically process as capacity becomes available
  • No rate limit errors are thrown

Benefits

  1. More reliable systems - no rate limit errors
  2. Simpler implementation for users - no need to handle rate limit errors
  3. Works even with single deployment scenarios
  4. Prevents rate limit exhaustion: Currently, simple retries (like RateLimitErrorRetries) can actually worsen the situation by accumulating failed requests and further exhausting rate limits. Even with router_settings like retry_after or timeout, we can still have these problems. A proper queuing system would handle this gracefully, ensuring failed requests don't compound the rate limit problem.

This feature would be incredibly valuable for the community, as evidenced by multiple users requesting similar functionality. LiteLLM is already an amazing tool for LLM deployment management; I'm still testing it in multiple scenarios, but I think this addition would make it even more robust for production use cases.

@krrishdholakia
Contributor Author

@databill86 requests which fail due to rate limit errors are kept in queue and retried until the timeout for the request is hit

@databill86

> @databill86 requests which fail due to rate limit errors are kept in queue and retried until the timeout for the request is hit

Thanks for the response! However, there's a crucial distinction to make here.

The current retry mechanism can actually worsen rate limit issues, particularly with OpenAI:

  1. Failed requests count against limits: As per OpenAI's documentation, unsuccessful requests still contribute to your per-minute limit. Simply retrying failed requests (even from a queue) will:

    • Count against your rate limits
    • Potentially trigger more rate limits
    • Leave you unable to process any requests for a longer period
  2. What we need instead:

    • A global state tracking current TPM/RPM usage
    • Process requests from queue only when we know capacity is available
    • Intelligent throttling when multiple requests arrive during wait periods
    • Avoid sending all queued requests at once when capacity becomes available

The key difference is proactive vs reactive handling:

  • Current: Reactive - Wait for failures, then retry (which counts against limits)
  • Needed: Proactive - Track usage and only send requests when we know they won't exceed limits

This would provide much better resource utilization and prevent the "cascade effect" where retries compound the rate limit problem.
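
As a rough illustration of the "proactive" approach, here is a sketch of a shared Redis counter checked before each dispatch (assuming redis-py, a single deployment, and an arbitrary 60 RPM budget; this is not an existing LiteLLM feature):

import time

import redis
from litellm import completion

r = redis.Redis()
RPM_LIMIT = 60  # assumed budget for the deployment

def reserve_slot(deployment: str) -> None:
    """Block until the shared per-minute counter has room for one more request."""
    while True:
        window = int(time.time() // 60)      # current minute bucket
        key = f"rpm:{deployment}:{window}"
        count = r.incr(key)                  # atomic increment shared by all workers
        r.expire(key, 120)                   # old buckets clean themselves up
        if count <= RPM_LIMIT:
            return
        time.sleep(60 - (time.time() % 60))  # wait for the next minute bucket

reserve_slot("gpt-3.5-turbo")
response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)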

@krrishdholakia
Contributor Author

> Failed requests count against limits: As per OpenAI's documentation, unsuccessful requests still contribute to your per-minute limit. Simply retrying failed requests (even from a queue) will:

We wait based on the retry-after header present in the rate limit error, so we don't trigger this issue. Here's the test -

assert int(response_headers["retry-after"]) == cooldown_time

> Needed: Proactive - Track usage and only send requests when we know they won't exceed limits

This already exists. Use rate-limit-aware routing - https://docs.litellm.ai/docs/routing#advanced---routing-strategies-%EF%B8%8F

@lazariv

lazariv commented Nov 14, 2024

Allow configuring the API base URL for the audio/speech endpoint.

Currently only OpenAI, Azure, and Vertex are supported. It would be nice to allow configuring the api_base parameter so that self-hosted TTS engines with an OpenAI-compatible API, such as https://github.com/matatonic/openedai-speech, can be used by setting e.g.:

- model_name: tts
  litellm_params:
    model: openai/tts-1
    api_base: https://local/tts/engine
    api_key: os.environ/OPENAI_API_KEY

@lazariv

lazariv commented Nov 15, 2024

Groups of models

Provide the possibility to create groups of models (e.g. "Free tier models", "Public models", etc.) so that a specific virtual key can be given access to such a group.

Currently a virtual key can only be given access per team, which doesn't scale if many teams are present, and adding a new public model requires editing all teams.

@regismesquita

I would love to have the citations field included in the response body when using Perplexity. So far I have been able to achieve this for non-streaming responses using the success hook, but I had no luck with streaming responses.

@krrishdholakia
Contributor Author

@lazariv this already exists - https://docs.litellm.ai/docs/proxy/tag_routing

@derekalia

Pixtral vision support - mistralai/Pixtral-Large-Instruct-2411

@jtsai-quid

Adding tokenize and detokenize to the LLM utils endpoints, please 🙏

@Tomato6966

I wish LiteLLM would support updating assistants through "PATCH /assistants/:assistantId" and deleting threads through "DELETE /threads/:threadId".

Otherwise: a very great project!
