Hi,
We recently ran into a deadlock in our system when we hit the rate limit for Llama 3 70B on Groq. It was a bit of a mess: the program kept retrying, which only made things worse by consuming more tokens and extending the wait before the rate limit reset.
As a quick fix, I bundled Llama 3, Llama 3.1, and Gemma 2 into a wrapper. The idea was that if we hit a rate limit error with one model, the system would temporarily switch to another. This works because, at least with Groq, the rate limits are per model.
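Here is roughly what that quick fix looks like. This is a minimal sketch: the model IDs and the `RateLimitError` import are assumptions based on Groq's Python SDK (which follows the OpenAI client conventions), so adjust them to whatever you actually run.

```python
# Minimal sketch of the fallback wrapper. The model IDs and the
# RateLimitError type are assumptions based on Groq's Python SDK
# (which mirrors the OpenAI client); adjust to your setup.
from groq import Groq, RateLimitError

client = Groq()  # reads GROQ_API_KEY from the environment

# Models we verified to be interchangeable for our task, in order of
# preference. On Groq, each model has its own rate limit.
FALLBACK_MODELS = [
    "llama3-70b-8192",
    "llama-3.1-70b-versatile",
    "gemma2-9b-it",
]

def complete(messages):
    """Try each model in order, moving on when one is rate-limited."""
    last_err = None
    for model in FALLBACK_MODELS:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError as err:
            last_err = err  # this model is throttled; fall through to the next
    # Every model is throttled: fail loudly instead of retrying in a
    # loop and burning more tokens while the limits reset.
    raise last_err
```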
Now, I made sure beforehand that switching between these models wouldn’t significantly affect our results. But I’m wondering if we could discuss a more formal way to implement this trick.
Think of it like a connection pool for databases, but as a ‘rate limit pool’ for LLMs. How could we design this to be more robust and reusable?
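To make the question concrete, here is a rough, SDK-agnostic sketch of what I mean by a ‘rate limit pool’. All the names here are hypothetical; the point is the shape of the abstraction, not the details.

```python
# Rough sketch of the pool idea. Like a connection pool marks
# connections as busy, this marks models as cooling down after a 429
# and hands out the first one that is ready. All names hypothetical.
import time

class RateLimitPool:
    def __init__(self, models, default_cooldown=60.0):
        # model -> monotonic timestamp at which it may be used again;
        # dicts keep insertion order, so this doubles as preference order
        self._ready_at = {m: 0.0 for m in models}
        self._default_cooldown = default_cooldown

    def acquire(self):
        """Return the most-preferred model that is not cooling down."""
        now = time.monotonic()
        for model, ready_at in self._ready_at.items():
            if now >= ready_at:
                return model
        raise RuntimeError("all models are rate-limited; back off, don't retry")

    def report_rate_limited(self, model, retry_after=None):
        """Mark a model as throttled, ideally using the 429's Retry-After."""
        cooldown = retry_after if retry_after is not None else self._default_cooldown
        self._ready_at[model] = time.monotonic() + cooldown
```

The caller would `acquire()` a model, make the request, and on a rate-limit error feed the response's Retry-After value back via `report_rate_limited()`. When `acquire()` finds everything throttled, sleeping until the earliest cooldown expires seems saner than the retry loop that got us into trouble in the first place.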