docs: update README - advanced usage
tpietruszka committed Aug 17, 2023
1 parent ee06f3f commit b2ef469
Showing 1 changed file with 59 additions and 9 deletions.
README.md
…those of Large Language Models (LLMs).
- handles timeouts (requests will not hang forever)
- raises an exception if the request fails (or the server returns an error / an "invalid" response)
- requests are independent from each other, do not rely on order, can be retried if failed
- the server checks rate limits when it receives a request, and records resource usage upon
completion - right before returning the results (this is a pessimistic assumption).
- we want a standard, simple interface for the user - working the same way in a script and in a
notebook (and most data scientists do not want to deal with asyncio). Therefore, `Runner.run()` is
a blocking call, with the same behavior regardless of the context.
- async use is also supported - via `run_coro()` - but it will not support `KeyboardInterrupt`
handling.

## Usage
In short:
- wrap your function with a `Runner()`, describing the rate limits you have
- call `Runner.schedule()` to schedule a request - with the same arguments as you would call the
original function
- call `Runner.run()` to run the scheduled requests, get the results and any exceptions raised
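
For example - a minimal sketch of the three steps, assuming `Runner` lives in `rate_limited.runner`
and using a stand-in function with no resource limits:

```python
import time

from rate_limited.runner import Runner  # import path assumed


def fetch_joke(topic: str) -> str:
    time.sleep(1)  # stand-in for a slow, rate-limited API call
    return f"A joke about {topic}"


# an empty resources list means no rate limits - fine for a toy example
runner = Runner(fetch_joke, resources=[], max_concurrent=4)

runner.schedule("cats")  # same arguments as calling fetch_joke directly
runner.schedule("dogs")

# blocks until all scheduled requests finish; the exact return shape
# (results plus exceptions) is assumed from the description above
results, exceptions = runner.run()
```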

### Installation
```shell
pip install rate_limited
```
### Creating a Runner
The following arguments are required:
- `function` - the function to be called
- `resources` - a list of `Resource` objects, describing the rate limits you have (see examples below)
- `max_concurrent` - the maximum number of requests to be executed in parallel

Important optional arguments:
- `max_retries` - the maximum number of retries for a single request (default: 5)
- `validation_function` - a function that validates the response and returns `True` if it is valid
(e.g. conforms to the schema you expect). If not valid, the request will be retried.
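
A hedged sketch of a `Runner` with the optional arguments set, reusing `fetch_joke` from the
sketch above (the argument names follow the lists above; the import path is an assumption):

```python
from rate_limited.runner import Runner  # import path assumed


def looks_valid(response: str) -> bool:
    # accept only responses in the shape we expect
    return response.startswith("A joke")


runner = Runner(
    fetch_joke,                       # the function to be called
    resources=[],                     # see the resource examples below
    max_concurrent=8,
    max_retries=3,                    # give up on a request after 3 attempts
    validation_function=looks_valid,  # invalid responses get retried
)
```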

### OpenAI example
```python
# (the body of this example is collapsed in the diff view)
```
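
The body of the example is collapsed above; as a rough reconstruction, a sketch assuming the
pre-1.0 `openai` client and a resources list like the ones described below:

```python
import openai

from rate_limited.runner import Runner  # import path assumed

# placeholder - e.g. requests-per-minute and tokens-per-minute resources,
# as described in "More complex resource descriptions" below
resources = [...]

runner = Runner(openai.ChatCompletion.create, resources, max_concurrent=8)

for topic in ["honey badgers", "llamas"]:
    # same arguments as openai.ChatCompletion.create itself
    runner.schedule(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Tell me about {topic}"}],
        max_tokens=256,
    )

results, exceptions = runner.run()
```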

### More complex resource descriptions

Overall, the core of an API description is a list of `Resource` objects, each describing a single
resource and its limit - e.g. "requests per minute", "tokens per minute", "images per hour", etc.

Each resource has:
- a name (just for logging/debugging)
- a quota (e.g. 100)
- a time window (e.g. 60 seconds)
- functions that extract the "billing" information:
- `arguments_usage_extractor`
- `results_usage_extractor`
- `max_results_usage_estimator`

Two distinct "billing" models are supported:
- Before the call - we register usage before making the call, based on the arguments of the call.

In these cases, just `arguments_usage_extractor` is needed.
- After the call - we register usage after making the call, based on the results of the call,
and possibly the arguments (if needed for some complex cases).

To avoid sending a flood of requests before the first ones complete (and before any usage has
been registered), we need to estimate the maximum usage of each call, based on its arguments. We
"pre-allocate" this usage, then register the actual usage once the call completes.

In these cases, `results_usage_extractor` and `max_results_usage_estimator` are needed.

Both approaches can be used in the same Resource description if needed.
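
To make this concrete - a sketch of a two-resource description; the exact `Resource` import path,
constructor, and extractor signatures here are assumptions based on the field list above:

```python
from rate_limited.resources import Resource  # import path assumed

resources = [
    # "before the call" billing: each call consumes one request up front
    Resource(
        name="requests_per_minute",
        quota=100,
        time_window_seconds=60,
        arguments_usage_extractor=lambda call: 1,
    ),
    # "after the call" billing: pre-allocate a pessimistic estimate based on
    # the arguments, then record actual usage from the response
    Resource(
        name="tokens_per_minute",
        quota=90_000,
        time_window_seconds=60,
        # `call` stands for the scheduled call's arguments - exact shape assumed
        max_results_usage_estimator=lambda call: call.kwargs.get("max_tokens", 0),
        results_usage_extractor=lambda result: result["usage"]["total_tokens"],
    ),
]
```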

Note: it is assumed that resource usage "expires" fully after the time window elapses, without a
"gradual" decline. Some APIs (e.g. OpenAI's) might use a "gradual" approach and differ in other
details, but this model seems sufficient to get "close enough" to the actual rate limits, without
running into them.

See [apis.openai.chat](./rate_limited/apis/openai/chat.py) for an example of a more complex API
description with multiple resources.

### Limiting the number of concurrent requests

The `max_concurrent` argument of `Runner` controls the number of concurrent requests.
```shell
flake8 && black --check . && mypy .
```

## TODOs:
- improved README and usage examples
- more ready-made API descriptions - incl. batched ones?
- fix the "interrupt and resume" test in Python 3.11
### Nice to have:
- (optional) slow start feature - pace the initial requests, instead of sending them all at once
- text-based logging if tqdm is not installed
- if/where possible, detect RateLimitExceeded - notify the user, slow down
- should runner.schedule return futures and enable "streaming" of results?
- support "streaming" and/or continuous operation:
- enable scheduling calls while running and/or getting inputs from generators
- support "streaming" results - perhaps similar to "as_completed" in asyncio?
- add timeouts option? (for now, the user is responsible for handling timeouts)
- OpenAI shares information about rate limits in HTTP response headers - could it be used without
coupling too tightly with their API?