From b2ef4697289ea6edba7587b2b795d237d1188822 Mon Sep 17 00:00:00 2001
From: Tomasz Pietruszka
Date: Thu, 17 Aug 2023 18:27:15 +0100
Subject: [PATCH] docs: update README - advanced usage

---
 README.md | 68 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 59 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 756e901..81ade0b 100644
--- a/README.md
+++ b/README.md
@@ -23,24 +23,33 @@ those of Large Language Models (LLMs).
 - handles timeouts (requests will not hang forever)
 - raises an exception if the request fails (or the server returns an error / an "invalid" response)
 - requests are independent from each other, do not rely on order, can be retried if failed
-- the server verifies rate limits when it gets the request, records usage of resources upon
-  completion - right before returning the results (this is a pessimistic assumption).
 - we want a standard, simple interface for the user - working the same way in a script and in a
   notebook (+ most data scientists do not want to deal with asyncio). Therefore, `Runner.run()` is
   a blocking call, with the same behavior regardless of the context.
 - async use is also supported - via `run_coro()` - but it will not support `KeyboardInterrupt`
   handling.
 
+## Installation
+```shell
+pip install rate_limited
+```
+
 ## Usage
 In short:
 - wrap your function with a `Runner()`, describing the rate limits you have
-- call `Runner.schedule()` to schedule a request
+- call `Runner.schedule()` to schedule a request - with the same arguments as you would call the
+  original function
 - call `Runner.run()` to run the scheduled requests, get the results and any exceptions raised
 
-### Installation
-```shell
-pip install rate_limited
-```
+### Creating a Runner
+The following arguments are required:
+- `function` - the function to be called
+- `resources` - a list of `Resource` objects, describing the rate limits you have (see examples below)
+- `max_concurrent` - the maximum number of requests to be executed in parallel
+
+Important optional arguments:
+- `max_retries` - the maximum number of retries for a single request (default: 5)
+- `validation_function` - a function that validates the response and returns `True` if it is valid
+  (e.g. conforms to the schema you expect). If not valid, the request will be retried.
 
 ### OpenAI example
 ```python
@@ -107,6 +116,46 @@ resources = [
 ]
 ```
 
+### More complex resource descriptions
+
+Overall, the core of an API description is a list of `Resource` objects, each describing a single
+resource and its limit - e.g. "requests per minute", "tokens per minute", "images per hour", etc.
+
+Each resource has:
+- a name (just for logging/debugging)
+- a quota (e.g. 100)
+- a time window (e.g. 60 seconds)
+- functions that extract the "billing" information:
+  - `arguments_usage_extractor`
+  - `results_usage_extractor`
+  - `max_results_usage_estimator`
+
+Two distinct "billing" models are supported - both illustrated in the sketch below:
+- Before the call - we register usage before making the call, based on the arguments of the call.
+
+  In this case, just `arguments_usage_extractor` is needed.
+- After the call - we register usage after making the call, based on the results of the call,
+  and possibly the arguments (if needed for some complex cases).
+
+  To avoid sending a flood of requests before the first ones complete (and before any usage is
+  registered), we need to estimate the maximum usage of each call, based on its arguments. We
+  "pre-allocate" this usage, then register the actual usage after the call completes.
+
+  In this case, `results_usage_extractor` and `max_results_usage_estimator` are needed.
+
+Both approaches can be used in the same `Resource` description if needed.
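+
+For illustration, a hypothetical API with a per-minute request quota and a per-minute token quota
+could be described roughly as follows. This is only a sketch: the import path, the parameter names
+(e.g. `time_window_seconds`) and the shape of the values passed to the extractors are assumptions
+here - check the `Resource` class and the example linked below for the exact interface.
+
+```python
+from rate_limited.resources import Resource
+
+# assumed pessimistic upper bound on the tokens a single call can use
+MAX_TOKENS_PER_CALL = 1_000
+
+resources = [
+    # "before the call" billing: every call costs exactly one request, known upfront
+    Resource(
+        name="requests_per_minute",
+        quota=100,
+        time_window_seconds=60,
+        arguments_usage_extractor=lambda call: 1,
+    ),
+    # "after the call" billing: actual token usage is only known from the result
+    # (assumed dict-shaped here), so the maximum is "pre-allocated" when the call
+    # starts and replaced with the actual usage once it completes
+    Resource(
+        name="tokens_per_minute",
+        quota=40_000,
+        time_window_seconds=60,
+        results_usage_extractor=lambda result: result["usage"]["total_tokens"],
+        max_results_usage_estimator=lambda call: MAX_TOKENS_PER_CALL,
+    ),
+]
+```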
+
+Note: it is assumed that resource usage "expires" fully after the time window elapses, without a
+"gradual" decline. Some APIs (e.g. OpenAI) might use the "gradual" approach, and differ in other
+details, but this approach seems sufficient to get "close enough" to the actual rate limits,
+without running into them.
+
+See [apis.openai.chat](./rate_limited/apis/openai/chat.py) for an example of a more complex API
+description with multiple resources.
+
 ### Limiting the number of concurrent requests
 
 The `max_concurrent` argument of `Runner` controls the number of concurrent requests.
 
@@ -159,14 +208,15 @@ flake8 && black --check . && mypy .
 ```
 
 ## TODOs:
-- improved README and usage examples
 - more ready-made API descriptions - incl. batched ones?
 - fix the "interrupt and resume" test in Python 3.11
 
 ### Nice to have:
 - (optional) slow start feature - pace the initial requests, instead of sending them all at once
 - text-based logging if tqdm is not installed
 - if/where possible, detect RateLimitExceeded - notify the user, slow down
-- should runner.schedule return futures and enable "streaming" of results?
+- support "streaming" and/or continuous operation:
+  - enable scheduling calls while running and/or getting inputs from generators
+  - support "streaming" results - perhaps similar to "as_completed" in asyncio?
 - add timeouts option? (for now, the user is responsible for handling timeouts)
 - OpenAI shares information about rate limits in http response headers - could it be used without
   coupling too tightly with their API?