More readme (#33)
tgaddair authored Nov 16, 2023
1 parent d5bcc17 commit f493fb5
Showing 1 changed file with 42 additions and 15 deletions: README.md

The LLM inference server that speaks for the GPUs!

LoRAX (LoRA eXchange) is a framework that allows users to serve over a hundred fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

## 📖 Table of contents

- [LoRAX (LoRA eXchange)](#lora-exchange-lorax)
- [📖 Table of contents](#-table-of-contents)
- [🔥 Features](#-features)
- [🏠 Supported Models and Adapters](#-supported-models-and-adapters)
- [🏃‍♂️ Getting started](#️-getting-started)
- [Docker](#docker)
- [📓 API documentation](#-api-documentation)
- [🛠️ Local Development](#️-local-development)
- [🙇 Acknowledgements](#-acknowledgements)
- [🗺️ Roadmap](#-roadmap)

## 🔥 Features

- 🚅 **Dynamic Adapter Loading:** each set of fine-tuned LoRA weights is loaded from storage just-in-time as requests come in at runtime, without blocking concurrent requests (see the example after this list).
- 🏋️‍♀️ **Tiered Weight Caching:** to support fast exchanging of LoRA adapters between requests, and offloading of adapter weights to CPU and disk to avoid out-of-memory errors.
- 🧁 **Continuous Multi-Adapter Batching:** a fair scheduling policy for optimizing aggregate throughput of the system that extends the popular continuous batching strategy to work across multiple sets of LoRA adapters in parallel.
- 👬 **Optimized Inference:** high-throughput and low-latency optimizations including tensor parallelism, [continuous batching](https://github.com/predibase/lorax/tree/main/router) across different adapters, [flash-attention](https://github.com/HazyResearch/flash-attention), [paged attention](https://github.com/vllm-project/vllm), quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323), token streaming, and weight prefetching and offloading.
- 🚢 **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry.
- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
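
To make dynamic adapter loading concrete, here is a minimal client-side sketch: two requests hit the same deployment, but each names a different fine-tuned adapter, which the server resolves and loads on demand. The endpoint path, payload shape (`parameters.adapter_id`), server address, and adapter names are illustrative assumptions; see the API documentation section below for the authoritative schema.

```python
# Illustrative sketch only: per-request adapter selection against a locally
# running LoRAX server. The /generate endpoint shape, the adapter_id parameter,
# and the adapter names are assumptions for demonstration; consult the server's
# /docs route for the actual API schema.
import requests

LORAX_URL = "http://127.0.0.1:8080"  # assumed local deployment

def generate(prompt: str, adapter_id: str | None = None) -> str:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    if adapter_id is not None:
        # The named adapter is fetched and loaded just-in-time on the server,
        # without blocking requests that use other adapters or the base model.
        payload["parameters"]["adapter_id"] = adapter_id
    resp = requests.post(f"{LORAX_URL}/generate", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Two prompts, two different (hypothetical) fine-tuned adapters, one base model.
print(generate("Summarize this support ticket: ...", adapter_id="acme/support-lora"))
print(generate("Translate to German: ...", adapter_id="acme/translate-lora"))
```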


<p align="center">
<img src="https://github.com/predibase/lorax/assets/29719151/6f4f78fc-c1e9-4a01-8675-dbafa74a2534" />
</p>


## 🏠 Supported Models and Adapters

### Models

- 🦙 [Llama](https://huggingface.co/meta-llama)
- 🌬️ [Mistral](https://huggingface.co/mistralai)

Other architectures are supported on a best-effort basis, but do not support dynamic adapter loading.

### Adapters

LoRAX currently supports LoRA adapters, which can be trained using frameworks like [PEFT](https://github.com/huggingface/peft) and [Ludwig](https://ludwig.ai/).

The following modules can be targeted (see the training sketch after this list):

- `q_proj`
- `k_proj`
- `v_proj`
- `o_proj`
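
As a rough sketch of how such an adapter might be produced, the snippet below uses PEFT to wrap a base model with a LoRA configuration targeting exactly these modules. The base model name, rank, and other hyperparameters are placeholders, not recommendations.

```python
# Minimal sketch of building a LoRA adapter with PEFT that targets the modules
# listed above. The model name, rank, and alpha are placeholder values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # example base model

lora_config = LoraConfig(
    r=8,                      # adapter rank
    lora_alpha=16,            # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# ... train with your framework of choice, then:
model.save_pretrained("my-lora-adapter")  # adapter weights you can later serve
```

The saved adapter directory (or its Hugging Face Hub repository) is what a request would reference when asking LoRAX to serve that fine-tune.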

## 🏃‍♂️ Getting started

### Docker

*[… unchanged lines collapsed …]*

You can consult the OpenAPI documentation of the `lorax` REST API using the `/docs` route.

### 🛠️ Local Development

```
# window 1 (server)
make server-dev
# window 2 (router)
make router-dev
```

### 🙇 Acknowledgements

LoRAX is built on top of HuggingFace's [text-generation-inference](https://github.com/huggingface/text-generation-inference), forked from v0.9.4 (Apache 2.0).

### 🗺️ Roadmap

- [ ] Serve pretrained embedding models
- [ ] Serve embedding model MLP adapters
- [ ] Serve LLM MLP adapters for classification
- [ ] Blend multiple adapters per request
- [ ] SGMV kernel for adapters with different ranks
