From 26788d01e62d7e3e783236954bd0d1313a13bae1 Mon Sep 17 00:00:00 2001 From: Travis Addair Date: Mon, 5 Feb 2024 18:53:42 -0800 Subject: [PATCH] Update README to include model merging (#225) --- README.md | 2 +- docs/index.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 36820bc41..c81986d1f 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin ## 🌳 Features -- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter in your request, it will be loaded just-in-time without blocking concurrent requests. +- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](https://predibase.github.io/lorax/models/adapters/#huggingface-hub), [Predibase](https://predibase.github.io/lorax/models/adapters/#predibase), or [any filesystem](https://predibase.github.io/lorax/models/adapters/#local) in your request, it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](https://predibase.github.io/lorax/guides/merging_adapters/) per request to instantly create powerful ensembles. - 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters. - 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system. - 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming. diff --git a/docs/index.md b/docs/index.md index cefe4df12..983960327 100644 --- a/docs/index.md +++ b/docs/index.md @@ -27,7 +27,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin ## 🌳 Features -- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter in your request, it will be loaded just-in-time without blocking concurrent requests. +- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters.md#huggingface-hub), [Predibase](./models/adapters.md#predibase), or [any filesystem](./models/adapters.md#local) in your request, it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles. - 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters. - 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system. - 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.