From 89033a846e0927d6814728f16a673a4e8e6a3955 Mon Sep 17 00:00:00 2001
From: Michal Moskal
Date: Thu, 18 Jan 2024 01:52:02 +0000
Subject: [PATCH] readme fixes

---
 README.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 037adb95..780b18a7 100644
--- a/README.md
+++ b/README.md
@@ -139,19 +139,23 @@
 Most of computation in AICI Controllers occurs on the CPU, in parallel with the
 This allows for 20-50ms of CPU time for typical models and GPUs.
 With careful engineering, this is more than enough to compute the set of allowed tokens in Rust compiled to Wasm.
-The JavaScript or Python code is then used to glue together such constraints.
+These can be combined either natively in Rust, or via the Python or JavaScript interpreters
+we provide.
 For example, computing the allowed token set in the 32000-strong vocabulary of the Llama model takes:
+
 - about 2.0ms for the Yacc grammar of the C programming language
 - about 0.3ms for a regular expression
 - about 0.2ms for a substring constraint, from a 4kB string
+
 The above numbers are for a single sequence; however, each sequence is processed in a separate process,
 and thus if there are more cores than sequences (which is typical), they are generally applicable.
 They also include the overhead of calling into the Python interpreter implemented in Wasm, and then back into
 Rust-generated Wasm code for the constraint itself.
+They are all well within the 20-50ms budget, so they do not affect the generation time at all.
 
 There is also some overhead in the critical path of sampling. It comes down to about 0.3ms per token
-when executing 10 sequences in parallel.
+when executing 10 sequences in parallel (this is irrespective of the constraint used).
 The overhead goes up to around 0.7ms for 40 sequences
 (though it has not been fully optimized yet).
 
 All measurements done on AMD EPYC 7V13 with nVidia A100 GPU with 80GB of VRAM.
@@ -159,6 +163,7 @@ All measurements done on AMD EPYC 7V13 with nVidia A100 GPU with 80GB of VRAM.
 ## Flexibility
 
 The low-level interface that AICI runtime provides allows for:
+
 - interaction with the LLM inference engine before, during, and after every generated token
 - constraining decoding to a set of tokens
 - backtracking KV-cache to a previous state
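
To make concrete what "computing the set of allowed tokens" means for one of the constraints timed in the patch (the substring constraint over a 4kB string), here is a minimal sketch in Rust. It is not the AICI implementation and every name in it is hypothetical; a real controller, to fit the quoted ~0.2ms per step, would rely on precomputed indexing rather than this naive scan over the vocabulary.

```rust
// Illustrative sketch only (not AICI controller code): a naive way to compute
// the allowed-token set for a substring constraint, i.e. which tokens keep the
// generated text a substring of a given source document. All names are hypothetical.
use std::collections::HashSet;

/// `vocab` maps token id -> token bytes (a stand-in for the 32000-entry Llama
/// vocabulary); `source` is the 4kB document; `generated` is the text produced
/// so far under the constraint.
fn allowed_tokens(vocab: &[Vec<u8>], source: &[u8], generated: &[u8]) -> HashSet<u32> {
    let mut allowed = HashSet::new();
    for (id, tok) in vocab.iter().enumerate() {
        if tok.is_empty() {
            continue; // skip degenerate tokens
        }
        // What the constrained output would look like after emitting this token.
        let mut cand = generated.to_vec();
        cand.extend_from_slice(tok);
        // Allow the token only if the extended output still occurs in the source.
        if cand.len() <= source.len()
            && source.windows(cand.len()).any(|w| w == cand.as_slice())
        {
            allowed.insert(id as u32);
        }
    }
    allowed
}
```

The resulting set would then be handed to the sampler as a token mask; the patch's point is that even this kind of per-token work fits comfortably inside the 20-50ms of CPU time available while the GPU is busy.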