readme fixes
mmoskal committed Jan 18, 2024
1 parent f1ccaf4 commit 89033a8
Showing 1 changed file, README.md, with 7 additions and 2 deletions.
@@ -139,26 +139,31 @@ Most of computation in AICI Controllers occurs on the CPU, in parallel with the
This allows for 20-50ms of CPU time for typical models and GPUs.
With careful engineering,
this is more than enough to compute the set of allowed tokens in Rust compiled to Wasm.
These constraints can be combined either natively in Rust, or via the Python or JavaScript interpreters
we provide.
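
As a rough illustration of what "combined natively in Rust" can look like, the sketch below defines a hypothetical `Constraint` trait, an `And` combinator, and an `allowed_mask` helper that computes the per-token boolean mask over the vocabulary. These names are made up for this example and are not the actual AICI controller API.

```rust
/// Hypothetical sketch, not the AICI controller API: a constraint decides,
/// for each candidate token, whether appending it keeps the output valid.
trait Constraint {
    fn token_allowed(&self, output_so_far: &str, token_text: &str) -> bool;
}

/// Native combination of two constraints: a token must satisfy both.
struct And<A, B>(A, B);

impl<A: Constraint, B: Constraint> Constraint for And<A, B> {
    fn token_allowed(&self, out: &str, tok: &str) -> bool {
        self.0.token_allowed(out, tok) && self.1.token_allowed(out, tok)
    }
}

/// Example constraint: tokens may only contain ASCII digits.
struct DigitsOnly;

impl Constraint for DigitsOnly {
    fn token_allowed(&self, _out: &str, tok: &str) -> bool {
        !tok.is_empty() && tok.chars().all(|c| c.is_ascii_digit())
    }
}

/// Example constraint: the total output may not exceed `max` bytes.
struct MaxLen {
    max: usize,
}

impl Constraint for MaxLen {
    fn token_allowed(&self, out: &str, tok: &str) -> bool {
        out.len() + tok.len() <= self.max
    }
}

/// The per-decoding-step loop: compute a boolean mask over the whole
/// vocabulary. This is the work that has to fit into the 20-50ms budget.
fn allowed_mask(c: &impl Constraint, out: &str, vocab: &[String]) -> Vec<bool> {
    vocab.iter().map(|tok| c.token_allowed(out, tok)).collect()
}

fn main() {
    let vocab: Vec<String> = ["12", "abc", "3", "456789"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let constraint = And(DigitsOnly, MaxLen { max: 8 });
    let mask = allowed_mask(&constraint, "12345", &vocab);
    assert_eq!(mask, vec![true, false, true, false]);
    println!("{mask:?}");
}
```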

For example, computing the allowed token set in the 32,000-token vocabulary of the Llama model takes:

- about 2.0ms for the Yacc grammar of the C programming language
- about 0.3ms for a regular expression
- about 0.2ms for a substring constraint over a 4kB string (see the sketch after this list)
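
For intuition on the substring constraint, a token can be allowed exactly when the text generated so far, extended by that token, still occurs somewhere in the source document. The sketch below is a deliberately naive, hypothetical illustration (a linear scan); the quoted 0.2ms over a 4kB string suggests the real implementation uses a much faster approach.

```rust
/// Naive, hypothetical sketch of a substring constraint (not the AICI
/// implementation): the generated text must always remain a substring
/// of the given source document.
fn substring_token_allowed(document: &str, generated: &str, token_text: &str) -> bool {
    // Allowed if extending the generated text by this token still
    // yields a substring of the document.
    let candidate = format!("{generated}{token_text}");
    document.contains(&candidate)
}

fn main() {
    let doc = "fn main() { println!(\"hello\"); }";
    // "prin" + "tln" = "println", which occurs in the document.
    assert!(substring_token_allowed(doc, "prin", "tln"));
    // "prin" + "ce" = "prince", which does not.
    assert!(!substring_token_allowed(doc, "prin", "ce"));
}
```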

The above numbers are for a single sequence; however, each sequence is processed in a separate process,
so as long as there are more cores than sequences (which is typical), they hold in general.
They also include the overhead of calling into the Python interpreter implemented in Wasm, and then back into
the Rust-generated Wasm code for the constraint itself.
All of these times are well within the 20-50ms budget, so they do not affect generation time at all.

There is also some overhead in the critical path of sampling. It comes down to about 0.3ms per token
when executing 10 sequences in parallel (this is irrespective of the constraint used).
The overhead goes up to around 0.7ms for 40 sequences (though it has not been fully optimized yet).

All measurements were done on an AMD EPYC 7V13 CPU with an NVIDIA A100 GPU with 80GB of VRAM.

## Flexibility

The low-level interface that the AICI runtime provides allows for:

- interaction with the LLM inference engine before, during, and after every generated token
- constraining decoding to a set of tokens
- backtracking KV-cache to a previous state
