readme fixes
mmoskal committed Jan 18, 2024
1 parent f1ccaf4 commit 89033a8
Showing 1 changed file, README.md, with 7 additions and 2 deletions.
@@ -139,26 +139,31 @@ Most of computation in AICI Controllers occurs on the CPU, in parallel with the
This allows for 20-50ms of CPU time for typical models and GPUs.
With careful engineering,
this is more than enough to compute the set of allowed tokens in Rust compiled to Wasm.
These constraints can be combined either natively in Rust, or via the Python or JavaScript interpreters
we provide.
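
As a rough illustration of what "combined natively in Rust" can look like, the sketch below defines a hypothetical `Constraint` trait, an `And` combinator, and an `allowed_mask` helper that computes the per-token boolean mask over the vocabulary. These names are made up for this example and are not the actual AICI controller API.

```rust
/// Hypothetical sketch, not the AICI controller API: a constraint decides,
/// for each candidate token, whether appending it keeps the output valid.
trait Constraint {
    fn token_allowed(&self, output_so_far: &str, token_text: &str) -> bool;
}

/// Native combination of two constraints: a token must satisfy both.
struct And<A, B>(A, B);

impl<A: Constraint, B: Constraint> Constraint for And<A, B> {
    fn token_allowed(&self, out: &str, tok: &str) -> bool {
        self.0.token_allowed(out, tok) && self.1.token_allowed(out, tok)
    }
}

/// Example constraint: tokens may only contain ASCII digits.
struct DigitsOnly;

impl Constraint for DigitsOnly {
    fn token_allowed(&self, _out: &str, tok: &str) -> bool {
        !tok.is_empty() && tok.chars().all(|c| c.is_ascii_digit())
    }
}

/// Example constraint: the total output may not exceed `max` bytes.
struct MaxLen {
    max: usize,
}

impl Constraint for MaxLen {
    fn token_allowed(&self, out: &str, tok: &str) -> bool {
        out.len() + tok.len() <= self.max
    }
}

/// The per-decoding-step loop: compute a boolean mask over the whole
/// vocabulary. This is the work that has to fit into the 20-50ms budget.
fn allowed_mask(c: &impl Constraint, out: &str, vocab: &[String]) -> Vec<bool> {
    vocab.iter().map(|tok| c.token_allowed(out, tok)).collect()
}

fn main() {
    let vocab: Vec<String> = ["12", "abc", "3", "456789"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let constraint = And(DigitsOnly, MaxLen { max: 8 });
    let mask = allowed_mask(&constraint, "12345", &vocab);
    assert_eq!(mask, vec![true, false, true, false]);
    println!("{mask:?}");
}
```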

For example, computing the allowed token set in the 32,000-token vocabulary of the Llama model takes:

- about 2.0ms for the Yacc grammar of the C programming language
- about 0.3ms for a regular expression
- about 0.2ms for a substring constraint over a 4kB string (see the sketch after this list)
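
For intuition on the substring constraint, a token can be allowed exactly when the text generated so far, extended by that token, still occurs somewhere in the source document. The sketch below is a deliberately naive, hypothetical illustration (a linear scan); the quoted 0.2ms over a 4kB string suggests the real implementation uses a much faster approach.

```rust
/// Naive, hypothetical sketch of a substring constraint (not the AICI
/// implementation): the generated text must always remain a substring
/// of the given source document.
fn substring_token_allowed(document: &str, generated: &str, token_text: &str) -> bool {
    // Allowed if extending the generated text by this token still
    // yields a substring of the document.
    let candidate = format!("{generated}{token_text}");
    document.contains(&candidate)
}

fn main() {
    let doc = "fn main() { println!(\"hello\"); }";
    // "prin" + "tln" = "println", which occurs in the document.
    assert!(substring_token_allowed(doc, "prin", "tln"));
    // "prin" + "ce" = "prince", which does not.
    assert!(!substring_token_allowed(doc, "prin", "ce"));
}
```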

The above numbers are for a single sequence; however, each sequence is processed in a separate process,
so as long as there are more cores than sequences (which is typical), they hold in general.
They also include the overhead of calling into the Python interpreter implemented in Wasm, and then back into
the Rust-generated Wasm code for the constraint itself.
All of these times are well within the 20-50ms budget, so they do not affect generation time at all.

There is also some overhead in the critical path of sampling. It comes down to about 0.3ms per token
when executing 10 sequences in parallel (this is irrespective of the constraint used).
The overhead goes up to around 0.7ms for 40 sequences (though it has not been fully optimized yet).

All measurements were done on an AMD EPYC 7V13 CPU with an NVIDIA A100 GPU with 80GB of VRAM.

## Flexibility

The low-level interface that the AICI runtime provides allows for:

- interaction with the LLM inference engine before, during, and after every generated token
- constraining decoding to a set of tokens
- backtracking KV-cache to a previous state
