From 8f9da2599b8c4676aefc2f2e40a5571e559d097c Mon Sep 17 00:00:00 2001 From: Devis Lucato Date: Tue, 6 Feb 2024 17:42:37 -0800 Subject: [PATCH] Update README, walk through example end to end --- README.md | 365 +++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 253 insertions(+), 112 deletions(-) diff --git a/README.md b/README.md index 7ed0e4bc..61f3d138 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # Artificial Intelligence Controller Interface (AICI) The Artificial Intelligence Controller Interface (AICI) lets you build Controllers that constrain and direct output of a Large Language Model (LLM) in real time. -Controllers are flexible programs capable of implementating constrained decoding, dynamic editing of prompts and generated text, and coordinating execution across multiple, parallel generations. +Controllers are flexible programs capable of implementing constrained decoding, dynamic editing of prompts and generated text, and coordinating execution across multiple, parallel generations. Controllers incorporate custom logic during the token-by-token decoding and maintain state during an LLM request. This allows diverse Controller strategies, from programmatic or query-based decoding to multi-agent conversations to execute efficiently in tight integration with the LLM itself. **The purpose of AICI is to make it easy to build and experiment with both existing and entirely new Controller strategies for improving LLM generations.** @@ -25,162 +25,303 @@ AICI is a prototype, designed and built at [Microsoft Research](https://www.micr > [!TIP] > We are [looking for a research intern](https://jobs.careers.microsoft.com/us/en/job/1659267). You have to be accepted or currently enrolled in a PhD program or an equivalent research-oriented program in Computer Science or related STEM field. -## Getting started +# QuickStart: Example Walkthrough -This repository contains a number of components, and which ones you need depends on your use case. +In this quickstart, we'll guide you through the following steps: -You can **use an existing controller module**. -We provide [PyCtrl](./pyctrl) and [JsCtrl](./jsctrl) -that let you script controllers using server-side Python and JavaScript, respectively. -The [pyaici](./pyaici) package contains `aici` command line tool that lets you -[upload and run scripts](./proxy.md) with any controller -(we also provide [REST API definition](./REST.md) for the curious). -> 🧑‍💻[Python code samples for scripting PyCtrl](./pyctrl) and a [JavaScript Hello World for JSCtrl](./jsctrl/samples/hello.js) +* Setting up **rLLM Server** and **AICI Runtime**. +* Building and deploying a **Controller**. +* Utilizing AICI to control LLM output, enabling the customization of an LLM to **generate text adhering to specific rules**. -We anticipate [libraries](#architecture) will be built on top of controllers. -We provide an example in [promptlib](./promptlib) - a client-side Python library -that generates interacts with [DeclCtrl](./declctrl) via the pyaici package. -> 🧑‍💻 [Example notebook that uses PromptLib to interact with DeclCtrl](./promptlib/notebooks/basics_tutorial.ipynb). +## Development Environment Setup -The controllers can be run in a cloud or local AICI-enabled LLM inference engine. -You can **run the provided reference engine (rLLM) locally** with either -[libtorch+CUDA](./rllm-cuda) or [llama.cpp backend](./rllm-cpp). +Begin by preparing your development environment for compiling AICI components, primarily coded in Rust. 
Additionally, ensure that Python 3.11 or later is installed, as it is essential for crafting controllers. -To **develop a new controller**, use a Rust [starter project](./uppercase) that shows usage of [aici_abi](./aici_abi) -library, which simplifies implementing the [low-level AICI interface](aici_abi/README.md#low-level-interface). -> 🧑‍💻[Sample code for a minimal new controller](./uppercase) to get you started +### Windows WSL / Linux / macOS -To **add AICI support to a new LLM inference engine**, -you will need to implement LLM-side of the [protocol](aicirt/aicirt-proto.md) -that talks to [AICI runtime](aicirt). +> [!NOTE] +> **Windows users**: please use a devcontainer or WSL2, as per the [Linux instructions](#build-setup-on-linux-including-wsl2). Native Windows support [is tracked here](https://github.com/microsoft/aici/issues/42). +> +> **MacOS users**: please make sure you have XCode command line tools installed by running `xcode-select -p` and if not installed, run `xcode-select --install`. -Finally, you may want to modify any of the provided components - PRs are most welcome! +Using the system package manager, install the necessary tools for building code in the repository, including `git`, `cmake` and `ccache`. -To continue, follow one of the build setups below, and continue -with [running the server](#running-local-server) and [interacting with the server](#interacting-with-server) afterwards. +For instance in WSL / Ubuntu using `apt`: -### Build setup with devcontainers + sudo apt-get install -y --no-install-recommends build-essential cmake ccache pkg-config libssl-dev libclang-dev clang llvm-dev git-lfs -All of the use cases above, except for running an existing controller on remote server, -require a working [Rust compiler](https://www.rust-lang.org/tools/install), -while compiling rllm-cuda also requires libtorch and CUDA. +or using Homebrew on macOS: -- **AICI Client-side** has Rust and C/C++ compilers for developing controllers, - [rLLM on llama.cpp](./rllm-cpp) and [aicirt](./aicirt) -- **AICI with CUDA** has all of the above, plus CUDA and libtorch for - [rLLM on libtorch](./rllm-cuda); - this requires a CUDA-capable GPU (currently only compute capability 8.0 is supported; this includes A100+ or GeForce 30x0/40x0) -- **AICI with CUDA and vLLM (experimental)** is for our outdated vLLM integration + brew install git + brew install cmake + brew install ccache -If you're not familiar with [devcontainers](https://containers.dev/), -you need to install the [Dev Containers VSCode extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) -and from the command palette in VSCode select **Dev Containers: Reopen in Container...**. -It pops a list of available devcontainers, select the one you want to use. +Then install **Rust, Rustup and Cargo** following the instructions provided [here](https://doc.rust-lang.org/cargo/getting-started/installation.html) and [here](https://www.rust-lang.org/learn/get-started). -### Build setup on Linux (including WSL2) + curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -This should be roughly equivalent to the **AICI Client-side** devcontainer. -See also [common.dockerfile](.devcontainer/common.dockerfile). +After installation, verify that the `rustup --version` command is accessible by running it from the terminal. If the command isn't recognized, try opening a new terminal session. 
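For example, a quick sanity check that the toolchain is on your PATH (the exact version numbers will differ on your machine):

    rustup --version
    cargo --version
    python3 --version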
+ +Next install wasm32-wasi component: + + rustup target add wasm32-wasi -- install required packages; it's likely you already have some or all of these - but the list should be exhaustive for fresh Ubuntu-22.04 install in WSL +If you already had Rust installed, or are getting complaints from cargo about outdated versions, run: -```bash -sudo apt-get install -y --no-install-recommends \ - build-essential ca-certificates ccache \ - cmake curl libjpeg-dev libpng-dev \ - strace linux-tools-common linux-tools-generic \ - llvm-dev libclang-dev clang ccache apache2-utils git-lfs \ - screen bsdmainutils pip python3-dev python-is-python3 \ - nodejs npm pkg-config + rustup update -pip install pytest pytest-forked ujson posix_ipc numpy requests -``` +Finally, if you plan working with **Python** controllers and scripts, install these packages: + + pip install pytest pytest-forked ujson posix_ipc numpy requests -- [install](https://www.rust-lang.org/tools/install) rustup and restart current shell -```bash -curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +## Build and start rLLM server and AICI Runtime + +Clone the AICI repository and proceed with the next steps outlined below. + +### Using CUDA (rllm-cuda) + +If your platform has a CUDA-capable GPU (currently only compute capability 8.0 is supported; this includes A100+ or GeForce 30x0/40x0), navigate to the `rllm-cuda` folder and run the command below. This folder contains specific implementations tailored for CUDA-supported platforms. + + cd rllm-cuda/ + ./server.sh build + +After completing the build process, you can start it, specifying a model name, URL, or path as a parameter: + + ./server.sh phi2 + +If you prefer using Orca-2 13B model run this command: + + ./server.sh orca + +You can find more details about `rllm-cuda` [here](rllm-cuda/README.md). + +### Using llama.cpp (rllm-cpp) + +For those utilizing Apple ARM-based M series processors or lacking CUDA capabilities but still aiming to run models on the CPU, use the following command from the `rllm-cuda` folder, which is derived from the llama.cpp project: + + cd rllm-cpp + ./cpp-server.sh phi2 + +After completing the build process, you can start it, specifying a model name, URL, or path as a parameter: + + ./cpp-server.sh phi2 + +You can find more details about `rllm-cpp` [here](rllm-cpp/README.md). + +### Server overview + +At this stage, your rLLM server instance should be up and running, nearly prepared to handle incoming requests. The following diagram illustrates how AICI and LLMs utilize CPU and GPU resources: + +```mermaid +erDiagram + Host ||--|{ CPU : "" + Host ||--|{ GPU : "" + + CPU ||--|| "rLMM Server" : execute + CPU ||--|{ "AICI Runtime" : execute + + GPU ||--|{ "LLM token generation" : execute +``` + +The rLLM server provides an HTTP interface, utilized for both configuration tasks and sending requests. You can also utilize this interface to promptly verify its status. For instance, if you open http://127.0.0.1:4242/v1/models, you should see: + +```json +{ + "object": "list", + "data": [ + { + "object": "model", + "id": "TheBloke/phi-2-GGUF", + "created": 946810800, + "owned_by": "owner" + } + ] +} ``` -- install rustup components: +confirming that the selected model is loaded. + +## Control AI output using AICI controller + +AICI provides the capability to host custom logic known as **Controllers**, enabling the initiation, termination, and interaction with LLMs token generation. 
Each controller accepts input arguments, processes them, and returns a result comprising logs, LLM tokens, and variables. -```bash -rustup target add wasm32-wasi -rustup component add rustfmt +The repository includes some examples, in particular: + +* **jsctrl**: a controller that accepts JavaScript code as input for execution. This code can interact with the model to generate text and tokens. +* **pyctrl**: a controller that accepts Python code as input for execution. This code can also interact with the model to generate text and tokens. + +In this example we'll utilize **pyctrl** to manage token generation using a simple **Python script**. It's important to note that controllers require building and deployment, while scripts are sent with each request. + +### Build and Upload pyctrl controller + +Execute the following command to build and upload the controller to the rLLM server: + + ./aici.sh build pyctrl/ --tag pyctrl-latest + +The command utilizes the `aici.sh` utility to build the code in the `pyctrl/` folder, assigning a tag to the deployment. You can view all the deployed tags at http://127.0.0.1:4242/v1/controllers/tags: + +```json +{ + "tags": [ + { + "tag": "pyctrl-latest", + "module_id": "684885754ec0f7620efc07a79733875450707116f0edb3fdbbcc26c6751b428f", + "updated_at": 1707261407, + "updated_by": "localhost", + "wasm_size": 13941724, + "compiled_size": 39470880 + } + ] +} ``` -- if you already had rust installed, or are getting complaints from cargo about outdated version, -run: +At this point, you should have an rLLM server instance running with your controller, fully prepared to handle incoming requests. The following diagram integrates the controller just uploaded: + +```mermaid +erDiagram + Host ||--|{ CPU : "" + Host ||--|{ GPU : "" + + CPU ||--|| "rLMM Server" : execute + CPU ||--|{ "AICI Runtime" : execute + + "AICI Runtime" ||--|| "Controller" : instantiate -```bash -rustup update -rustup target add wasm32-wasi + GPU ||--|{ "LLM token generation" : execute ``` -- now, [build and run local server](#running-local-server) +### Controlling the LLM token generation -### Build setup on macOS +Suppose we aim for a model to generate a list, adhering to a specific format and containing only five items. -Make sure you have XCode command line tools installed -by running `xcode-select -p` and if not installed, run `xcode-select --install`. +Typically, achieving this involves prompt engineering, crafting the prompt precisely with clear instructions, such as: -Install required packages via brew: +``` +What are the most popular types of vehicles? +Return the result as a numbered list. +Do not add explanations, only the list. +``` -```bash -brew install cmake git ccache +The prompt would also vary depending on the model in use, given that each model tend to add explanations and understand instructions in different ways. + +With AICI, we shift control back to code, and we can simplify the prompt to: + +``` +What are the most popular types of vehicles? ``` -Install rustup as per the [Linux instructions](#build-setup-on-linux-including-wsl2) above. +using code to: -[Build](#running-local-server) the `rllm-cpp`; it should auto-detect and use Metal acceleration on Apple Silicon. +1. Prevent the model from adding some initial explanation +2. Limit the list to 5 items +3. Format to a numbered list +4. Stop the model from adding some text after the list. 
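Before looking at the full script, here is the smallest possible sketch of the pattern (illustrative only, not one of the repository samples): `aici.FixedTokens(...)` forces tokens that the model must emit verbatim, while `aici.gen_text(...)` lets the model generate until a stop condition is met.

```python
import pyaici.server as aici

async def main():
    # Forced tokens: the model does not get to choose these.
    await aici.FixedTokens("What are the most popular types of vehicles?\n1.")
    # Sampled tokens: the model generates until the first newline.
    await aici.gen_text(stop_at="\n")

aici.start(main())
```

The full walkthrough script below applies the same two primitives in a loop to enforce the numbered-list structure.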
-### Build setup on Windows +Let's create a `list-of-five.py` python file with the following content: -Please use a devcontainer or WSL2, as per the [Linux instructions](#build-setup-on-linux-including-wsl2) above. +```python +import pyaici.server as aici -[Tracking issue](https://github.com/microsoft/aici/issues/42) for native Windows support. +# Force the model to generate a well formatted list of 5 items, e.g. +# 1. name 1 +# 2. name 2 +# 3. name 3 +# 4. name 4 +# 5. name 5 +async def main(): + + # This is the prompt we want to run. Note that the prompt doesn't mention a number of vehicles. + prompt = "What are the most popular types of vehicles?\n" -### Running local server + # Tell the model to generate the prompt string, ie. let's start with the prompt "to complete" + await aici.FixedTokens(prompt) -If you have CUDA, go to `rllm-cuda/` and run `./server.sh orca`. -This will run the inference server with Orca-2 13B model (which is expected by testcases). + # Store the current position in the token generation process + marker = aici.Label() -If you don't have CUDA, go to `rllm-cpp/` and run `./cpp-server.sh phi2` -(phi2 is small enough to run on a CPU). -You can also pass GGUF URL on HuggingFace. + for i in range(1,6): + # Tell the model to generate the list number + await aici.FixedTokens(f"{i}.") -Both of these commands first compile aicirt and the inference engine, -and then run it. -You can also try other models, see README.md files for [rllm-cuda](rllm-cuda/README.md) and -[rllm-cpp](rllm-cpp/README.md) as well as the shell scripts themselves for details. + # Wait for the model to generate a vehicle name and end with a new line + await aici.gen_text(stop_at = "\n") -### Interacting with server + await aici.FixedTokens("\n") -To get started interacting with a cloud AICI server first export the API key. -If running local server, leave `AICI_API_BASE` unset. + # Store the tokens generated in a result variable + aici.set_var("result", marker.text_since()) -```bash -export AICI_API_BASE="https://inference.example.com/v1/#key=wht_..." +aici.start(main()) ``` -Now, use query the model with or without AICI Controller: +Running the script is not too different from sending a prompt. In this case, we're sending control logic and instructions all together. -```bash -./aici.sh infer "The answer to the ultimate question of life" -./aici.sh run --build pyctrl pyctrl/samples/test.py -./aici.sh run --build jsctrl jsctrl/samples/hello.js -./aici.sh run --build aici_abi::yesno +To see the final result, execute the following command: + + ./aici.sh run list-of-five.py + +Result: +``` +Running with tagged AICI Controller: pyctrl-latest +[0]: FIXED 'What are the most popular types of vehicles?\n' +[0]: FIXED '1.' +[0]: GEN ' Sedans\n' +[0]: FIXED '2.' +[0]: GEN ' SUVs\n' +[0]: FIXED '3.' +[0]: GEN ' Trucks\n' +[0]: FIXED '4.' +[0]: GEN ' Sports cars\n' +[0]: FIXED '5.' +[0]: GEN ' Minivans\n' +[0]: FIXED '\n' +[DONE] +[Response] What are the most popular types of vehicles? +1. Sedans +2. SUVs +3. Trucks +4. Sports cars +5. Minivans + + +response saved to tmp/response.json +Usage: {'sampled_tokens': 17, 'ff_tokens': 38, 'cost': 72} +Storage: {'result': '1. Sedans\n2. SUVs\n3. Trucks\n4. Sports cars\n5. Minivans\n\n'} ``` -Run `./aici.sh -h` to see usage info. +# Comprehensive Guide: Exploring Further + +This repository contains a number of components, and which ones you need depends on your use case. + +You can **use an existing controller module**. 
+We provide [PyCtrl](./pyctrl) and [JsCtrl](./jsctrl) +that let you script controllers using server-side Python and JavaScript, respectively. +The [pyaici](./pyaici) package contains `aici` command line tool that lets you +[upload and run scripts](./proxy.md) with any controller +(we also provide [REST API definition](./REST.md) for the curious). +> 🧑‍💻[Python code samples for scripting PyCtrl](./pyctrl) and a [JavaScript Hello World for JSCtrl](./jsctrl/samples/hello.js) + +We anticipate [libraries](#architecture) will be built on top of controllers. +We provide an example in [promptlib](./promptlib) - a client-side Python library +that generates interacts with [DeclCtrl](./declctrl) via the pyaici package. +> 🧑‍💻 [Example notebook that uses PromptLib to interact with DeclCtrl](./promptlib/notebooks/basics_tutorial.ipynb). -If the server is running with Orca-2 13B model, -you can also run tests with `pytest` for the DeclCtrl, -with `./scripts/test-pyctrl.sh` for PyCtrl, -or with `./scripts/test-jsctrl.sh` for JsCtrl. +The controllers can be run in a cloud or local AICI-enabled LLM inference engine. +You can **run the provided reference engine (rLLM) locally** with either +[libtorch+CUDA](./rllm-cuda) or [llama.cpp backend](./rllm-cpp). + +To **develop a new controller**, use a Rust [starter project](./uppercase) that shows usage of [aici_abi](./aici_abi) +library, which simplifies implementing the [low-level AICI interface](aici_abi/README.md#low-level-interface). +> 🧑‍💻[Sample code for a minimal new controller](./uppercase) to get you started + +To **add AICI support to a new LLM inference engine**, +you will need to implement LLM-side of the [protocol](aicirt/aicirt-proto.md) +that talks to [AICI runtime](aicirt). + +Finally, you may want to modify any of the provided components - PRs are most welcome! -## Architecture +# Architecture AICI abstracts LLM inference engine from the controller and vice-versa, as in the picture below. The rounded nodes are aspirational. @@ -212,7 +353,7 @@ The support for [HuggingFace Transformers](harness/run_hf.py) and [vLLM REST server](harness/vllm_server.py) is currently out of date. Please use the [rLLM-cuda](rllm-cuda) or [rLLM-llama-cpp](rllm-cpp) for now. -## Security +# Security - `aicirt` runs in a separate process, and can run under a different user than the LLM engine - Wasm modules are [sandboxed by Wasmtime](https://docs.wasmtime.dev/security.html) @@ -226,7 +367,7 @@ Please use the [rLLM-cuda](rllm-cuda) or [rLLM-llama-cpp](rllm-cpp) for now. In particular, Wasm modules cannot access the filesystem, network, or any other resources. They also cannot spin threads or access any timers (this is relevant for Spectre/Meltdown attacks). -## Performance +# Performance Most of computation in AICI Controllers occurs on the CPU, in parallel with the logit generation on the GPU. The generation occurs in steps, where logits are generated in parallel for a new token for each sequence in a batch @@ -265,7 +406,7 @@ This is 10-100x better than JavaScript or Python. All measurements done on AMD EPYC 7V13 with nVidia A100 GPU with 80GB of VRAM. -## Flexibility +# Flexibility The low-level interface that AICI runtime provides allows for: @@ -287,7 +428,7 @@ We also provide [Python](pyctrl) and [JavaScript](jsctrl) interpreters that allow to glue these constraints together. All of these can be easily extended. 
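As an illustration of that glue layer, here is a small PyCtrl-style sketch that chains two constrained generations and exposes both results as variables. It sticks to the primitives demonstrated in the walkthrough above (`FixedTokens`, `gen_text`, `Label`, `set_var`); the other constraint types listed in this section (regular expressions, grammars, forking, backtracking) have their own entry points that should be checked against the controller's documentation.

```python
import pyaici.server as aici

async def main():
    # Step 1: ask for a language name and capture only the generated part.
    await aici.FixedTokens("Name one programming language:\n")
    lang = aici.Label()
    await aici.gen_text(stop_at="\n")
    aici.set_var("language", lang.text_since())

    # Step 2: ask for a one-line description of the answer above.
    await aici.FixedTokens("Describe it in one sentence:\n")
    desc = aici.Label()
    await aici.gen_text(stop_at="\n")
    aici.set_var("description", desc.text_since())

aici.start(main())
```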
-## Acknowledgements +# Acknowledgements - [Flash Attention kernels](tch-cuda/kernels/flash_attn/) are copied from [flash-attention repo](https://github.com/Dao-AILab/flash-attention); @@ -311,7 +452,7 @@ All of these can be easily extended. - the [example ANSI C grammar](aici_abi/grammars/c.y) is based on https://www.lysator.liu.se/c/ANSI-C-grammar-y.html by Jeff Lee (from 1985) -## Contributing +# Contributing This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us @@ -325,7 +466,7 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. -## Trademarks +# Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow