Merge pull request #13 from nerdai/nerdai/arc-finetuning-cookbook-llamaindex

Add updated llama-index cookbook
sfc-gh-cnantasenamat authored Sep 24, 2024
2 parents 7a9a038 + c3d75e3 commit 9882dc4
Showing 42 changed files with 4,070 additions and 3,343 deletions.
3 changes: 3 additions & 0 deletions recipes/llamaindex/.gitignore
@@ -11,3 +11,6 @@ pyproject.local.toml
__pycache__
data
notebooks
.DS_Store
finetuning_examples
finetuning_assets
48 changes: 0 additions & 48 deletions recipes/llamaindex/Dockerfile

This file was deleted.

197 changes: 152 additions & 45 deletions recipes/llamaindex/README.md
@@ -1,72 +1,179 @@
# ARC Task (LLM) Solver With Human Input

The Abstraction and Reasoning Corpus ([ARC](https://github.com/fchollet/ARC-AGI)) for Artificial General Intelligence
benchmark aims to measure an AI system's ability to efficiently learn new skills.
Each task within the ARC benchmark contains a unique puzzle that systems attempt
to solve. Currently, the best AI systems achieve a 34% solve rate, whereas humans
achieve 85% ([source](https://www.kaggle.com/competitions/arc-prize-2024/overview/prizes)).

<p align="center">
<img height="300" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/arc-task.svg" alt="cover">
</p>

Motivated by this large disparity, we built this app with the goal of injecting
human-level reasoning into LLMs on this benchmark. Specifically, the app enables
LLMs and humans to collaborate on solving an ARC task, and these collaborations
can then be used to fine-tune the LLM.

The Solver itself is a LlamaIndex `Workflow` whose `Context` is maintained
across successive runs. This makes for an effective implementation of the
human-in-the-loop pattern.

<p align="center">
<img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/human-in-loop-2.excalidraw.svg" alt="cover">
</p>
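
Below is a minimal sketch of this pattern, assuming the `llama-index` workflows
API available around this release; the class, step, and context key names are
illustrative, not the app's actual code. The key idea is passing the same
`Context` back into `run()` so a critique from one attempt informs the next:

```python
import asyncio

from llama_index.core.workflow import (
    Context,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)


class SolverSketchWorkflow(Workflow):
    @step
    async def predict(self, ctx: Context, ev: StartEvent) -> StopEvent:
        # Read the critique stored by a previous run (possibly edited
        # by a human) and fold it into the next prediction.
        critique = await ctx.get("critique", default=None)
        prediction = f"prediction informed by: {critique}"
        await ctx.set("critique", "LLM- or human-authored critique")
        return StopEvent(result=prediction)


async def main() -> None:
    workflow = SolverSketchWorkflow()
    handler = workflow.run()  # first attempt
    print(await handler)
    # Re-run with the same Context so the critique carries over.
    handler = workflow.run(ctx=handler.ctx)
    print(await handler)


if __name__ == "__main__":
    asyncio.run(main())
```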

## Running The App

Before running the Streamlit app, we must first download the ARC dataset. The
command below downloads the dataset and stores it in a directory named `data/`:

```sh
wget https://github.com/fchollet/ARC-AGI/archive/refs/heads/master.zip -O ./master.zip
unzip ./master.zip -d ./
mv ARC-AGI-master/data ./
rm -rf ARC-AGI-master
rm master.zip
```

Next, we must install the app's dependencies. To do so, we can use `poetry`:

```sh
poetry shell
poetry install
```

Finally, to run the Streamlit app:

```sh
export OPENAI_API_KEY=<FILL-IN> && streamlit run arc_finetuning_st/streamlit/app.py
```

## How To Use The App

In the next two sections, we discuss how to use the app to solve a given ARC
task.

<p align="center">
<img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/arc-task-solver-app.svg" alt="cover">
</p>

### Solving an ARC Task

Each ARC task consists of training examples, each of which consists of an input
and an output grid. A common pattern maps each input to its output; the task is
solved by uncovering this pattern, and a candidate solution can be verified
against the included test examples.
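
Each task is stored as a JSON file in the downloaded dataset. The snippet below
is a minimal illustration of that structure, with toy grid values rather than a
real task:

```json
{
  "train": [
    { "input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]] }
  ],
  "test": [
    { "input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]] }
  ]
}
```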

To solve the task, we cycle through the following three steps:

1. Prediction (of test output grid)
2. Evaluation
3. Critique (human in the loop)

(Under the hood, a LlamaIndex `Workflow` implements each of these as a `step`.)

Step 1 uses an LLM to produce the Prediction, whereas Step 2 is deterministic:
a simple comparison between the ground-truth test output and the Prediction. If
the Prediction doesn't match the ground-truth grid, then Step 3 is performed.
As in Step 1, an LLM is prompted to generate a Critique of the Prediction,
explaining why it may not match the pattern underlying the train input and
output pairs. However, we also allow a human in the loop to override this
LLM-generated Critique.

The Critique is carried over from one cycle to the next in order to generate an
improved, and hopefully correct, next Prediction.
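
Step 2 amounts to an exact, cell-by-cell grid comparison. A minimal sketch of
what such a deterministic check looks like (the function name is illustrative,
not the app's API):

```python
# Deterministic evaluation (Step 2): an exact comparison between the
# predicted grid and the ground-truth test output. Illustrative only.
Grid = list[list[int]]


def evaluate(prediction: Grid, ground_truth: Grid) -> bool:
    """Return True only if every cell of the prediction matches."""
    return prediction == ground_truth
```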

To begin, click the `Start` button found in the top-right corner. If the
prediction is incorrect, you can view the Critique produced by the LLM in the
designated text area. You can choose to use this Critique or supply your own by
overwriting the text and applying the change. Once ready to produce the next
prediction, hit the `Continue` button.

### Saving solutions for fine-tuning

Any collaboration session involving the LLM and human can be saved and used to
fine-tune an LLM. In this app, we use OpenAI LLMs, so the fine-tuning examples
adhere to the [OpenAI fine-tuning API](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset).
Click the `fine-tuning example` button during a session to see the current
example that can be used for fine-tuning.

<p align="center">
<img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/finetuning-arc-example.svg" alt="cover">
</p>
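
Each saved example follows OpenAI's chat fine-tuning format: one JSON object
per line of a `jsonl` file, with `messages` in the chat-completions shape. The
message contents below are illustrative placeholders, not a real saved example:

```json
{"messages": [{"role": "system", "content": "You are an ARC task solver."}, {"role": "user", "content": "<task examples and critique>"}, {"role": "assistant", "content": "<predicted output grid>"}]}
```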

## Fine-tuning (with `arc-finetuning-cli`)

After you've created your fine-tuning examples (you'll need at least 10 of them),
you can submit a job to OpenAI to fine-tune an LLM on them. To do so, we provide
a convenient command-line tool powered by LlamaIndex packages such as
`llama-index-finetuning`.

```sh
arc finetuning cli tool.

options:
-h, --help show this help message and exit

commands:
{evaluate,finetune,job-status}
evaluate Evaluation of ARC Task predictions with LLM and ARCTaskSolverWorkflow.
finetune Finetune OpenAI LLM on ARC Task Solver examples.
job-status Check the status of finetuning job.
```

### Submitting a fine-tuning job

To submit a fine-tuning job, use any of the following three `finetune` commands:

```sh
# submit a new finetune job using the specified llm
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06

# submit a new finetune job that continues from previously finetuned model
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61

# submit a new finetune job that continues from the most recent finetuned model
arc-finetuning-cli finetune --continue-latest
```

The commands above take care of compiling all of the individual fine-tuning
JSON examples (i.e., those stored in `finetuning_examples/`) into a single
`jsonl` file that is then passed to the OpenAI fine-tuning API.

### Checking the status of a fine-tuning job

After submitting a job, you can check its status using the below CLI commands:

```sh
arc-finetuning-cli job-status -j ftjob-WYySY3iGYpfiTbSDeKDZO0YL -m gpt-4o-2024-08-06

# or check status of the latest job submission
arc-finetuning-cli job-status --latest
```

## Evaluation

You can evaluate the `ARCTaskSolverWorkflow` and a specified LLM on the ARC test
dataset. You can even supply a fine-tuned LLM here.

```sh
# evaluate ARCTaskSolverWorkflow single attempt with gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06

# evaluate ARCTaskSolverWorkflow single attempt with a previously fine-tuned gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61
```

You can also specify certain parameters to control the speed of execution so as
to not run into `RateLimitError`s from OpenAI.

```sh
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --batch-size 5 --num-workers 3 --sleep 10
```

In the above command, `batch-size` refers to the number of test cases handled in
a single batch. In total, there are 400 test cases. Moreover, `num-workers` is
the maximum number of concurrent async calls allowed to be made to the OpenAI
API at any given moment. Finally, `sleep` is the amount of time in seconds that
execution halts before moving on to the next batch of test cases.
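
For example, with `--batch-size 5` the 400 test cases are processed in 80
batches, so `--sleep 10` adds roughly 80 × 10 = 800 seconds of idle time over
the full evaluation run.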
1 change: 0 additions & 1 deletion recipes/llamaindex/arc-finetuning/placeholder.md

This file was deleted.

