Merge pull request #13 from nerdai/nerdai/arc-finetuning-cookbook-llamaindex

Add updated llama-index cookbook

Showing 42 changed files with 4,070 additions and 3,343 deletions.
`.gitignore`:

```diff
@@ -11,3 +11,6 @@ pyproject.local.toml
 __pycache__
 data
 notebooks
+.DS_Store
+finetuning_examples
+finetuning_assets
```
# ARC Task (LLM) Solver With Human Input

The Abstraction and Reasoning Corpus ([ARC](https://github.com/fchollet/ARC-AGI)) for Artificial General Intelligence benchmark aims to measure an AI system's ability to efficiently learn new skills. Each task within the ARC benchmark contains a unique puzzle that the system attempts to solve. Currently, the best AI systems achieve a 34% solve rate, whereas humans achieve 85% ([source](https://www.kaggle.com/competitions/arc-prize-2024/overview/prizes)).
<p align="center">
  <img height="300" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/arc-task.svg" alt="cover">
</p>
Motivated by this large disparity, we built this app with the goal of injecting human-level reasoning into LLMs on this benchmark. Specifically, the app enables LLMs and humans to collaborate on solving an ARC task, and these collaborations can then be used for fine-tuning the LLM.
The Solver itself is a LlamaIndex `Workflow` that relies on successive runs, with `Context` maintained from previous runs. This allows for an effective implementation of the human-in-the-loop pattern (a minimal sketch follows the diagram below).
<p align="center">
  <img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/human-in-loop-2.excalidraw.svg" alt="cover">
</p>
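As a rough sketch of this successive-runs pattern (not this repo's actual code; it assumes the `Workflow`, `Context`, and event classes from `llama-index-core`), state set on a shared `Context` in one run is visible to the next, which is how a critique can be carried forward:

```python
# A minimal sketch (not this repo's actual code) of successive Workflow runs
# sharing state via a common Context, assuming llama-index-core's Workflow API.
import asyncio

from llama_index.core.workflow import (
    Context,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)


class SolverSketch(Workflow):
    @step
    async def predict(self, ctx: Context, ev: StartEvent) -> StopEvent:
        # Critique stored by a previous run (None on the very first run).
        critique = await ctx.get("critique", default=None)
        prediction = f"prediction (informed by critique: {critique})"
        # A human-supplied critique, when given, overrides the LLM's.
        await ctx.set("critique", ev.get("human_critique") or "llm critique")
        return StopEvent(result=prediction)


async def main() -> None:
    workflow = SolverSketch()
    ctx = Context(workflow)  # shared across runs to keep state
    print(await workflow.run(ctx=ctx))
    print(await workflow.run(ctx=ctx, human_critique="rotate the grid 90 degrees"))


asyncio.run(main())
```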
## Running The App

Before running the Streamlit app, we must first download the ARC dataset. The command below downloads the dataset and stores it in a directory named `data/`:

```sh
wget https://github.com/fchollet/ARC-AGI/archive/refs/heads/master.zip -O ./master.zip
unzip ./master.zip -d ./
mv ARC-AGI-master/data ./
rm -rf ARC-AGI-master
rm master.zip
```
Next, we must install the app's dependencies. To do so, we can use `poetry`:

```sh
poetry shell
poetry install
```
Finally, to run the Streamlit app:

```sh
export OPENAI_API_KEY=<FILL-IN> && streamlit run arc_finetuning_st/streamlit/app.py
```
## How To Use The App

In the next two sections, we discuss how to use the app in order to solve a given ARC task.

<p align="center">
  <img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/arc-task-solver-app.svg" alt="cover">
</p>
### Solving an ARC Task

Each ARC task consists of training examples, each of which consists of an input and output pair. A common pattern maps the inputs to the outputs, and the task is solved by uncovering this pattern, which can be verified against the included test examples. (Each task file has the JSON shape sketched below.)
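For reference, each task in the downloaded `data/` directory is a small JSON file with `train` and `test` lists of input/output grids, where grids are 2D arrays of integers 0-9. The task ID below is only illustrative:

```python
# Peek at one ARC task file. The task ID here is illustrative; any file
# under data/training/ or data/evaluation/ has the same shape.
import json

with open("data/training/0a938d79.json") as f:
    task = json.load(f)

for pair in task["train"]:  # the demonstration input/output pairs
    print(pair["input"], "->", pair["output"])

test_input = task["test"][0]["input"]  # the grid whose output must be predicted
```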
To solve the task, we cycle through the following three steps:

1. Prediction (of the test output grid)
2. Evaluation
3. Critique (human in the loop)

(Under the hood, a LlamaIndex `Workflow` implements these three `steps`.)
Step 1 uses an LLM to produce the Prediction, whereas Step 2 is deterministic: a simple comparison between the ground-truth test output and the Prediction. If the Prediction doesn't match the ground-truth grid, then Step 3 is performed. Similar to Step 1, an LLM is prompted to generate a Critique of the Prediction, explaining why it may not match the pattern underlying the train input and output pairs. However, we also allow a human in the loop to override this LLM-generated Critique.

The Critique is carried over from one cycle to the next in order to generate an improved, and hopefully correct, next Prediction.
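Step 2 in miniature (a toy illustration with made-up grids, not the app's code): the Evaluation is just an equality check on the grids.

```python
# Toy illustration of the deterministic Evaluation step: compare the
# predicted grid to the ground-truth test output, cell for cell.
prediction = [[0, 1], [1, 0]]
ground_truth = [[0, 1], [1, 1]]

matches = prediction == ground_truth  # False here, so Step 3 (Critique) runs
print(matches)
```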
To begin, click the `Start` button found in the top-right corner. If the prediction is incorrect, you can view the Critique produced by the LLM in the designated text area. You can choose to use this Critique or supply your own by overwriting the text and applying the change. Once ready to produce the next prediction, hit the `Continue` button.
### Saving solutions for fine-tuning

Any collaboration session involving the LLM and human can be saved and used to fine-tune an LLM. In this app, we use OpenAI LLMs, so the fine-tuning examples adhere to the [OpenAI fine-tuning API](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset). Click the `fine-tuning example` button during a session to see the current example that can be used for fine-tuning.
<p align="center">
  <img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/finetuning-arc-example.svg" alt="cover">
</p>
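For reference, each saved example follows OpenAI's chat fine-tuning format: one JSON object with a `messages` list per line of the compiled `jsonl` file. The message contents and file path below are placeholders, not the app's actual prompts or paths:

```python
# Schematic of one OpenAI chat fine-tuning record. Message contents are
# invented placeholders; only the {"messages": [...]} structure is the
# format OpenAI's fine-tuning API expects.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are an ARC puzzle solver."},
        {"role": "user", "content": "Train pairs: ... Test input: ..."},
        {"role": "assistant", "content": "Predicted output grid: ..."},
    ]
}

with open("finetuning.jsonl", "a") as f:  # hypothetical output path
    f.write(json.dumps(record) + "\n")
```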
## Fine-tuning (with `arc-finetuning-cli`)

After you've created your fine-tuning examples (you'll need at least 10 of them), you can submit a job to OpenAI to fine-tune an LLM on them. To do so, we provide a convenient command-line tool powered by LlamaIndex packages such as `llama-index-finetuning`.
```sh
$ arc-finetuning-cli --help
arc finetuning cli tool.

options:
  -h, --help            show this help message and exit

commands:
  {evaluate,finetune,job-status}
    evaluate            Evaluation of ARC Task predictions with LLM and ARCTaskSolverWorkflow.
    finetune            Finetune OpenAI LLM on ARC Task Solver examples.
    job-status          Check the status of finetuning job.
```
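Under the hood, the `finetune` command is powered by `llama-index-finetuning`. A sketch of the kind of call it wraps (the data path here is illustrative, not the CLI's actual internals):

```python
# Sketch of driving llama-index-finetuning's OpenAIFinetuneEngine directly.
# The data path is illustrative; the CLI compiles your saved examples into
# such a .jsonl file for you.
from llama_index.finetuning import OpenAIFinetuneEngine

engine = OpenAIFinetuneEngine(
    base_model="gpt-4o-2024-08-06",
    data_path="finetuning.jsonl",
)
engine.finetune()  # submits the fine-tuning job to OpenAI
print(engine.get_current_job())  # inspect the submitted job's status
```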
### Submitting a fine-tuning job

To submit a fine-tuning job, use any of the following three `finetune` commands:

```sh
# submit a new finetune job using the specified llm
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06

# submit a new finetune job that continues from a previously finetuned model
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61

# submit a new finetune job that continues from the most recent finetuned model
arc-finetuning-cli finetune --continue-latest
```
The commands above take care of compiling all of the individual fine-tuning JSON examples (i.e., those stored in `finetuning_examples/`) into a single `jsonl` file, which is then passed to the OpenAI fine-tuning API.
### Checking the status of a fine-tuning job

After submitting a job, you can check its status using the below CLI commands:

```sh
arc-finetuning-cli job-status -j ftjob-WYySY3iGYpfiTbSDeKDZO0YL -m gpt-4o-2024-08-06

# or check the status of the latest job submission
arc-finetuning-cli job-status --latest
```
## Evaluation

You can evaluate the `ARCTaskSolverWorkflow` and a specified LLM on the ARC test dataset. You can even supply a fine-tuned LLM here.

```sh
# evaluate ARCTaskSolverWorkflow single attempt with gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06

# evaluate ARCTaskSolverWorkflow single attempt with a previously fine-tuned gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61
```
You can also specify certain parameters to control the speed of execution so as to not run into `RateLimitError`s from OpenAI.

```sh
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --batch-size 5 --num-workers 3 --sleep 10
```

In the above command, `batch-size` refers to the number of test cases handled in a single batch (there are 400 test cases in total), `num-workers` is the maximum number of concurrent async calls made to the OpenAI API at any given moment, and `sleep` is the amount of time in seconds that execution halts before moving on to the next batch of test cases.
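The sketch below (illustrative only, not the CLI's actual implementation) shows the throttling scheme these flags describe: process cases in batches, cap in-flight calls with a semaphore, and pause between batches:

```python
# Illustrative throttling scheme matching the evaluate flags: process cases
# in batches (--batch-size), cap concurrent calls with a semaphore
# (--num-workers), and sleep between batches (--sleep).
import asyncio

BATCH_SIZE, NUM_WORKERS, SLEEP_SECS = 5, 3, 10


async def solve_case(case: int, sem: asyncio.Semaphore) -> int:
    async with sem:  # at most NUM_WORKERS calls in flight at once
        await asyncio.sleep(0.01)  # stand-in for the real OpenAI call
        return case


async def evaluate(cases: list[int]) -> list[int]:
    sem = asyncio.Semaphore(NUM_WORKERS)
    results: list[int] = []
    for i in range(0, len(cases), BATCH_SIZE):
        batch = cases[i : i + BATCH_SIZE]
        results += await asyncio.gather(*(solve_case(c, sem) for c in batch))
        await asyncio.sleep(SLEEP_SECS)  # back off to avoid RateLimitError
    return results


print(len(asyncio.run(evaluate(list(range(20))))))
```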