Merge pull request #13 from nerdai/nerdai/arc-finetuning-cookbook-llamaindex

Add updated llama-index cookbook
sfc-gh-cnantasenamat authored Sep 24, 2024
2 parents 7a9a038 + c3d75e3 commit 9882dc4
Showing 42 changed files with 4,070 additions and 3,343 deletions.
3 changes: 3 additions & 0 deletions recipes/llamaindex/.gitignore
@@ -11,3 +11,6 @@ pyproject.local.toml
__pycache__
data
notebooks
.DS_Store
finetuning_examples
finetuning_assets
48 changes: 0 additions & 48 deletions recipes/llamaindex/Dockerfile

This file was deleted.

197 changes: 152 additions & 45 deletions recipes/llamaindex/README.md
@@ -1,72 +1,179 @@
# ARC Task (LLM) Solver With Human Input

The Abstraction and Reasoning Corpus ([ARC](https://github.com/fchollet/ARC-AGI)) for Artificial General Intelligence
benchmark aims to measure an AI system's ability to efficiently learn new skills.
Each task within the ARC benchmark contains a unique puzzle that systems attempt
to solve. Currently, the best AI systems achieve a 34% solve rate, whereas humans
achieve 85% ([source](https://www.kaggle.com/competitions/arc-prize-2024/overview/prizes)).

<p align="center">
<img height="300" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/arc-task.svg" alt="cover">
</p>

Motivated by this large disparity, we built this app with the goal of injecting
human-level reasoning into LLMs on this benchmark. Specifically, the app enables
LLMs and humans to collaborate on solving an ARC task, and these collaborations
can then be used to fine-tune the LLM.

The Solver itself is a LlamaIndex `Workflow` whose `Context` is maintained
across successive runs. This makes for an effective implementation of the
human-in-the-loop pattern.

<p align="center">
<img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/human-in-loop-2.excalidraw.svg" alt="cover">
</p>
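
Below is a minimal sketch of this pattern, assuming the `llama-index` workflows
API available around this release; the class, step, and context key names are
illustrative, not the app's actual code. The key idea is passing the same
`Context` back into `run()` so a critique from one attempt informs the next:

```python
import asyncio

from llama_index.core.workflow import (
    Context,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)


class SolverSketchWorkflow(Workflow):
    @step
    async def predict(self, ctx: Context, ev: StartEvent) -> StopEvent:
        # Read the critique stored by a previous run (possibly edited
        # by a human) and fold it into the next prediction.
        critique = await ctx.get("critique", default=None)
        prediction = f"prediction informed by: {critique}"
        await ctx.set("critique", "LLM- or human-authored critique")
        return StopEvent(result=prediction)


async def main() -> None:
    workflow = SolverSketchWorkflow()
    handler = workflow.run()  # first attempt
    print(await handler)
    # Re-run with the same Context so the critique carries over.
    handler = workflow.run(ctx=handler.ctx)
    print(await handler)


if __name__ == "__main__":
    asyncio.run(main())
```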

## Running The App

Before running the Streamlit app, we must first download the ARC dataset. The
command below downloads the dataset and stores it in a directory named `data/`:

```sh
wget https://github.com/fchollet/ARC-AGI/archive/refs/heads/master.zip -O ./master.zip
unzip ./master.zip -d ./
mv ARC-AGI-master/data ./
rm -rf ARC-AGI-master
rm master.zip
```

Next, we must install the app's dependencies. To do so, we can use `poetry`:

```sh
poetry shell
poetry install
```

Finally, to run the Streamlit app:

```sh
export OPENAI_API_KEY=<FILL-IN> && streamlit run arc_finetuning_st/streamlit/app.py
```

## How To Use The App

In the next two sections, we discuss how to use the app to solve a given ARC
task.

<p align="center">
<img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/arc-task-solver-app.svg" alt="cover">
</p>

### Solving an ARC Task

Each ARC task consists of training examples, each of which consists of an input
and an output grid. A common pattern maps each input to its output; the task is
solved by uncovering this pattern, and a candidate solution can be verified
against the included test examples.
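
Each task is stored as a JSON file in the downloaded dataset. The snippet below
is a minimal illustration of that structure, with toy grid values rather than a
real task:

```json
{
  "train": [
    { "input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]] }
  ],
  "test": [
    { "input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]] }
  ]
}
```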

To solve the task, we cycle through the following three steps:

1. Prediction (of test output grid)
2. Evaluation
3. Critique (human in the loop)

(Under the hood, a LlamaIndex `Workflow` implements each of these as a `step`.)

Step 1 uses an LLM to produce the Prediction, whereas Step 2 is deterministic:
a simple comparison between the ground-truth test output and the Prediction. If
the Prediction doesn't match the ground-truth grid, then Step 3 is performed.
As in Step 1, an LLM is prompted to generate a Critique of the Prediction,
explaining why it may not match the pattern underlying the train input and
output pairs. However, we also allow a human in the loop to override this
LLM-generated Critique.

The Critique is carried over from one cycle to the next in order to generate an
improved, and hopefully correct, next Prediction.
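
Step 2 amounts to an exact, cell-by-cell grid comparison. A minimal sketch of
what such a deterministic check looks like (the function name is illustrative,
not the app's API):

```python
# Deterministic evaluation (Step 2): an exact comparison between the
# predicted grid and the ground-truth test output. Illustrative only.
Grid = list[list[int]]


def evaluate(prediction: Grid, ground_truth: Grid) -> bool:
    """Return True only if every cell of the prediction matches."""
    return prediction == ground_truth
```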

To begin, click the `Start` button found in the top-right corner. If the
prediction is incorrect, you can view the Critique produced by the LLM in the
designated text area. You can choose to use this Critique or supply your own by
overwriting the text and applying the change. Once ready to produce the next
prediction, hit the `Continue` button.

### Saving solutions for fine-tuning

Any collaboration session involving the LLM and human can be saved and used to
fine-tune an LLM. In this app, we use OpenAI LLMs, so the fine-tuning examples
adhere to the [OpenAI fine-tuning API](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset).
Click the `fine-tuning example` button during a session to see the current
example that can be used for fine-tuning.

<p align="center">
<img height="500" src="https://d3ddy8balm3goa.cloudfront.net/arc-task-solver-st-demo/finetuning-arc-example.svg" alt="cover">
</p>
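
Each saved example follows OpenAI's chat fine-tuning format: one JSON object
per line of a `jsonl` file, with `messages` in the chat-completions shape. The
message contents below are illustrative placeholders, not a real saved example:

```json
{"messages": [{"role": "system", "content": "You are an ARC task solver."}, {"role": "user", "content": "<task examples and critique>"}, {"role": "assistant", "content": "<predicted output grid>"}]}
```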

## Fine-tuning (with `arc-finetuning-cli`)

After you've created your fine-tuning examples (you'll need at least 10 of them),
you can submit a job to OpenAI to fine-tune an LLM on them. To do so, we provide
a convenient command-line tool powered by LlamaIndex packages such as
`llama-index-finetuning`.

```sh
arc finetuning cli tool.

options:
-h, --help show this help message and exit

commands:
{evaluate,finetune,job-status}
evaluate Evaluation of ARC Task predictions with LLM and ARCTaskSolverWorkflow.
finetune Finetune OpenAI LLM on ARC Task Solver examples.
job-status Check the status of finetuning job.
```

### Submitting a fine-tuning job

To submit a fine-tuning job, use any of the following three `finetune` commands:

```sh
# submit a new finetune job using the specified llm
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06

# submit a new finetune job that continues from previously finetuned model
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61

# submit a new finetune job that continues from the most recent finetuned model
arc-finetuning-cli finetune --continue-latest
```

The commands above take care of compiling all of the individual fine-tuning
JSON examples (i.e., those stored in `finetuning_examples/`) into a single
`jsonl` file that is then passed to the OpenAI fine-tuning API.

### Checking the status of a fine-tuning job

After submitting a job, you can check its status using the below CLI commands:

```sh
arc-finetuning-cli job-status -j ftjob-WYySY3iGYpfiTbSDeKDZO0YL -m gpt-4o-2024-08-06

# or check status of the latest job submission
arc-finetuning-cli job-status --latest
```

## Evaluation

You can evaluate the `ARCTaskSolverWorkflow` and a specified LLM on the ARC test
dataset. You can even supply a fine-tuned LLM here.

```sh
# evaluate ARCTaskSolverWorkflow single attempt with gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06

# evaluate ARCTaskSolverWorkflow single attempt with a previously fine-tuned gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61
```

You can also specify certain parameters to control the speed of execution so as
to not run into `RateLimitError`s from OpenAI.

```sh
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --batch-size 5 --num-workers 3 --sleep 10
```

In the above command, `batch-size` refers to the number of test cases handled in
a single batch. In total, there are 400 test cases. Moreover, `num-workers` is
the maximum number of concurrent async calls allowed to be made to the OpenAI
API at any given moment. Finally, `sleep` is the amount of time in seconds that
execution halts before moving on to the next batch of test cases.
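
For example, with `--batch-size 5` the 400 test cases are processed in 80
batches, so `--sleep 10` adds roughly 80 × 10 = 800 seconds of idle time over
the full evaluation run.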
1 change: 0 additions & 1 deletion recipes/llamaindex/arc-finetuning/placeholder.md

This file was deleted.

