The Abstraction and Reasoning Corpus (ARC) for Artificial General Intelligence benchmark aims to measure an AI system's ability to efficiently learn new skills. Each task within the ARC benchmark contains a unique puzzle that the system attempts to solve. Currently, the best AI systems achieve a 34% solve rate, whereas humans achieve 85% (source).
Motivated by this large disparity, we built this app with the goal of injecting human-level reasoning into LLMs on this benchmark. Specifically, the app enables LLMs and humans to collaborate on solving an ARC task, and these collaborations can then be used to fine-tune the LLM.
The Solver itself is a LlamaIndex Workflow that relies on successive runs, where the Context from previous runs is maintained. Doing so allows for an effective implementation of the Human In the Loop Pattern.
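As a rough illustration of this pattern, the minimal sketch below shows how state stored in a run's Context can be handed to the next run. The workflow and key names are illustrative, not the app's actual ARCTaskSolverWorkflow, and the snippet assumes a recent version of llama-index-core.

```python
# A minimal sketch (illustrative names, not the app's actual workflow) showing how
# state kept in a LlamaIndex Workflow Context can persist across successive runs.
import asyncio

from llama_index.core.workflow import Context, StartEvent, StopEvent, Workflow, step


class CounterWorkflow(Workflow):
    @step
    async def count(self, ctx: Context, ev: StartEvent) -> StopEvent:
        # read state left behind by any previous run of this workflow
        runs_so_far = await ctx.get("runs_so_far", default=0)
        await ctx.set("runs_so_far", runs_so_far + 1)
        return StopEvent(result=runs_so_far + 1)


async def main() -> None:
    workflow = CounterWorkflow()
    handler = workflow.run()
    print(await handler)  # 1

    # pass the previous run's Context back in so its state is retained
    handler = workflow.run(ctx=handler.ctx)
    print(await handler)  # 2


asyncio.run(main())
```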
Note
All of the task JSON files used for this application (stored in data/) have been pulled directly from the ARC-AGI GitHub repository (see here).
We must first install the app's dependencies. To do so, we can use poetry:
```
poetry shell
poetry install
```
Finally, to run the streamlit app:
```
streamlit run arc_finetuning_st/streamlit/app.py
```
This app uses OpenAI LLMs. As such, you will need to provide a valid API key to execute the solver. You can pass your API key in the designated text input within the sidebar of the application.
Next, we discuss how to use the app in order to solve a given ARC task.
Each ARC task consists of training examples, each of which consists of an input and output pair. A common pattern relates these inputs and outputs, and the task is solved by uncovering this pattern, which can be verified against the included test examples.
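For concreteness, each task JSON in data/ follows the ARC-AGI format of train and test lists of input/output grid pairs, roughly shaped like the sketch below (the grid values are abbreviated and purely illustrative):

```python
# Shape of an ARC task (abbreviated, illustrative values): grids are 2D lists of
# integers 0-9, and the pattern learned from the "train" pairs should map each
# "test" input to its output.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        # ... more training pairs
    ],
    "test": [
        {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]},
    ],
}
```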
To solve the task, we cycle through the following three steps:
- Prediction (of test output grid)
- Evaluation
- Critique (human in the loop)
Under the hood, a LlamaIndex Workflow implements these three steps.
Step 1 uses an LLM to produce the Prediction, whereas Step 2 is deterministic: it simply compares the Prediction against the ground-truth test output. If the Prediction doesn't match the ground-truth grid, then Step 3 is performed. As in Step 1, an LLM is prompted to generate a Critique of the Prediction, explaining why it may not match the pattern underlying the training input and output pairs. However, we also allow a human in the loop to override this LLM-generated Critique.
The Critique is carried over from one cycle to the next in order to generate an improved, and hopefully correct, next Prediction.
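Conceptually, the cycle looks roughly like the sketch below. All function names here are hypothetical stand-ins used for illustration; the actual logic lives in the ARCTaskSolverWorkflow.

```python
# Hedged sketch of the Prediction / Evaluation / Critique cycle. The callables
# (predict_with_llm, critique_with_llm, ask_human_for_critique) are illustrative
# stand-ins, not the app's real API.
from typing import Callable, List


def solve_task(
    train_pairs: List[dict],
    test_input: List[List[int]],
    ground_truth: List[List[int]],
    predict_with_llm: Callable[..., List[List[int]]],
    critique_with_llm: Callable[..., str],
    ask_human_for_critique: Callable[[str], str],
    max_cycles: int = 5,
) -> List[List[int]]:
    critique = ""
    for _ in range(max_cycles):
        # Step 1: LLM produces a Prediction, conditioned on any prior Critique
        prediction = predict_with_llm(train_pairs, test_input, critique)

        # Step 2: deterministic Evaluation against the ground-truth grid
        if prediction == ground_truth:
            return prediction

        # Step 3: LLM drafts a Critique; the human may override it
        llm_critique = critique_with_llm(train_pairs, test_input, prediction)
        critique = ask_human_for_critique(llm_critique) or llm_critique

    return prediction
```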
To begin, click the Start
button found in the top-right corner. If the
prediction is incorrect, you can view the Critique produced by the LLM in the
designated text area. You can choose to use this Critique or supply your own by
overwriting the text and applying the change. Once ready to produce the next
prediction, hit the Continue
button.
Any collaboration session involving the LLM and human can be saved and used to
finetune an LLM. In this app, we use OpenAI LLMs, and so the finetuning examples
adhere to the OpenAI fine-tuning API.
Click the fine-tuning example
button during a session to see the current
example that can be used for fine-tuning.
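For reference, each saved example is a chat-format record of the kind OpenAI's fine-tuning API expects, stored as one JSON object per line of a .jsonl file. The sketch below uses placeholder content only; the actual prompts are produced by the app.

```python
# Rough shape of a single chat-format fine-tuning example (contents elided and
# hypothetical); OpenAI's fine-tuning API expects one such JSON object per line
# of the uploaded .jsonl file.
import json

example = {
    "messages": [
        {"role": "system", "content": "<solver system prompt>"},
        {"role": "user", "content": "<task grids, prior prediction, critique>"},
        {"role": "assistant", "content": "<improved prediction>"},
    ]
}

with open("finetuning_examples.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```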
After you've created your finetuning examples (you'll need at least 10 of them), you can submit a job to OpenAI to finetune an LLM on them. To do so, we have a convenient command line tool that is powered by LlamaIndex plugins such as llama-index-finetuning.
```
arc finetuning cli tool.

options:
  -h, --help            show this help message and exit

commands:
  {evaluate,finetune,job-status}
    evaluate            Evaluation of ARC Task predictions with LLM and ARCTaskSolverWorkflow.
    finetune            Finetune OpenAI LLM on ARC Task Solver examples.
    job-status          Check the status of finetuning job.
```
Before using the CLI, you must set the environment variable OPENAI_API_KEY to a valid API key.
```
export OPENAI_API_KEY=<fill-in>
```
To submit a fine-tuning job, use any of the following three finetune commands:
```
# submit a new finetune job using the specified llm
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06

# submit a new finetune job that continues from a previously finetuned model
arc-finetuning-cli finetune --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61

# submit a new finetune job that continues from the most recent finetuned model
arc-finetuning-cli finetune --continue-latest
```
The commands above will take care of compiling all of the individual fine-tuning JSON examples (stored in finetuning_examples/) into a single jsonl file that is then passed to the OpenAI fine-tuning API.
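If you prefer to perform this step by hand rather than through the CLI, a rough sketch using llama-index-finetuning's OpenAIFinetuneEngine might look like the following. The jsonl file name is an assumption, and the CLI may wire things up differently internally.

```python
# A hedged sketch of submitting a compiled .jsonl to OpenAI via
# llama-index-finetuning; the file name below is an assumption.
from llama_index.finetuning import OpenAIFinetuneEngine

engine = OpenAIFinetuneEngine(
    base_model="gpt-4o-2024-08-06",
    data_path="finetuning_examples.jsonl",  # hypothetical compiled file
)
engine.finetune()                # uploads the file and submits the job
print(engine.get_current_job())  # inspect the submitted job
```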
After submitting a job, you can check its status using the CLI commands below:
```
arc-finetuning-cli job-status -j ftjob-WYySY3iGYpfiTbSDeKDZO0YL -m gpt-4o-2024-08-06

# or check status of the latest job submission
arc-finetuning-cli job-status --latest
```
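For reference, a status check of this kind roughly boils down to retrieving the job through the OpenAI Python SDK. The sketch below is not the CLI's actual implementation, and the job id is a placeholder.

```python
# Hedged sketch of what a status check amounts to; the job id is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
job = client.fine_tuning.jobs.retrieve("ftjob-<your-job-id>")
print(job.status, job.fine_tuned_model)
```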
You can evaluate the ARCTaskSolverWorkflow
and a specified LLM on the ARC test
dataset. You can even supply a fine-tuned LLM here.
```
# evaluate ARCTaskSolverWorkflow single attempt with gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06

# evaluate ARCTaskSolverWorkflow single attempt with a previously fine-tuned gpt-4o
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --start-job-id ftjob-TqJd5Nfe3GIiScyTTJH56l61
```
You can also specify certain parameters to control the speed of execution so as to avoid RateLimitError exceptions from OpenAI.
```
arc-finetuning-cli evaluate --llm gpt-4o-2024-08-06 --batch-size 5 --num-workers 3 --sleep 10
```
In the above command, batch-size refers to the number of test cases handled in a single batch (in total, there are 400 test cases). Moreover, num-workers is the maximum number of async calls allowed to be made to the OpenAI API at any given moment. Finally, sleep is the amount of time in seconds that execution halts before moving on to the next batch of test cases.
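To make the interplay of these parameters concrete, here is a hedged sketch of the general pattern; it is illustrative only and not the evaluate command's actual implementation.

```python
# Illustrative sketch of how batch-size, num-workers, and sleep typically interact.
import asyncio


async def run_one_case(case: dict, semaphore: asyncio.Semaphore) -> bool:
    async with semaphore:  # num-workers: cap on concurrent OpenAI calls
        ...  # placeholder: run the solver workflow on this test case
        return False  # placeholder result (True would mean "solved")


async def evaluate(cases: list, batch_size: int, num_workers: int, sleep: float) -> list:
    semaphore = asyncio.Semaphore(num_workers)
    results: list = []
    for i in range(0, len(cases), batch_size):  # batch-size: cases per batch
        batch = cases[i : i + batch_size]
        results += await asyncio.gather(*(run_one_case(c, semaphore) for c in batch))
        await asyncio.sleep(sleep)  # sleep: pause between batches
    return results
```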