# Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti, Tommaso Caselli, Malvina Nissim, Arianna Bisazza

*Figures: an example of a verbalized rebus, and example generations from LLMs.*

Abstract: Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.

This repository contains scripts and notebooks associated with the paper "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses". If you use any of the following contents for your work, we kindly ask you to cite our paper:

```bibtex
@article{sarti-etal-2024-rebus,
    title = "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses",
    author = "Sarti, Gabriele and Caselli, Tommaso and Nissim, Malvina and Bisazza, Arianna",
    journal = "ArXiv",
    month = jul,
    year = "2024",
    volume = {abs/2408.00584},
    url = {https://arxiv.org/abs/2408.00584},
}
```

All models and data used in this work are available in our 🤗 Hub Collection.
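
As a quick way to get started, the snippet below sketches how the dataset could be loaded with the 🤗 `datasets` library. The repository ID and split name are assumptions; refer to the Hub collection for the exact identifiers.

```python
# Minimal sketch: load the verbalized rebus data from the 🤗 Hub.
# NOTE: the repository ID and split name below are assumptions; check the
# Hub collection linked above for the exact identifiers.
from datasets import load_dataset

dataset = load_dataset("gsarti/eureka-rebus", split="train")  # assumed repo ID
print(dataset[0])  # each example pairs a verbalized rebus with its step-by-step solution
```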

## Try it yourself! 🧩

We provide a simple online demo to test the rebus-solving capabilities of our model. You can access it here.

## Installation

To install the required dependencies, you can use the following command:

```shell
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

To run the training and inference notebooks, you will need a machine with access to a GPU (required by Unsloth). The environment setup is performed in the first cell of the training notebooks.

## Model performances

| Model | Setup | Crossword Definitions | First Pass EM | Solution EM |
|---|---|---|---|---|
| LLaMA-3 70B | 5-shot prompt | 0.22 | 0.04 | 0.00 |
| Qwen-2 72B | 5-shot prompt | 0.28 | 0.04 | 0.00 |
| GPT-4o | 5-shot prompt | 0.55 | 0.15 | 0.11 |
| Claude-3.5 Sonnet | 5-shot prompt | 0.66 | 0.28 | 0.24 |
| Gemma-2 2B (ours) | fine-tuned | 0.78 | 0.43 | 0.36 |
| Phi-3 3.8B (ours) | fine-tuned | 0.84 | 0.56 | 0.51 |
| LLaMA-3.1 8B (ours) | fine-tuned | **0.85** | **0.59** | **0.56** |

*Fine-grained verbalized rebus solving performance of various LLMs. Bold denotes the best overall performance. See the paper for more details.*
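
For reference, the sketch below shows one way exact-match scores of this kind could be computed from a CSV of parsed predictions. The column names and normalization are illustrative assumptions and do not necessarily match the logic in `scripts/evaluate.py`.

```python
# Illustrative exact-match computation over parsed predictions.
# NOTE: the CSV layout and column names are assumptions, not the repo's actual schema.
import pandas as pd

def exact_match(pred: pd.Series, gold: pd.Series) -> float:
    """Fraction of rows where prediction and gold match after simple normalization."""
    norm = lambda s: s.fillna("").astype(str).str.strip().str.lower()
    return float((norm(pred) == norm(gold)).mean())

df = pd.read_csv("outputs/phi3_mini/phi3_mini_results_step_500.csv")  # example path from this repo
print("First pass EM:", exact_match(df["first_pass_pred"], df["first_pass_gold"]))
print("Solution EM:  ", exact_match(df["solution_pred"], df["solution_gold"]))
```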

## Reproducing the paper results

### Data preprocessing

⚠️ Refer to the EurekaRebus dataset card for more information on the dataset and the data licensing.

Run the following command to produce all data contained in the `eureka-rebus` folder from the `rebus.csv` file:

```shell
python scripts/process_data.py \
  --print_stats \
  --infer_punctuation \
  --generate_filtered_rebuses \
  --create_train_test_sets \
  --save_word_frequencies_train \
  --save_sharegpt_files
```
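
The `--save_sharegpt_files` flag exports the training conversations in the ShareGPT format commonly used for supervised fine-tuning. As a rough illustration (the field contents below are placeholders, not real EurekaRebus prompts or solutions), a record looks like this:

```python
# Illustrative ShareGPT-style record; field contents are placeholders,
# not actual EurekaRebus prompts or solutions.
example = {
    "conversations": [
        {"from": "human", "value": "<verbalized rebus prompt with crossword-style clues>"},
        {"from": "gpt", "value": "<step-by-step first pass and final solution>"},
    ]
}
```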

### Fine-tuning LLMs on EurekaRebus

Follow the notebooks `train_phi3_mini` and `train_llama3.1_8b` to fine-tune the models on the EurekaRebus dataset. Both models were trained on a single RTX 3090 GPU with 24GB of VRAM.
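
For orientation, the sketch below shows the general shape of an Unsloth + TRL fine-tuning setup like the one used in the notebooks. The base model name, dataset ID, LoRA settings, and training hyperparameters are placeholder assumptions; the notebooks contain the actual values.

```python
# Rough sketch of an Unsloth + TRL supervised fine-tuning setup.
# NOTE: the base model name, dataset ID, LoRA settings, and hyperparameters
# below are placeholder assumptions; see the training notebooks for actual values.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3-mini-4k-instruct-bnb-4bit",  # assumed base checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

train_dataset = load_dataset("gsarti/eureka-rebus", split="train")  # assumed repo ID

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",  # assumes a preformatted text column
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=500,
        learning_rate=2e-4,
        output_dir="outputs/phi3_mini",
    ),
)
trainer.train()
```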

### Generating and Parsing Rebus Solutions with Fine-tuned Models

To generate solutions for the rebuses in the test set, follow the instructions in the inference notebook. The `outputs` folder already contains parsed gold solutions and model predictions across saved training checkpoints for Phi-3 Mini and LLaMA-3.1 8B.
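
If you just want to try a fine-tuned checkpoint directly, a minimal generation call with 🤗 `transformers` might look like the sketch below. The model ID and prompt are assumptions; the inference notebook and the dataset card document the exact checkpoint names and prompt format.

```python
# Minimal generation sketch with a fine-tuned checkpoint.
# NOTE: the model ID and prompt below are assumptions; see the inference
# notebook and the dataset card for the exact checkpoint names and prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gsarti/phi3-mini-rebus-solver"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "<verbalized rebus prompt>"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=500, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```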

### Generating Rebus Solutions with Prompted LLMs

Use the `scripts/prompt_models.py` script to generate solutions for the rebuses in the test set using prompted LLMs. The script requires the `guidance` library to be installed. The following command generates solutions for the test set using the Claude 3.5 Sonnet model:

```shell
python scripts/prompt_models.py \
  --model claude \
  --api_key YOUR_ANTHROPIC_API_KEY
```

Resulting files have the same format as the ones produced by the inference notebook, and are saved in the outputs/prompted_models folder.

⚠️ Due to an incompatibility in the `guidance` library, we used `guidance==0.1.14` for TogetherAI models (LLaMA, Qwen) and `guidance==0.1.15` for proprietary models (Claude, GPT-4o). We do not guarantee that other versions will work as expected.
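
For reference, a bare-bones 5-shot call without `guidance` might look like the sketch below, using the plain `anthropic` SDK rather than the repo's `prompt_models.py` script. The model identifier and prompt contents are assumptions.

```python
# Bare-bones few-shot prompting sketch using the plain anthropic SDK
# (NOT the repo's guidance-based prompt_models.py script).
# NOTE: the model identifier and prompt contents are assumptions.
import anthropic

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
few_shot_prompt = "<5 solved example rebuses>\n\n<new verbalized rebus to solve>"
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model version
    max_tokens=500,
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.content[0].text)
```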

### Evaluating Results

Given a file containing the parsed predictions for a model, produced with the inference notebook, you can compute all metrics from the paper using the following command:

```shell
python scripts/evaluate.py \
  --predicted_outputs outputs/phi3_mini/phi3_mini_results_step_500.csv \
  --gold_outputs outputs/test_gold_id_ood.csv \
  --word_frequencies outputs/word_frequencies_paisa.json \
  --word_frequencies_fp_train eureka-rebus/word_frequencies_fp_train.json \
  --word_frequencies_solution_train eureka-rebus/word_frequencies_solution_train.json \
  --do_corrs
```

If you use the `--do_corrs` option in the command above, you should first unzip the `word_frequencies_paisa.json.zip` file in the `outputs` folder.

## Acknowledgments

We would like to thank the Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini" for their valuable contribution in keeping the Eureka5 collection up to date and openly accessible.