This codebase contains the scripts used to collect the prompts in berkeley-nest/Nectar, as well as to generate the seven-wise comparisons. In addition, we include data from ablations and the corresponding Jupyter notebooks for visualizing the experimental data.
The code has been verified on Python 3.10, but other versions of Python are likely compatible.
Run the command below to install the required packages:

```
pip install -r requirements.txt
```
Code for collecting prompts is found in the prompts directory. Public datasets are compiled into a single dataset via create_data.ipynb, which also applies some basic duplicate-detection heuristics.
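For illustration, here is a minimal sketch of one common duplicate-detection heuristic: normalize each prompt, hash it, and keep only the first occurrence. This is an assumption about the general approach, not the exact logic in create_data.ipynb.

```python
import hashlib

def normalize(prompt: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(prompt.lower().split())

def dedupe(prompts: list[str]) -> list[str]:
    # Keep the first occurrence of each normalized prompt.
    seen, unique = set(), []
    for p in prompts:
        h = hashlib.sha256(normalize(p).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(p)
    return unique
```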
The prompts/visualizations directory contains the Jupyter notebook used to generate figures for the paper.
Code for distillation is found in the distillation directory. Distill.py contains the script used to run inference on various models in parallel.
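As a rough sketch of the parallel-inference pattern (the actual Distill.py may differ), the hypothetical query_model helper below stands in for a real model API call:

```python
from concurrent.futures import ThreadPoolExecutor

def query_model(model_name: str, prompt: str) -> str:
    # Hypothetical stand-in for an API call to the given model.
    raise NotImplementedError

def distill(models: list[str], prompt: str) -> dict[str, str]:
    # Fan the same prompt out to every model concurrently and
    # collect one completion per model.
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {m: pool.submit(query_model, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}
```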
Code for generating the seven-wise ratings is found in the rating directory. Specifically, rate.py contains the final code used to generate the Nectar ratings for all 180k rows.
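One step such a rating script needs is extracting a K-wise ranking from the judge model's free-form response. The sketch below is a hypothetical parser, assuming the judge ends its answer with a bracketed list of answer indices; it is not the actual rate.py logic:

```python
import re

def parse_ranking(judge_output: str, k: int = 7) -> list[int] | None:
    # Hypothetical parser: expects the judge response to contain a ranked
    # list of answer indices, e.g. "Ranking: [3, 1, 7, 2, 5, 4, 6]".
    match = re.search(r"\[([\d,\s]+)\]", judge_output)
    if match is None:
        return None
    ranking = [int(x) for x in re.findall(r"\d+", match.group(1))]
    # Reject malformed outputs that are not a permutation of 1..k.
    return ranking if sorted(ranking) == list(range(1, k + 1)) else None
```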
The rating/visualizations directory contains various Jupyter notebooks for generating visualizations. The data for these visualizations can be found in the rating/results directory. Each directory under rating/results contains a prompt_log.txt with the prompt used, a log_args.txt with the rating-script arguments used, and a rankings.jsonl with the rankings output by the experiment.
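Since rankings.jsonl is a JSON Lines file, it can be loaded one JSON object per line; the snippet below is a minimal example (the exact schema of each record depends on the experiment):

```python
import json

def load_rankings(path: str) -> list[dict]:
    # Each line of rankings.jsonl is one JSON object; the fields
    # vary by experiment, so records are returned as plain dicts.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# "<experiment>" is a placeholder for one of the result directory names.
rankings = load_rankings("rating/results/<experiment>/rankings.jsonl")
```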
The rating/experiments folder contains some extra scripts for running specific experiments. Their outputs are stored in the correspondingly named directories under rating/results.
- measure_k_position_bias.py: Measures how positional bias changes as K increases (see the sketch after this list).
- measure_k_to_pairwise.py: Measures how judgment agreement with pairwise ratings changes as K increases.
- rate_pointwise.py: Creates pointwise ratings instead of pairwise ratings.
- rate_verbose.py: Tests ratings with more explicit anti-verbosity prompting.
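As referenced above for measure_k_position_bias.py, one simple way to quantify positional bias is to check how often the judge's top-ranked answer falls at each presentation position: with shuffled inputs and an unbiased judge, each of the K positions should win about 1/K of the time. The function below is an illustrative sketch, not the script's actual metric:

```python
from collections import Counter

def position_bias(first_choice_positions: list[int], k: int) -> dict[int, float]:
    # Illustrative (hypothetical) bias measure: the fraction of judgments
    # whose top-ranked answer sat at each of the k presentation positions.
    # With an unbiased judge and shuffled inputs, every position should
    # receive roughly 1/k of the top picks.
    counts = Counter(first_choice_positions)
    n = len(first_choice_positions)
    return {pos: counts.get(pos, 0) / n for pos in range(1, k + 1)}
```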