Code repository for the paper: Supposedly Equivalent Facts That Aren't? Entity Frequency in Pre-training Induces Asymmetry in LLMs
- factprobe: main code for probing
- data_index: code for Dolma pre-training data indexing
- Download probing datasets from: https://zenodo.org/records/15092789
To set up the project, you'll need Poetry (a modern Python package manager). If you don't have Poetry installed, install it first:
curl -sSL https://install.python-poetry.org | python3 -
Then clone this repository and, from the repository root, install the dependencies:
poetry install
Make sure you have the necessary GPU drivers and libraries installed (e.g., CUDA).
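As a quick sanity check before probing, the sketch below tests whether an NVIDIA driver is visible on the machine. It is not part of this repository: the function name `nvidia_driver_visible` is hypothetical, and the check assumes an NVIDIA GPU whose driver ships the `nvidia-smi` tool. It uses only the Python standard library and says nothing about whether the project's own dependencies can use the GPU.

```python
import shutil
import subprocess


def nvidia_driver_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully.

    Hypothetical helper, not part of this repository. It only checks
    that the NVIDIA driver tooling responds, not that CUDA libraries
    are installed or usable from Python.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not installed or not on PATH.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # The tool exists but could not be run or did not respond.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("NVIDIA driver visible:", nvidia_driver_visible())
```

If this prints `False`, install or repair the GPU driver before running the probing script.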
To run the main probing script, use the following command:
poetry run python probe.py -c path/to/config.yaml
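To run several configurations in sequence, the command above can be wrapped in a small script. This is a minimal sketch, not part of the repository: the helper names `build_probe_command` and `run_all` and the `configs` directory in the usage comment are assumptions for illustration. Only the `poetry run python probe.py -c ...` invocation comes from this README, and the script assumes it is launched from the directory containing `probe.py`.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_probe_command(config_path: str | Path) -> list[str]:
    """Build the probing command from this README for one config file."""
    return ["poetry", "run", "python", "probe.py", "-c", str(config_path)]


def run_all(config_dir: str | Path) -> None:
    """Run the probing script once per *.yaml file in config_dir.

    Hypothetical helper: config_dir is any directory holding your
    config files. Stops at the first run that exits with an error.
    """
    for cfg in sorted(Path(config_dir).glob("*.yaml")):
        subprocess.run(build_probe_command(cfg), check=True)


# Example usage (assumes a hypothetical directory named "configs"):
#     run_all("configs")
```

Passing the command as a list avoids shell quoting problems when a config path contains spaces.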