Install the conda environment:

```bash
conda env create -f environment.yml
conda activate erllm
# avoid the CUDA build pulled in by sentence-transformers; pytorch is already installed by environment.yml
(erllm) pip install --no-deps sentence-transformers==2.2.2
(erllm) python erllm_setup.py
```
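If the setup succeeded, erllm should be importable from any directory. An optional sanity check (not part of the setup itself):

```bash
# optional: verify that erllm_setup.py made the package discoverable
(erllm) python -c "import erllm; print(erllm.__file__)"
```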
We recommend erllm.discarding_selective_matcher over erllm.discarding_matcher and erllm.selective_matcher. The latter were coded first and are subsumed by erllm.discarding_selective_matcher.
Module | Purpose |
---|---|
erllm | Root package. Contains installation, documentation generation and helper code. |
erllm.calibration | Calibration analysis on entity matching LLM predictions. |
erllm.dataset | Covers entity representation, dataset loading and downsampling. |
erllm.dataset.dbpedia | Handles DBPedia data, including loading raw data into SQLite, interacting with the database, and generating labeled datasets using token blocking for benchmarking. |
erllm.dataset.ditto | Convert existing datasets to DITTO format. |
erllm.discarder | Explores the similarity-based discarder in isolation. Computes and saves set-based and embedding-based similarities for pairs of entities. Includes functionality to save results and computation time into similarity files, compute various discarder statistics, and generate visualizations. |
erllm.discarding_matcher | Simulates and evaluates the similarity-based discarding matcher. Contains generation of performance plots and analysis of the time/performance trade-off. |
erllm.discarding_selective_matcher | Implements the discarding selective matcher. It includes functionalities for assessing classification performance, generating comparison tables, and creating contour plots. |
erllm.ditto | Support for configuring DITTO to run on the DITTO datasets and subsequent evaluation and comparison to selective matcher. |
erllm.llm_matcher | Contains code to create prompts from datasets and get responses via OpenAI's API. These are saved into run files which serve as a cache for all composite matchers. Also contains code to run and evaluate the LLM matcher. |
erllm.selective_classifier | Supports running selective classification on various datasets, evaluating the performance over ranges of threshold/coverage parameters, and generating tables and plots to visualize the classification performance. |
erllm.selective_matcher | Implements and evaluates the selective matcher and random labeling. Supports running both methods across parameter ranges and datasets and generating comparison tables. |
erllm.serialization_cmp | Compares entity serialization schemes, evaluating their performance with and without attribute names. Also evaluates the impact of data errors. |
Module | Purpose |
---|---|
erllm_setup.py | Add a .pth file to the site-packages directory of the current Python interpreter to make erllm discoverable. |
gen_docs.py | Generate a package overview table and a table for each package's subfiles in a markdown file. |
utils.py | Utility functions for various tasks including file operations, mathematical calculations, and data manipulation. |
Module | Purpose |
---|---|
calibration_plots.py | Performs calibration analysis on language model predictions for different datasets, calculating the Brier score and Expected Calibration Error (ECE). |
confidence_hist.py | Generate histograms of confidence scores per outcome (TP, TN, FP, FN). |
reliability_diagrams.py | Third party code from https://github.com/hollance/reliability-diagrams with some small changes. Calibration computation and visualization using reliability diagrams. |
Module | Purpose |
---|---|
entity.py | Contains Entity and OrderedEntity classes to represent entities and serialize them into strings for use in prompts. |
load_ds.py | Provides functions for loading benchmark data from CSV files into pandas DataFrames or lists of tuples representing entity pairs. |
sample_ds.py | Provides a function for sampling elements from a dataset while preserving the label ratio. |
stats_ds.py | Provides functions to compute dataset statistics such as the number of pairs. |
Module | Purpose |
---|---|
access_dbpedia.py | Access the DBPedia SQLite database after it has been created by load_dbpedia.py. |
load_dbpedia.py | Loads data from .txt files into SQLite tables. The primary tables store DBpedia entities with key-value pairs, and an additional table stores matching pairs. |
sample_dbpedia.py | Provides functions for generating a sample dataset of entity pairs from the DBPedia database. The dataset includes both matching and non-matching pairs of entities. The matching pairs are generated based on known matches, while non-matching pairs are generated by token blocking on random entities. |
token_blocking.py | Provides functions for token blocking and clean token blocking in entity resolution tasks. |
Module | Purpose |
---|---|
to_ditto.py | Provides functions for converting labeled pairs of entities to Ditto format and splitting them into train, validation, and test sets. |
to_ditto_runner.py | Generates Ditto datasets from existing datasets. |
Module | Purpose |
---|---|
discarder.py | Provides functions for computing set-based and embedding-based similarities for pairs of entities within a given dataset. The set-based similarities include Jaccard, Overlap, Monge-Elkan, and Generalized Jaccard, while the embedding-based similarities use cosine and Euclidean distance metrics. Saves the results and computation time into similarity files, which serve as a cache for composite matchers that include a discarder. |
discarder_eval.py | Computes various statistics from similarity files, such as the number of false negatives as a function of the number of discarded pairs. |
discarder_vis.py | Generates plots to visualize discarder evaluation statistics. Includes functions to plot specific relations for a given dataset and to generate combined plots for multiple datasets, offering insights into metrics such as false negatives, risk, false negative rate, and coverage. |
Module | Purpose |
---|---|
discarding_matcher.py | Provides functions for evaluating the performance of a discarding matcher utilizing run and similarity files. It calculates classification, cost, and duration metrics. |
discarding_matcher_duration_cmp.py | Calculates the speedup factor of the discarding matcher over the LLM matcher. |
discarding_matcher_runner.py | Runs the discarding matcher algorithm on multiple datasets with different threshold values. It calculates various performance metrics such as accuracy, precision, recall, F1 score, cost, and duration. |
discarding_matcher_tradeoff.py | Generate and analyze performance/cost trade-off for the discarding matcher based on F1 decrease thresholds. Calculates F1 decrease, relative cost, and relative duration for each dataset and threshold. |
discarding_matcher_tradeoff_abs.py | Create tables of absolute cost and time required to run the LLM matcher and the discarding matcher at various F1 decrease thresholds. |
discarding_matcher_vis.py | Generates performance comparison plots for the discarding matcher. |
Module | Purpose |
---|---|
discarding_selective_matcher.py | Implements the discarding selective matcher and includes functions for evaluating its classification performance, cost and duration. |
discarding_selective_matcher_allstats_table.py | Creates a table for comparing different matcher architectures based on their discarding error, cost, time and classification metrics. |
discarding_selective_matcher_contour.py | Create contour plots which map the discard and label fractions to mean F1, precision, and recall. |
discarding_selective_matcher_eval.py | Calculates the mean values across datasets for specified metrics, based on the results obtained by running the discarding selective matcher. |
discarding_selective_matcher_metric_table.py | Create a table which shows one metric like mean F1 across different label and discard fractions. |
discarding_selective_matcher_runner.py | Runs and evaluates the discarding selective matcher for various configurations. |
discarding_selective_matcher_sample_vs_full.py | Combines and compares the results on the full datasets and their sampled versions using different configurations of the discarding selective matcher (DSM). It generates a comparison table of the classification performance for each configuration. |
Module | Purpose |
---|---|
add_to_ditto_configs.py | Copies the datasets in DITTO format to the subfolder data/erllm of the ditto folder and adds the new datasets to the configs.json file in the ditto folder. |
ditto_combine_predictions.py | Based on the statistics of the train and valid sets and the results of running DITTO, calculates the precision, recall, and F1 score for the total dataset. |
sm_ditto_comparison.py | Creates a table containing F1 scores for DITTO and SM across all datasets. |
Module | Purpose |
---|---|
cost.py | Provides cost calculations for language models based on specified configurations, including input and output costs. |
evalrun.py | Methods for reading run files, deriving classification decisions, and calculating classification and calibration metrics. |
gpt.py | Module for obtaining completions from the older OpenAI Completions API. |
gpt_chat.py | Module for obtaining completions from the newer OpenAI Chat Completions API. |
llm_matcher.py | Provides functions to evaluate the performance of the LLM matcher on a set of run files obtained from OpenAI's API. It calculates various classification metrics, entropies, and calibration results. |
prompt_data.py | Handles serialization of labeled entity pairs and saves the result into JSON. |
prompts.py | Combines serialized entities from JSON file with prompt prefix/postfix to create full prompts passed to OpenAI's API. |
Module | Purpose |
---|---|
selective_classifier.py | Run and evaluate selective classification. |
selective_classifier_runner.py | Runs selective classification over ranges of threshold/coverage parameters on multiple datasets. |
selective_classifier_tab.py | Create table of F1 scores per dataset for different coverages. |
selective_classifier_vis.py | Generates classification performance comparison plots for selective classification. |
Module | Purpose |
---|---|
random_table.py | Creates a LaTeX comparison table of F1 scores between the LLM matcher and random labeling at different label fractions. |
random_table_with_sd.py | Creates a LaTeX table displaying the standard deviation of F1 scores for different fractions of random labeling. |
selective_matcher.py | Implements the selective matcher and the labeling of randomly chosen predictions. It applies these to predictions on different datasets and calculates various classification metrics. |
selective_matcher_runner.py | Runs and evaluates the selective matcher across parameter ranges and datasets. |
selective_matcher_vs_base_table.py | Creates a LaTeX comparison table of F1 scores between the LLM matcher and the selective matcher at different label fractions. |
Module | Purpose |
---|---|
attribute_comparison.py | Creates per-dataset and mean comparison tables for comparing entity serialization schemes with and without attribute names. |
data_errors.py | Generates a comparison table of the LLM matcher's mean F1, precision, and recall across datasets in the presence of data errors. |
The datasets are already prepared under data/benchmark_datasets/existingDatasets.

For completeness, we outline how to download the datasets (all except DBPedia) published by Papadakis, George, Nishadi Kirielle, Peter Christen, and Themis Palpanas.
- **Download the Archive:** Download the `magellanExistingDatasets.tar.gz` file from https://zenodo.org/records/8164151.
- **Navigate to the Directory:** Open a terminal or command prompt and navigate to the location where the downloaded file is stored.
- **Extract the Archive:** Use the following command to extract the contents of the archive:

  ```bash
  tar -xvzf magellanExistingDatasets.tar.gz
  ```

- **Move the Directory:** After extraction, move the existingDatasets directory to the desired location (data/benchmark_datasets in this case):

  ```bash
  mv existingDatasets data/benchmark_datasets/
  ```

- **Verify the Structure:** Confirm that the directory structure now looks like this:

  ```
  ├── data
  │   └── benchmark_datasets
  │       └── existingDatasets
  │           ├── ... (contents of the existingDatasets directory)
  ```
Each dataset consists of five CSV files: `tableA.csv`, `tableB.csv`, `test.csv`, `train.csv`, and `valid.csv`. `tableA.csv` and `tableB.csv` contain the entity descriptions in full. For example, the beer dataset is formatted as follows:

```csv
id,Beer_Name,Brew_Factory_Name,Style,ABV
12,Lagunitas Lucky 13 Mondo Large Red Ale,Lagunitas Brewing Company,American Amber / Red Ale,8.65%
13,Ruedrich's Red Seal Ale,North Coast Brewing Co.,American Amber / Red Ale,5.40%
14,Boont Amber Ale,Anderson Valley Brewing Company,American Amber / Red Ale,5.80%
15,American Amber Ale,Rogue Ales,American Amber / Red Ale,5.30%
```
`test.csv`, `train.csv`, and `valid.csv` contain labeled pairs and repeat the entity descriptions. For example:

```csv
_id,label,table1.id,table2.id,table1.Beer_Name,table2.Beer_Name,table1.Brew_Factory_Name,table2.Brew_Factory_Name,table1.Style,table2.Style,table1.ABV,table2.ABV
0,0,1219,2470,Bulleit Bourbon Barrel Aged G'Knight,Figure Eight Bourbon Barrel Aged Jumbo Love,Oskar Blues Grill & Brew,Figure Eight Brewing,American Amber / Red Ale,Barley Wine,8.70%,-
1,0,492,1635,Double Dragon Imperial Red Ale,Scuttlebutt Mateo Loco Imperial Red Ale,Phillips Brewing Company,Scuttlebutt Brewing Co.,American Amber / Red Ale,American Strong Ale,8.20%,7.10%
2,1,3917,2224,Honey Basil Amber,Rude Hippo Honey Basil Amber,Rude Hippo Brewing Company,18th Street Brewery,American Amber / Red Ale,Amber Ale,7.40%,7.40%
```
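For illustration, here is one way such a labeled-pair file could be read back into one attribute dictionary per side (a sketch only; the file path is an assumption, and load_ds.py contains the actual loading code):

```python
import pandas as pd

# Sketch: read a labeled-pair CSV and rebuild one attribute dict per side.
# The path is illustrative; load_ds.py implements the real loading logic.
df = pd.read_csv("data/benchmark_datasets/existingDatasets/beer/test.csv")

pairs = []
for _, row in df.iterrows():
    left = {c.removeprefix("table1."): row[c] for c in df.columns if c.startswith("table1.")}
    right = {c.removeprefix("table2."): row[c] for c in df.columns if c.startswith("table2.")}
    pairs.append((left, right, row["label"]))
```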
The subsampled DBpedia versions used in the thesis are already contained in the repo. If you want to work with the full DBPedia dataset directly, follow these instructions.
- **Download the Archive:** Download the archive `dbpediaText.tar.gz` from https://zenodo.org/records/10059096.
- **Extract and Move to the Correct Directory:** Extract the archive and move the files `cleanDBPedia1out`, `cleanDBPedia2out`, and `newDBPediaMatchesout` to data/dbpedia_raw.
- **Verify the Structure:** Confirm that the directory structure now looks like this:

  ```
  ├── data
  │   └── dbpedia_raw
  │       ├── cleanDBPedia1out
  │       ├── cleanDBPedia2out
  │       └── newDBPediaMatchesout
  ```
The Python files in erllm/dataset/dbpedia are used to create the sampled DBPedia dataset used in this work.
The JedAIToolkit contains the original copy of the DBPedia dataset in .jso format, which is a serialized Java object.
To make this dataset easier to use, we submitted a pull request to convert it to .txt files.
We shared these with the authors of JedAIToolkit, who uploaded them to https://zenodo.org/records/10059096.
We do not use .csv because there is no fixed schema.
The files `cleanDBPedia1out` and `cleanDBPedia2out` contain the entities. Each line corresponds to a different entity profile and has the following structure (where n is the number of attributes, and aname and aval are the attribute names and values):

```
numerical_id , uri , n , aname_0 , aval_0 , aname_1 , aval_1 , ... , aname_n , aval_n
```
That is, the separator is a comma surrounded by single spaces (space, comma, space). Commas (`,`) occurring in the original data have been replaced with `,,`. This must be accounted for when reading the data.
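A minimal parsing sketch under these assumptions (split on the separator first, then undo the `,,` escaping; load_dbpedia.py contains the actual loader):

```python
def parse_profile(line: str):
    # Split on the " , " separator first; only then restore escaped commas,
    # so commas inside attribute values cannot be mistaken for separators.
    fields = [f.replace(",,", ",") for f in line.rstrip("\n").split(" , ")]
    numerical_id, uri, n = fields[0], fields[1], int(fields[2])
    assert len(fields) == 3 + 2 * n
    # Keep (aname_i, aval_i) pairs as a list, since attribute names may repeat.
    attributes = list(zip(fields[3::2], fields[4::2]))
    return numerical_id, uri, attributes
```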
The file `newDBPediaMatchesout` contains matching profile pairs. Each line has the format:

```
numerical_id_0 , numerical_id_1
```
Except for DBpedia, use sample_ds.py to generate subsampled versions. Our subsampled versions are already present under data/existingDatasets, each stored in a directory with the _1250 suffix. We use unsupervised approaches and thus combine the pairs sampled from the original `test.csv`, `train.csv`, and `valid.csv` into a single `test.csv`.
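For illustration, label-ratio-preserving sampling can be done with a stratified draw (a sketch assuming a pandas DataFrame with a `label` column; sample_ds.py contains the actual implementation):

```python
import pandas as pd

def sample_preserving_label_ratio(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    # Stratified sampling: draw the same fraction from each label group so the
    # match/non-match ratio of the original dataset is (approximately) preserved.
    return df.groupby("label").sample(frac=n / len(df), random_state=seed)
```

The `_1250` directory suffix suggests a sample size of n = 1250 pairs.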
If you want to generate new DBpedia samples, you first need to obtain the full DBpedia dataset as outlined in Full DBPedia.
prompt_data.py serializes each entity in a dataset to a string using a serialization function, which determines how the entities are represented in the prompts. The output is a JSON file.
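For illustration, such a serialization function might look like this (hypothetical; the actual serialization schemes are implemented with the Entity classes in erllm.dataset and compared in erllm.serialization_cmp):

```python
def serialize(attributes: dict, with_names: bool = True) -> str:
    # Hypothetical scheme: one string per entity, optionally with attribute names.
    if with_names:
        return ", ".join(f"{name}: {value}" for name, value in attributes.items())
    return ", ".join(str(value) for value in attributes.values())

serialize({"Beer_Name": "Boont Amber Ale", "ABV": "5.80%"})
# -> 'Beer_Name: Boont Amber Ale, ABV: 5.80%'
```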
prompts.py takes this file as input and adds a prompt pre- and postfix. An example prefix is "Do the two entity descriptions refer to the same real-world entity? Answer with 'Yes' if they do and 'No' if they do not." The output is another JSON file.
gpt.py reads this file and constructs the full prompt from pre- and postfix and the serialized profile pair. It then sends the prompt to the OpenAI API and retrieves the response which contains token probabilities and more. gpt.py saves each prompt together with the response into a JSON run file.
All composite matchers use these run files with the cached API responses and do not query the API live, which saves cost and time. For the discarder, discarder.py precomputes the examined similarity functions for all profile pairs in all datasets. We save the similarity value and computation duration for all pairs of a dataset into one similarity file per dataset. All composite matchers containing a discarder use this similarity file with the precomputed results.
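To make the discarder step concrete, here is a minimal sketch using token-set Jaccard similarity (the threshold and tokenization are illustrative; discarder.py computes the full range of similarity functions and caches them in the similarity files):

```python
def jaccard(a: str, b: str) -> float:
    # Token-set Jaccard similarity between two serialized entity strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def discard(pairs, threshold=0.1):
    # Pairs below the threshold are declared non-matches outright; only the
    # remaining pairs are forwarded to the (expensive) LLM matcher.
    return [(a, b) for a, b in pairs if jaccard(a, b) >= threshold]
```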
The run files and similarity files are the only basic data needed for simulating all matcher architectures and for data analysis (e.g. generating confidence histograms). This means that all other scripts in erllm operate on these data or on information derived from them.