Fair Entity Matching

A fairness suite for auditing Entity Matching approaches

Companion repository for the paper "Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching".

Publication(s) to cite:

[1] Nima Shahbazi, Nikola Danevski, Fatemeh Nargesian, Abolfazl Asudeh, and Divesh Srivastava. "Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching." Proceedings of the VLDB Endowment 16, no. 11 (2023): 3279-3292.

[VLDB Publication] https://dl.acm.org/doi/abs/10.14778/3611479.3611525
Technical Report
VLDB Slides

⚠️ Notice:

To reproduce the experimental results from the paper more easily, please see this repository.

Installation

  • Clone the repo
  • Create a virtual environment (e.g., with venv or Conda)
  • Install any missing packages (e.g., with pip or Conda)
    • The main dependencies are standard packages (e.g., Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Py-EntityMatching)

Usage

Familiarize yourself with an example:

fairEM/run_example.py shows how to use the framework to examine the fairness behavior of the models on the NoFlyCompas dataset.

Data for reproducing the results:

  • Train/Test/Valid/TableA/TableB data for all the datasets, in the format accepted by each matcher: Link
    • Please note that you do not need this data to reproduce the results. It is only needed if you want to (re-)train the entity matching models used in this study, or any other entity matching models.
  • Model Predictions: Link
    • Please note that you need this data to recreate the results of our study. It includes the predictions of the 13 matchers for 8 datasets. The test sets are also included in the provided link, in the Deepmatcher folder for each dataset.

Reproducing the results:

  • To generate the results (plots) of the study, place the provided predictions and test data in the locations specified in the fairEM/experiments.py file and run it.
  • fairEM/threshold_experiments.py regenerates the heatmaps showing the effect of the matching threshold on the fairness and accuracy of the models.
  • fairEM/case_study_analysis.py can be used to inspect a model's behavior on specific cases such as TPs, FPs, FNs, and TNs.
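To make the relationship between the matching threshold, the four outcome types, and group fairness concrete, here is a minimal, self-contained sketch. It is not the code in fairEM/threshold_experiments.py or fairEM/case_study_analysis.py: the function names, the (group, score, label) record layout, and the toy data are all assumptions made for illustration, and the true positive rate gap is used only as one common example of a group fairness measure.

```python
from collections import Counter


def confusion_by_group(records, threshold=0.5):
    """Count TP/FP/FN/TN per group at a given matching threshold.

    Each record is a (group, score, label) tuple. This layout is
    hypothetical and is not the repository's prediction file format.
    """
    counts = {}
    for group, score, label in records:
        predicted = score >= threshold  # pair is declared a match
        if predicted and label:
            outcome = "TP"
        elif predicted and not label:
            outcome = "FP"
        elif not predicted and label:
            outcome = "FN"
        else:
            outcome = "TN"
        counts.setdefault(group, Counter())[outcome] += 1
    return counts


def true_positive_rate(c):
    """TPR = TP / (TP + FN); NaN when the group has no true matches."""
    positives = c["TP"] + c["FN"]
    return c["TP"] / positives if positives else float("nan")


# Toy data (made up): matcher scores and ground-truth labels for two groups.
records = [
    ("A", 0.9, 1), ("A", 0.6, 1), ("A", 0.4, 0), ("A", 0.7, 0),
    ("B", 0.8, 1), ("B", 0.3, 1), ("B", 0.2, 0), ("B", 0.1, 0),
]

for threshold in (0.5, 0.65):
    counts = confusion_by_group(records, threshold)
    tpr = {g: true_positive_rate(c) for g, c in counts.items()}
    print(threshold, tpr, "TPR gap:", abs(tpr["A"] - tpr["B"]))
```

On this toy data, raising the threshold from 0.5 to 0.65 turns one of group A's true positives into a false negative, so the TPR gap between the groups shrinks from 0.5 to 0.0 while group A's accuracy drops. Sweeping the threshold over a grid and recording both quantities is the kind of trade-off a threshold heatmap summarizes.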

Non-neural matchers:

  • Example rule specifications for the rule-based matcher, and the settings used for the non-neural matchers, are provided in the utils/entitymatching examples directory.
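For readers unfamiliar with rule-based matching, the sketch below illustrates the general idea in plain Python: each rule is a conjunction of similarity predicates, and a pair is declared a match if any rule fires. It does not use Py-EntityMatching and does not reproduce the rule specifications in this repository; the attribute names, similarity function, and thresholds are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical rules: each rule is a list of (attribute, similarity, threshold)
# predicates that must ALL hold; a pair matches if ANY rule holds.
RULES = [
    [("name", jaccard, 0.5), ("city", jaccard, 1.0)],
    [("name", jaccard, 1.0)],
]


def is_match(left, right, rules=RULES):
    """Return True if at least one rule is satisfied by the record pair."""
    return any(
        all(sim(left[attr], right[attr]) >= t for attr, sim, t in rule)
        for rule in rules
    )


a = {"name": "Jane A. Doe", "city": "Chicago"}
b = {"name": "Jane Doe", "city": "chicago"}
c = {"name": "John Roe", "city": "Chicago"}

print(is_match(a, b))  # name tokens overlap 2/3 and the city agrees
print(is_match(a, c))  # no name overlap, so neither rule fires
```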

Synthetic data generator:

  • The synthetic data generator/FacultyMatch and synthetic data generator/NoFlyCompas paths contain scripts for generating synthetic social data for entity matching. Users can run these scripts with a variety of settings, such as limiting the rate of non-matches in the output (i.e., manual blocking) or changing the number and type of perturbations.
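The two settings mentioned above (number of perturbations and rate of non-matches) can be pictured with the following minimal sketch. It is not the generator shipped in this repository: the function names, parameters, perturbation types, and sample names are all hypothetical, and real generators typically perturb several attributes rather than a single name string.

```python
import random


def perturb(name, n_perturbations, rng):
    """Apply up to n random character-level edits (swap, delete, or replace)."""
    chars = list(name)
    for _ in range(n_perturbations):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        kind = rng.choice(["swap", "delete", "replace"])
        if kind == "swap":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif kind == "delete":
            del chars[i]
        else:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def generate_pairs(names, n_perturbations=1, non_match_ratio=2, seed=0):
    """Return (left, right, label) tuples.

    Each name yields one match (the name vs. a perturbed copy) and at most
    `non_match_ratio` non-matches against other randomly chosen names,
    which caps the share of non-matches in the output.
    """
    rng = random.Random(seed)
    pairs = []
    for name in names:
        pairs.append((name, perturb(name, n_perturbations, rng), 1))
        others = [n for n in names if n != name]
        for other in rng.sample(others, min(non_match_ratio, len(others))):
            pairs.append((name, other, 0))
    return pairs


# Made-up names used only for this illustration.
names = ["alice johnson", "bob smith", "carol diaz", "dave lee"]
pairs = generate_pairs(names, n_perturbations=2, non_match_ratio=1)
for pair in pairs:
    print(pair)
```

Fixing the seed makes the output repeatable. Note that a perturbation can occasionally leave a string unchanged (for example, swapping two identical characters), so a stricter generator might re-draw in that case.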

Notice

This project is still under development, so please be aware of potential bugs and other issues. Use it at your own risk.

Contact

Feel free to contact the authors or open an issue if you run into any problems. We will try to respond as soon as possible.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.