feedzai/data-bias-fraud-study

A Data-Centric Study on Unfairness in Fraud Detection

This is the repository for the KDD 2023 Applied Data Science Track submission "A Data-Centric Study on Unfairness in Fraud Detection".

This repository contains:

  • Code and data to reproduce the plots shown in the results section of the paper.
  • Code to reproduce the paper's experiments on a realistic, publicly available, state-of-the-art bank account fraud dataset suite.

Key Contributions

  • A formal taxonomy to characterize data bias between a protected attribute, other features, and the target variable.
  • Experimental results for a comprehensive suite of scenarios on the fairness-accuracy trade-offs that ML models make under distinct types of data bias, pertinent to, but not restricted to, fraud detection.
  • A demonstration of how models can shape data bias, and consequently unfairness, in dynamic environments.
  • Evidence that changing the data bias settings changes the picture of algorithmic fairness, and with it how algorithms compare against each other.
  • Raising awareness of the issue of variance in fairness measurements, underlining the importance of employing robust models and metrics.
  • Evaluation of the utility of simple unfairness mitigation methods under distinct data bias conditions.
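
To make the notion of a fairness-accuracy trade-off concrete, here is a minimal sketch in plain Python. It assumes a binary protected attribute, binary predictions, and a false-positive-rate ratio as the fairness measure. Both the metric choice and the names `group_fpr`, `fpr_ratio`, and `recall` are illustrative assumptions, not taken from this repository's code.

```python
from collections import defaultdict


def group_fpr(y_true, y_pred, groups):
    """False positive rate per group, computed over legitimate (label 0) cases."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for label, pred, group in zip(y_true, y_pred, groups):
        if label == 0:
            negatives[group] += 1
            false_pos[group] += int(pred == 1)
    return {g: false_pos[g] / negatives[g] for g in negatives}


def fpr_ratio(y_true, y_pred, groups):
    """Smallest group FPR divided by the largest; 1.0 means equal FPRs."""
    rates = group_fpr(y_true, y_pred, groups)
    lowest, highest = min(rates.values()), max(rates.values())
    return lowest / highest if highest > 0 else 1.0


def recall(y_true, y_pred):
    """Share of fraud cases (label 1) that the model flags."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)


# Toy data (made up for illustration only).
y_true = [0, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

performance = recall(y_true, y_pred)
fairness = fpr_ratio(y_true, y_pred, groups)
```

Reporting a performance value and a fairness value side by side for each model is one simple way to compare algorithms under different data bias settings.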

Plot Reproducibility

Scenario 5 plot from the paper (fraudster adversarial behaviour).

  • paper_plots.ipynb contains code to reproduce each plot in the results section of the paper.
  • results_data/ folder contains the trained models' evaluation results for each experiment, which are used to create the plots.

Running Experiments on a Public Dataset

  • The notebook baf_experiments.ipynb contains code to reproduce the experiments of the paper on Bank Account Fraud (BAF), a publicly available bank account fraud dataset suite (the most similar to the one we used).

    • This suite contains a total of 6 realistic fraud datasets (one base dataset and 5 variants). Each dataset has a distinct type of data bias, so they can be used to reproduce some of the experiments conducted in our paper.
    • For example, the following correspondence can be made between the suite's datasets and the data bias Scenarios we analyzed in the paper:
      • Base dataset for the baseline
      • Variant I for Scenario 1
      • Variant II for Scenario 2
      • Variant III for Scenario 3
      • Variant V for Scenario 5

    Reproducibility for Scenarios 4 and 6 is a work in progress.
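
The correspondence above can be written down as a small lookup table. This is only a sketch: the dictionary keys and the helper `dataset_for` are hypothetical names chosen for illustration, and the values are the dataset names as listed in this README rather than file paths.

```python
# Scenario-to-dataset correspondence, as listed in this README.
# Scenarios 4 and 6 are intentionally absent (work in progress).
SCENARIO_TO_BAF_DATASET = {
    "baseline": "Base",
    "scenario_1": "Variant I",
    "scenario_2": "Variant II",
    "scenario_3": "Variant III",
    "scenario_5": "Variant V",
}


def dataset_for(scenario):
    """Return the BAF dataset name for a scenario, or raise if unsupported."""
    try:
        return SCENARIO_TO_BAF_DATASET[scenario]
    except KeyError:
        raise ValueError(
            f"No BAF dataset is mapped to {scenario!r} in this sketch"
        ) from None
```

Raising an explicit error for unmapped scenarios keeps a batch run from silently falling back to the wrong dataset.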

This code uses the hyperparameter configurations from the paper's experiments (sampled from the grids in the hyperparameter_spaces folder).
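
As a rough illustration of what sampling configurations from a grid can look like, the sketch below draws random configurations using only the standard library. The grid contents, the parameter names, and the function `sample_configs` are hypothetical; the actual grids live in the hyperparameter_spaces folder, and their format is not assumed here.

```python
import random

# Hypothetical grid for illustration; not one of the repository's grids.
EXAMPLE_GRID = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 500, 1000],
    "max_depth": [3, 6, 10],
}


def sample_configs(grid, n_configs, seed=42):
    """Draw n_configs configurations, picking one value per hyperparameter."""
    rng = random.Random(seed)  # local generator keeps sampling reproducible
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_configs)
    ]


configs = sample_configs(EXAMPLE_GRID, n_configs=5)
```

Fixing the seed means the same configurations are drawn on every run, which helps when comparing results across datasets.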

We are unable to provide further information on the original data due to privacy concerns.

Citing

WIP.
