A machine-generated text detection benchmarking framework based upon the original MGTBench project, featuring additional functionality that makes it easier to integrate custom datasets and new custom detection methods.
The framework also includes tools for automatic analysis of the benchmark results.
If you use the data, code, or information in this repository, please cite the following paper:
```bibtex
@misc{spiegel2023imgtb,
      title={{IMGTB}: A Framework for Machine-Generated Text Detection Benchmarking},
      author={Michal Spiegel and Dominik Macko},
      year={2023},
      eprint={2311.12574},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Currently, we support the following methods. To add a new method, see the documentation below:
- Metric-based methods:
- Model-based methods:
- Other
```bash
git clone https://github.com/kinit-sk/IMGTB.git
cd IMGTB
conda env create -f environment.yaml
conda activate IMGTB
```

or

```bash
git clone https://github.com/kinit-sk/IMGTB.git
cd IMGTB
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
```bash
# Run the benchmark on a locally available test dataset
# (by default, all methods are run):
python benchmark.py --dataset datasets/test_small.csv

# Run only selected methods:
python benchmark.py --dataset datasets/test_small.csv --methods EntropyMetric roberta-base-openai-detector

# Run Hugging Face detectors on a Hugging Face Hub dataset:
python benchmark.py --dataset xzuyn/futurama-alpaca huggingfacehub machine_only output --methods roberta-base-openai-detector Hello-SimpleAI/chatgpt-detector-roberta

# Specify different configurations through command-line arguments:
python benchmark.py --dataset datasets/test_small.csv --methods EntropyMetric --base_model_name prajjwal1/bert-tiny

# Run the benchmark using a configuration file:
python benchmark.py --from_config=example_config.yaml
```
You can specify parameters for the benchmark run in two ways:
- Use command-line arguments
- Create a YAML configuration file
To see a summary of all of the command-line arguments and options, see either the help message (by including the --help
option in your command) or the lib/config.py
source file.
See the example configuration file at example_config.yaml
explaining most of the available features, or see lib/default_config.yaml
for all currently supported parameters.
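For illustration only, a configuration file could look roughly like the sketch below; the key names are assumptions that mirror the command-line options, so consult example_config.yaml and lib/default_config.yaml for the actual schema.

```yaml
# Illustrative sketch only -- key names are assumptions mirroring the CLI options;
# see example_config.yaml and lib/default_config.yaml for the real schema.
methods:
  - EntropyMetric
  - roberta-base-openai-detector
datasets:
  - filepath: datasets/test_small.csv
    filetype: csv
```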
Most importantly, a filepath to the dataset has to be specified (or a folder, if your dataset is constructed from multiple files).
We support loading datasets that consist of multiple separate files: just put them into one common directory and use its path as the dataset filepath. (However, in this case, you will probably have to define your own dataset processing function; see below for more details.)
You can define multiple datasets, so that your chosen MGTD methods are evaluated against each of them. To define more than one dataset, use the --dataset
option once per dataset. E.g.:
```bash
python benchmark.py --dataset datasets/test_small.csv --dataset datasets/test_small_dir auto test_small_dir
```
The general form of the --dataset
option is:

```
--dataset FILEPATH FILETYPE PROCESSOR TEXT_FIELD LABEL_FIELD HUMAN_LABEL OTHER
```

The only required parameter is the dataset filepath; the other parameters are filled in with their default values if left empty.
We currently support all of the filetypes listed below, together with the following features:
- The default processor can parse all tabular data (including Hugging Face Hub datasets) with separate columns for text and label
- Use of different splits
- Different train-test split percentages
- Different subsets and splits for Hugging Face Hub datasets
- Testing on machine-only or human-only text by specifying the predefined machine/human-only dataset processing function

Supported filetypes:
- auto
- csv
- tsv
- xls
- xlsx
- json
- jsonl
- xml
- huggingfacehub
In the special case that the default processor cannot parse the chosen dataset, a processor function has to be specified in the dataset parameters. This is a function that processes your selected dataset files into a unified data format.
In case the provided functionality for parsing datasets is not enough, it is possible to define your own dataset processing function.
A processor is a function defined in the lib/dataset_loader.py
source file as follows:
Name: process_PROCESSOR-NAME (here, PROCESSOR-NAME is the chosen name of your processor, usually the name of the dataset)
Input: 2 arguments: a list of pandas DataFrames (one per dataset file) and a configuration dictionary holding the specific configuration for the currently processed dataset
Output: a dictionary of the following form:
{"train": {
"text": ["example texts here",...],
"label": [1,0,..]
},
"test": {
"text": ["example texts here",...],
"label": [1,0,..]
}
}
Example usage could be:

```bash
python benchmark.py --dataset datasets/test_dataset.csv
python benchmark.py --dataset datasets/test_dataset.csv csv myAwesomeTestProcessor
python benchmark.py --dataset datasets/test_dataset.csv csv myAwesomeTestProcessor ThisIsTextFieldName ThisIsLabelFieldName Human OtherTextDataToBePassedToProcessor
```

This will tie to the process_myAwesomeTestProcessor() function (it must be implemented beforehand), which will be given the raw content of datasets/test_dataset.csv.
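For illustration, such a processor might look like the minimal sketch below. The argument names and configuration keys are assumptions; only the function-name convention, the two-argument signature, and the output format come from the description above.

```python
# Hypothetical processor added to lib/dataset_loader.py.
# Argument names and config keys are assumptions; only the naming convention,
# the two-argument signature, and the output format follow the docs above.
from typing import Dict, List
import pandas as pd

def process_myAwesomeTestProcessor(file_dataframes: List[pd.DataFrame], config: Dict) -> Dict:
    df = pd.concat(file_dataframes, ignore_index=True)

    # Illustrative config keys for the column names and the human label.
    text_field = config.get("text_field", "text")
    label_field = config.get("label_field", "label")
    human_label = config.get("human_label", "human")

    texts = df[text_field].astype(str).tolist()
    labels = [0 if lbl == human_label else 1 for lbl in df[label_field]]  # 0 = human, 1 = machine

    # Simple 80/20 train-test split.
    split = int(0.8 * len(texts))
    return {
        "train": {"text": texts[:split], "label": labels[:split]},
        "test": {"text": texts[split:], "label": labels[split:]},
    }
```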
To integrate a new method, you need to define a new Experiment
subclass in the methods/implemented_methods directory. The main script in benchmark.py
will automatically detect your new method (unless you choose otherwise by configuring the --methods
option) and evaluate it on your chosen dataset.
If implementing a new metric-based or perturbation-based method, it might be useful to extend one of our abstract classes in
methods/abstract_methods/
, which implement a lot of the functionality for you, leaving you to implement just the specific metric/scoring function.
To implement a new method, you can use one of the templates in the methods/method_templates
directory. You will just have to fill in the not-yet-implemented methods and possibly tweak the __init__()
constructor.
Remember to always implement the run()
method (sometimes it is already implemented in the parent class). It should always return a JSON-compatible dictionary of results, as defined below.
Each experiment run should return a JSON-compatible dictionary with results with at least the following items:
- name - name of the experiment
- input_data - the data (texts and labels) that the method was trained/evaluated on, usually split into train and test sections
- predictions - predicted labels
- machine_prob - predicted probability that a given text is machine-generated
- metrics_results - evaluation of different classification metrics (e.g. Accuracy, Precision, F1...), usually split into train and test sections
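To make this concrete, a toy method could look roughly like the sketch below. The base-class import path and constructor signature are assumptions; consult the templates in methods/method_templates for the real interface.

```python
# Hypothetical file: methods/implemented_methods/text_length_baseline.py
# The import path and constructor signature are assumptions; see the
# templates in methods/method_templates for the actual interface.
from methods.abstract_methods.experiment import Experiment

class TextLengthBaseline(Experiment):
    """Toy detector: the longer the text, the more likely it is labeled machine-generated."""

    def __init__(self, data, name="TextLengthBaseline", **kwargs):
        super().__init__(data, name)  # assumed base-class constructor
        self.data = data
        self.name = name
        self.threshold = 300  # characters; illustrative only

    def run(self):
        test = self.data["test"]
        machine_prob = [min(len(t) / (2 * self.threshold), 1.0) for t in test["text"]]
        predictions = [int(p > 0.5) for p in machine_prob]
        correct = sum(int(p == l) for p, l in zip(predictions, test["label"]))
        return {
            "name": self.name,
            "input_data": self.data,
            "predictions": predictions,
            "machine_prob": machine_prob,
            "metrics_results": {"test": {"accuracy": correct / len(predictions)}},
        }
```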
### Support for Text Classification Hugging Face Hub models

Aside from locally defined Experiment classes, you can specify a Hugging Face Hub Text Classification model as the name of the method in three different ways:
- a string with the shortcut name of a pre-trained model to load from cache or download, e.g.: bert-base-uncased
- a string with the identifier name of a user-uploaded pre-trained model on the Hugging Face Hub, e.g.: dbmdz/bert-base-german-cased
- a path to a directory containing model weights saved using save_pretrained(), e.g.: ./my_model_directory/
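For example, any of these can be passed directly via the --methods option (the model names below are just placeholders):

```bash
python benchmark.py --dataset datasets/test_small.csv --methods Hello-SimpleAI/chatgpt-detector-roberta ./my_model_directory/
```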
While developing your new method, you might find some of the functionality in methods/utils.py useful.
Results of every benchmark run, together with the command-line parameters, are logged in the results/logs
folder.
At the same time, results of each method and dataset combination are saved in the results/methods
folder. (This way, you can see the results of individual method-dataset experiment runs with different parameters together in one place.)
Currently, it is only possible to run the analysis on logs (whole benchmark results). Support for the analysis of separate method/dataset results will be added.
The results analysis is run automatically after each benchmark run. Alternatively, you can run the analysis on a log manually, specifying the benchmark log filepath and a save path to store the analysis results:

```bash
results_analysis.py results/logs/SOME_BENCHMARK_RESULTS.json SAVE_PATH
```
Currently, we are able to visualize:
- Multiple metrics (Accuracy, Precision, Recall, F1 score) evaluated on the test data partition
- ROC curve analysis for metric-based methods
- F1 score for multiple different text length groups
- Prediction Probability Distribution - how much and how often the detection method is sure of its predictions
- Prediction Probability Error Distribution - how far the prediction was from the true label, and how often
- False Positives Analysis - analyzes the predictions for solely the negative samples. It uses the following terminology:

  | Label | | Prediction Probability |
  |-------|---|------------------------|
  | TN | True Negative | 0-20% machine |
  | PTN | Potentially True Negative | 20-40% machine |
  | UNC | Unclear | 40-60% machine |
  | PFP | Potentially False Positive | 60-80% machine |
  | FP | False Positive | 80-100% machine |

- False Negatives Analysis - analyzes the predictions for solely the positive samples. It uses the following terminology:

  | Label | | Prediction Probability |
  |-------|---|------------------------|
  | FN | False Negative | 0-20% machine |
  | PFN | Potentially False Negative | 20-40% machine |
  | UNC | Unclear | 40-60% machine |
  | PTP | Potentially True Positive | 60-80% machine |
  | TP | True Positive | 80-100% machine |

- Per-method per-dataset running time
- Per-method running time over multiple datasets
You can add your own analysis method by defining it in the results_analysis.py
source file.
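As a rough illustration (the structure of the benchmark log and the way analysis functions are registered are not documented here, so everything below is an assumption), a custom analysis could look like:

```python
# Hypothetical addition to results_analysis.py. The structure of the benchmark
# log and the registration mechanism are assumptions; adapt to the real code.
import matplotlib.pyplot as plt

def analyze_mean_machine_probability(benchmark_results: dict, save_path: str) -> None:
    """Plot the mean predicted machine probability per method."""
    names, means = [], []
    for method_name, result in benchmark_results.items():
        probs = result.get("machine_prob", [])
        if probs:
            names.append(method_name)
            means.append(sum(probs) / len(probs))
    plt.figure()
    plt.bar(names, means)
    plt.ylabel("Mean machine probability")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.savefig(f"{save_path}/mean_machine_probability.png")
```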
The framework, built upon the original MGTBench project, was designed and developed by Michal Spiegel (KINIT) under the supervision of Dominik Macko (KINIT).
Credit for the original MGTBench tool, created as a part of the MGTBench: Benchmarking Machine-Generated Text Detection paper, goes to its original designers and developers: Xinlei He (CISPA), Xinyue Shen (CISPA), Zeyuan Chen (Individual Researcher), Michael Backes (CISPA), and Yang Zhang (CISPA).