A Quick Way To Check For Contamination In Datasets

This is just a simple script to use roberta to find paraphrased or directly copied benchmark examples in a dataset.

You can replace the files with the ones you intend to use and set the batch size and similarity threshold to whatever you want.

It's made to run on GPU because it's faster to do it this way.

git clone https://github.com/Kquant03/Benchmark-Contamination-Checker
cd Benchmark-Contamination-Checker
pip install -r requirements.txt
python3 cosine.py

I tested it against benchmark questions I paraphrased myself and it worked fine. I needed this for a dataset I generated with Nemotron.

Note: If you're going to test it against another benchmark, reformat it to where the only felds are "question" and "answer"

Also, it's made to test against ShareGPT datasets. So my benchmark is reformatted to question/answer jsonl and my dataset I'm testing is ShareGPT.

Finally, a 12GB 3060 can handle 2-4 processes while an SXM A100 80GB can handle about 24.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
README.md		README.md
arc_c_test_final.jsonl		arc_c_test_final.jsonl
cheating_examples.jsonl		cheating_examples.jsonl
cosine.py		cosine.py
generated_conversations.jsonl		generated_conversations.jsonl
mmlu_final.jsonl		mmlu_final.jsonl
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

A Quick Way To Check For Contamination In Datasets

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Kquant03/Benchmark-Contamination-Checker

Folders and files

Latest commit

History

Repository files navigation

A Quick Way To Check For Contamination In Datasets

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages