GRACE-Benchmark: Standardized Corpora for Evaluating AI Tools in CSEM Investigations

The GRACE-Benchmark is a collection of curated datasets for evaluating AI tools for CSEM analysis. It is a result of the GRACE project, which aimed to equip European LEAs with advanced analytical and investigative capabilities to respond to the spread of online CSEM. The benchmark covers diverse applications in CSEM investigations across various AI tasks, including face recognition, age/gender identification, video super resolution, speech recognition, language identification, text classification, and NER, among others.

Different processes were applied to obtain benchmark corpora for image, video, audio, and text processing applications in the context of CSEM. These processes involved data selection and augmentation strategies adapted to each application and data source. We ensured that all considered datasets were publicly available and licensed to permit derivative works.

The full dataset can be requested by filling in the form at:

https://opendatasets.vicomtech.org/di01-grace-benchmark/48f947cc

An automatic email is then sent with temporary links to download either the full benchmark or the tool-specific datasets.

Image and Video Benchmarks

The GRACE benchmark datasets for image and video processing tools were formed by adapting source datasets to the CSEM domain. The source corpora were balanced by the age and gender of the people depicted. Data augmentation for these applications included image blurring and face occlusion.

Image/Video Benchmark	Source Datasets	Access Link
Face Recognition	VGGFace2
Visual Age and Gender Estimation	UTKFace, AgeDB, APPA-REAL, IMDB-Face
Tattoo and Scar Detection	DeMSI
Video Super Resolution	REDS
Object Detection and Recognition	NYU Depth V2
Scene Text Recognition	ICDAR 2019
Image Meme/Viral Detection	Kaggle Memes, TextOCR
Visual Tampering Detection	MISD
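
The blurring and occlusion augmentations mentioned above can be illustrated with a minimal sketch. This is not the GRACE pipeline: the function names `box_blur` and `occlude`, the box-filter choice, and the representation of an image as nested lists of grayscale values are all assumptions made to keep the example self-contained.

```python
from typing import List

Image = List[List[int]]  # hypothetical representation: grayscale pixels, 0-255


def box_blur(img: Image, radius: int = 1) -> Image:
    """Blur by replacing each pixel with the mean of its square neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0, 0
            # The window is clipped at the image borders.
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total // count
    return out


def occlude(img: Image, top: int, left: int, height: int, width: int,
            fill: int = 0) -> Image:
    """Cover a rectangular region (for example part of a face) with a constant value."""
    out = [row[:] for row in img]  # copy so the input image is not modified
    for y in range(top, min(top + height, len(out))):
        for x in range(left, min(left + width, len(out[0]))):
            out[y][x] = fill
    return out
```

In practice the occluded rectangle would come from a face detector and the blur would be applied with an image library; both are outside the scope of this sketch.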

Audio Benchmarks

Audio benchmarks were created to evaluate each AI tool in more realistic settings, close to the CSEM domain. Three data augmentation strategies were applied to each source dataset: noise addition, audio event insertion, and child speech simulation.

Audio Benchmark	Source Datasets	Access Link
Speech Recognition and Keyword Spotting	CommonVoice, MediaSpeech, Voxforge, Spoken Wikipedia Corpus, Polish Parl
Audio Event Detection	AudioSet
Speaker Identification	VoxCeleb
Acoustic Age and Gender Estimation	NISP
Language Identification	Voxforge
Dialect Identification	CommonVoice, L2-ARCTIC, Google Speech Resources, Voxforge
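
As an illustration of the noise addition strategy, the sketch below mixes Gaussian noise into a signal at a chosen signal-to-noise ratio (SNR). The function names, the use of white Gaussian noise, and the SNR parametrisation are assumptions for this example; the README does not specify the noise types or levels used in GRACE.

```python
import math
import random
from typing import List


def rms(samples: List[float]) -> float:
    """Root mean square amplitude of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise_at_snr(signal: List[float], snr_db: float, seed: int = 0) -> List[float]:
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB.

    Hypothetical helper: the noise source and SNR values are assumptions.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR_dB = 20 * log10(rms(signal) / rms(noise)), solved for the noise RMS.
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(signal, noise)]
```

Audio event insertion could follow the same pattern, replacing the Gaussian noise with a recorded event that is scaled and added at a chosen offset.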

Text Benchmarks

The text-based applications covered NER, relationship extraction, and text classification. The selection and augmentation processes used to adapt each source corpus to the CSEM domain involved balancing the entity types and adding different kinds of noise.

Text Benchmark	Source Datasets	Access Link
NER	WNUT'17
Relationship Extraction	DocRED
Text Classification	DUTA
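
The two text adaptation steps described above, balancing entity types and adding noise, can be sketched as follows. Both helpers are hypothetical: the README does not state how balancing was done or which noise types were used, so this example assumes one entity type per example, balancing by downsampling, and noise in the form of adjacent-letter swaps that mimic typos.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Example = Tuple[str, str]  # (text, entity_type); a simplifying assumption


def balance_by_type(examples: List[Example], seed: int = 0) -> List[Example]:
    """Downsample every entity type to the size of the rarest type."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for text, entity_type in examples:
        groups[entity_type].append((text, entity_type))
    n = min(len(group) for group in groups.values())
    balanced = []
    for entity_type in sorted(groups):  # sorted for reproducible output order
        balanced.extend(rng.sample(groups[entity_type], n))
    return balanced


def add_char_noise(text: str, rate: float, seed: int = 0) -> str:
    """Swap adjacent letters with probability `rate` to simulate typing errors."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most one position
        else:
            i += 1
    return "".join(chars)
```

For token-level NER corpora such as WNUT'17, a sentence can contain several entity types, so a real balancing step would need a more careful sampling scheme than this one-label-per-example simplification.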

Authors

This dataset was collected within the GRACE project by researchers from VICOMTECH, CERTH, and GVIS.

In particular, the following researchers collaborated on the creation and curation of the dataset:

Juan Camilo Vásquez-Correa
Aitor García-Pablos
Aitor Álvarez
Leyanis López-Ávila
Javier Calle-Armendariz
Konstantinos Karageorgos
Kassiani Zafeirouli
Anastasios Dimou
Petros Daras
Alicia Martínez-Mendoza
Andrés Carofilis-Vasco
Enrique Alegre-Gutiérrez
Peter Leskovský
Arantza Del Pozo

Contact

{jcvasquez,pleskovsky,adelpozo}@vicomtech.org

License

To be defined

Other relevant information

If you use this dataset, please cite the following paper:

(INCLUDE PAPER REFERENCE WHEN PUBLISHED)
