The GRACE-Benchmark is a collection of curated datasets for evaluating AI tools for the analysis of child sexual exploitation material (CSEM). It is a result of the GRACE project, which aimed to equip European law enforcement agencies (LEAs) with advanced analytical and investigative capabilities to respond to the spread of online CSEM. The benchmark covers diverse applications in CSEM investigations across a range of AI tasks, including face recognition, age/gender identification, video super resolution, speech recognition, language identification, text classification, and named entity recognition (NER), among others.
Different processes were applied to obtain benchmark corpora for image, video, audio, and text processing applications in the context of CSEM. These processes involved data selection and augmentation strategies adapted to each application and data source. We ensured that all considered datasets were publicly available and licensed to permit derivative works.
To request the full dataset, fill in the form at:
https://opendatasets.vicomtech.org/di01-grace-benchmark/48f947cc
An automatic email will then be sent with temporary links to download either the full benchmark or the tool-specific datasets.
The GRACE benchmark datasets for image and video processing tools were built by adapting source datasets to the CSEM domain. The source corpora were balanced by the age and gender of the people depicted. Data augmentation for these applications included image blurring and face occlusion.
Image/Video Benchmark | Source Datasets | Access Link |
---|---|---|
Face Recognition | VGGFace2 | |
Visual Age and Gender Estimation | UTKFace, AgeDB, APPA-REAL, IMDB-Face | |
Tattoo and Scar Detection | DeMSI | |
Video Super Resolution | REDS | |
Object Detection and Recognition | NYU Depth V2 | |
Scene Text Recognition | ICDAR 2019 | |
Image Meme/Viral Detection | Kaggle Memes, TextOCR | |
Visual Tampering Detection | MISD | |
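
The two image augmentations named above, blurring and occlusion, can be illustrated with a minimal sketch. This is not the code used to build the benchmark: the function names `box_blur` and `occlude`, the box-filter blur, and the constant-fill rectangle are all assumptions chosen for illustration, and the occlusion region would in practice come from a face detector that is not shown here.

```python
from typing import List

# Grayscale image as rows of pixel intensities (assumed range 0-255).
Image = List[List[float]]


def box_blur(img: Image, radius: int = 1) -> Image:
    """Blur by averaging each pixel with its neighbours in a square window.

    Windows are clipped at the image border, so edge pixels average
    over fewer neighbours.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            window = [img[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            out[y][x] = sum(window) / len(window)
    return out


def occlude(img: Image, top: int, left: int, height: int, width: int,
            fill: float = 0.0) -> Image:
    """Return a copy of img with a rectangle overwritten by a constant value.

    The rectangle (hypothetically a detected face region) is clipped
    to the image bounds; the input image is left unchanged.
    """
    out = [row[:] for row in img]
    for y in range(max(0, top), min(len(img), top + height)):
        for x in range(max(0, left), min(len(img[0]), left + width)):
            out[y][x] = fill
    return out
```

A real pipeline would more likely use an image library with Gaussian blur and detector-driven masks; the pure-Python version only makes the two operations explicit.
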
Audio benchmarks were created to evaluate each AI tool in more realistic settings, close to the CSEM domain. Three data augmentation strategies were applied to each source dataset: noise addition, audio event insertion, and child speech simulation.
Audio Benchmark | Source Datasets | Access Link |
---|---|---|
Speech Recognition and Keyword Spotting | CommonVoice, MediaSpeech, VoxForge, Spoken Wikipedia Corpus, Polish Parl | |
Audio Event Detection | AudioSet | |
Speaker Identification | VoxCeleb | |
Acoustic Age and Gender Estimation | NISP | |
Language Identification | VoxForge | |
Dialect Identification | CommonVoice, L2-ARCTIC, Google Speech Resources, VoxForge | |
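
As an illustration of two of the audio strategies, noise addition and audio event insertion, the sketch below mixes a noise signal into speech at a chosen signal-to-noise ratio and overlays a short event at a given sample offset. It is a sketch under stated assumptions, not the benchmark's implementation: the function names, the SNR-based scaling, and the additive overlay are illustrative choices, and child speech simulation is not covered.

```python
import math
from typing import List, Sequence


def power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def add_noise_at_snr(speech: Sequence[float], noise: Sequence[float],
                     snr_db: float) -> List[float]:
    """Scale noise so that speech power / noise power matches snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_noise == 0.0:
        raise ValueError("noise signal is silent")
    # Target: p_speech / (gain**2 * p_noise) == 10 ** (snr_db / 10)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def insert_event(speech: Sequence[float], event: Sequence[float],
                 start: int, gain: float = 1.0) -> List[float]:
    """Overlay an audio event on a copy of speech, starting at sample index start.

    Event samples that would fall outside the speech signal are dropped.
    """
    out = list(speech)
    for k, e in enumerate(event):
        idx = start + k
        if 0 <= idx < len(out):
            out[idx] += gain * e
    return out
```

Fixing the SNR rather than the noise amplitude keeps the degradation comparable across recordings with different loudness.
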
The text benchmarks cover NER, relationship extraction, and text classification. The selection and augmentation processes used to adapt each source corpus to the CSEM domain involved balancing the entity types and adding different kinds of noise.
Text Benchmark | Source Datasets | Access Link |
---|---|---|
NER | WNUT'17 | |
Relationship Extraction | DocRED | |
Text Classification | DUTA | |
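
The kinds of noise added to the text corpora are not specified here. As one possible reading, the sketch below applies character-level noise (deletions, duplications, and swaps of adjacent characters) to alphabetic characters with a given probability. The function name `add_char_noise`, the three operations, and the `rate` parameter are assumptions made for illustration only.

```python
import random
from typing import List, Optional


def add_char_noise(text: str, rate: float = 0.1,
                   seed: Optional[int] = None) -> str:
    """Randomly delete, duplicate, or swap alphabetic characters.

    Each alphabetic character is perturbed with probability `rate`;
    digits, spaces, and punctuation never trigger a perturbation.
    Passing a seed makes the output reproducible.
    """
    rng = random.Random(seed)
    chars = list(text)
    out: List[str] = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(("delete", "duplicate", "swap"))
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend((c, c))
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                # Emit the next character first, then the current one.
                out.extend((chars[i + 1], c))
                i += 2
                continue
        # Unperturbed character (or a swap at the final position).
        out.append(c)
        i += 1
    return "".join(out)
```

For sequence-labelling tasks such as NER, noise of this kind would have to be applied per token so that the entity annotations stay aligned with the text.
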
This dataset was collected within the GRACE project by researchers from VICOMTECH, CERTH, and GVIS.
In particular, the following researchers collaborated in the dataset creation and curation process:
- Juan Camilo Vásquez-Correa
- Aitor García-Pablos
- Aitor Álvarez
- Leyanis López-Ávila
- Javier Calle-Armendariz
- Konstantinos Karageorgos
- Kassiani Zafeirouli
- Anastasios Dimou
- Petros Daras
- Alicia Martínez-Mendoza
- Andrés Carofilis-Vasco
- Enrique Alegre-Gutiérrez
- Peter Leskovský
- Arantza Del Pozo
{jcvasquez,pleskovsky,adelpozo}@vicomtech.org
To be defined
If you use this dataset, please cite the following paper:
(INCLUDE PAPER REFERENCE WHEN PUBLISHED)