GRACE-Benchmark: Standardized Corpora for Evaluating AI Tools in CSEM Investigations

The GRACE-Benchmark is a collection of curated datasets for evaluating AI tools for the analysis of child sexual exploitation material (CSEM). It is a result of the GRACE project, which aimed to equip European law enforcement agencies (LEAs) with advanced analytical and investigative capabilities to respond to the spread of online CSEM. The benchmark covers diverse applications in CSEM investigations across various AI tasks, including face recognition, age and gender identification, video super resolution, speech recognition, language identification, text classification, and named entity recognition (NER), among others.

Different processes were applied to obtain benchmark corpora for image, video, audio, and text processing applications in the context of CSEM. These processes involved data selection and augmentation strategies adapted to each application and data source. We ensured that all considered datasets were publicly available and licensed to permit derivative works.

The full dataset can be requested by filling in the form at

https://opendatasets.vicomtech.org/di01-grace-benchmark/48f947cc

An automatic email will then be sent with temporary links to download either the full benchmark or the tool-specific datasets.

Image and Video Benchmarks

The GRACE benchmark datasets for Image and Video processing tools are formed by transforming source datasets to the CSEM domain. The source corpora were balanced in terms of age and gender of the persons. Data augmentation processes for these applications included image blurring and face occlusions.

Image/Video Benchmark	Source Datasets	Access Link
Face Recognition	VGGFace2
Visual Age and Gender Estimation	UTKFace, AgeDB, APPA-REAL, IMDB-Face
Tattoo and Scar Detection	DeMSI
Video Super Resolution	REDS
Object Detection and Recognition	NYU Depth V2
Scene Text Recognition	ICDAR 2019
Image meme/viral Detection	Kaggle Memes, TextOCR
Visual Tampering Detection	MISD
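
The two augmentations mentioned for the image and video corpora, blurring and face occlusion, can be illustrated with a minimal sketch. This is not the GRACE pipeline: the function names `occlude` and `box_blur`, the rectangle coordinates, and the use of a mean filter are assumptions made for illustration, and images are represented as plain lists of grayscale rows to keep the example dependency-free. A real pipeline would more likely use an image library and a face detector to place the occlusion.

```python
def occlude(image, top, left, height, width, fill=0):
    """Return a copy of a grayscale image (list of rows) with a filled rectangle.

    The rectangle position is a hypothetical stand-in for a detected face region.
    """
    out = [row[:] for row in image]
    for r in range(max(0, top), min(len(out), top + height)):
        for c in range(max(0, left), min(len(out[r]), left + width)):
            out[r][c] = fill
    return out


def box_blur(image, radius=1):
    """Return a copy blurred with a mean filter of side 2 * radius + 1.

    Windows are truncated at the image borders, so edge pixels average
    over fewer neighbours.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            new_row.append(sum(window) / len(window))
        out.append(new_row)
    return out


# Toy 6x6 image with a horizontal gradient (assumed data, not from the benchmark).
image = [[c * 40 for c in range(6)] for _ in range(6)]
augmented = box_blur(occlude(image, top=1, left=2, height=3, width=2), radius=1)
```

Both functions return new images and leave the input untouched, so several augmented variants can be derived from the same source sample.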

Audio Benchmarks

Audio benchmarks were created to evaluate each AI tool in more realistic settings, close to the CSEM domain. Three data augmentation strategies were applied to each source dataset: noise addition, audio event insertion, and child speech simulation.

Audio Benchmark	Source Datasets	Access Link
Speech Recognition and Keyword Spotting	CommonVoice, MediaSpeech, VoxForge, Spoken Wikipedia Corpus, Polish Parl
Audio Event Detection	AudioSet
Speaker Identification	VoxCeleb
Acoustic Age and Gender Estimation	NISP
Language Identification	VoxForge
Dialect Identification	CommonVoice, L2-ARCTIC, Google Speech Resources, VoxForge
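
Noise addition, one of the audio augmentation strategies listed above, is commonly done by mixing a noise recording into the speech at a chosen signal-to-noise ratio (SNR). The sketch below shows that idea under stated assumptions: the function names `power` and `mix_at_snr`, the 10 dB target, and the synthetic signals are illustrative and are not taken from the GRACE tooling. Audio is represented as plain lists of float samples.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB.

    The noise is trimmed to the speech length and scaled so that
    10 * log10(P_speech / P_scaled_noise) equals snr_db.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for real recordings (assumed data).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Lower `snr_db` values give a noisier mixture, so sweeping the value produces test conditions of graded difficulty from the same clean recording.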

Text Benchmarks

The text-based applications covered by the benchmark are NER, relationship extraction, and text classification. The selection and augmentation processes used to adapt each source corpus to the CSEM domain involved balancing the entity types and adding different kinds of noise.

Text Benchmark	Source Datasets	Access Link
NER	WNUT'17
Relationship Extraction	DocRED
Text Classification	DUTA
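
As an illustration of noise addition for text, the sketch below injects character-level typos into a tokenized sentence while keeping the number of tokens fixed, so token-level NER labels stay aligned. The specific noise types (adjacent-character swap and character deletion), the function name `add_char_noise`, the `rate` parameter, and the example sentence and labels are all assumptions; the source does not specify which noise types were used. Entity-type balancing is not shown.

```python
import random


def add_char_noise(tokens, rate=0.1, seed=None):
    """Return a noisy copy of tokens with the same length as the input.

    Each token longer than two characters is corrupted with probability
    `rate`, either by swapping two adjacent characters or by dropping one.
    Tokens are never removed or split, so per-token labels remain valid.
    """
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        if len(token) > 2 and rng.random() < rate:
            i = rng.randrange(len(token) - 1)
            if rng.random() < 0.5:
                # Swap the characters at positions i and i + 1.
                token = token[:i] + token[i + 1] + token[i] + token[i + 2:]
            else:
                # Drop the character at position i.
                token = token[:i] + token[i + 1:]
        noisy.append(token)
    return noisy


# Hypothetical example sentence with BIO-style labels.
tokens = ["Alice", "moved", "to", "Bilbao", "last", "year"]
labels = ["B-person", "O", "O", "B-location", "O", "O"]
noisy_tokens = add_char_noise(tokens, rate=0.5, seed=13)
pairs = list(zip(noisy_tokens, labels))
```

Passing a fixed `seed` makes the corruption reproducible, which matters when the same noisy corpus has to be shared across tools being compared.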

Authors

This dataset was collected within the GRACE project by researchers from VICOMTECH, CERTH, and GVIS.

In particular, the following researchers collaborated in the dataset creation and curation process:

Juan Camilo Vásquez-Correa
Aitor García-Pablos
Aitor Álvarez
Leyanis López-Ávila
Javier Calle-Armendariz
Konstantinos Karageorgos
Kassiani Zafeirouli
Anastasios Dimou
Petros Daras
Alicia Martínez-Mendoza
Andrés Carofilis-Vasco
Enrique Alegre-Gutiérrez
Peter Leskovský
Arantza Del Pozo

Contact

{jcvasquez,pleskovsky,adelpozo}@vicomtech.org

License

To be defined

Other relevant information

If you use this dataset, please cite the following paper:

(INCLUDE PAPER REFERENCE WHEN PUBLISHED)

Folders and files

01_face_recognition
02_visual_age_gender_recognition
03_tattoo_scar_detection
04_video_super_resolution
05_object_detection
06_scene_text_recognition
07_image_meme_viral_detection
08_visual_tampering_detection
09_speech_recognition
10_audio_event_detection
11_speaker_identification
12_acoustic_age_gender_estimation
13_language_identification
14_dialect_identification
15_named_entity_recognition
16_relationship_extraction
17_text_classification
README.md
