The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. Each sentence is collected from a public language corpus and contains at least one identity group name and an attribute.

google-research-datasets/GeniL

GeniL

This repository contains data resources for the paper "GeniL: A Multilingual Dataset on Generalizing Language" (Davani et al., 2024; arXiv:2404.05866).

The GeniL dataset was created to support the detection of various types of generalization in language. We define a generalization as a broad statement (attribute) made about a group of people (identity). A generalizing sentence can be either promoting (it explicitly states and endorses a stereotype or generalization about a group of people) or mentioning (it references a stereotype or generalization without necessarily endorsing or promoting it).

This multilingual dataset covers sentences in English, French, Spanish, Portuguese, Arabic, Hindi, Bengali, Malay, and Indonesian, and each language is annotated by three native speakers. Each sentence is collected from a public language corpus (mC4, the multilingual Common Crawl corpus) and contains at least one identity group name and one attribute. Identity and attribute pairs are selected from the Multilingual SeeGULL dataset.

Dataset Description

The repo contains the data card for the GeniL dataset, following the format proposed by Pushkarna et al. The data card includes details of the dataset such as intended usage, field names and meanings, and annotator recruitment and payment. Each file in the dataset folder represents the annotations for a specific language and includes the following columns:

  • sentence: a piece of text collected from the mC4 dataset that mentions at least one identity group and one attribute.
  • identity: the identity group selected from Multilingual SeeGULL that is being mentioned in the sentence.
  • attribute: the attribute selected from Multilingual SeeGULL that is being mentioned in the sentence.
  • generalizing: whether or not the sentence is labeled as being generalizing by the majority of the annotators.
  • promoting: if the sentence is generalizing, this column shows whether or not the sentence is labeled as promoting the generalization by the majority of the annotators.
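
The columns above can be combined into a single three-way category (not generalizing, mentioning, promoting). The following is a minimal sketch, not an official loader. It assumes the per-language files are comma-separated with a header row and that the two label columns hold boolean-like values such as `True`/`False` or `1`/`0`; neither assumption is confirmed by this README, so check the data card first. The helper names and the sample rows are hypothetical placeholders, not real dataset content.

```python
import csv
import io
from collections import Counter

# Assumption: label columns use boolean-like strings. Adjust this set
# after checking the actual encoding in the data card.
TRUTHY = {"1", "true", "yes"}


def is_positive(value):
    """Return True if a label cell looks like a positive label."""
    return str(value).strip().lower() in TRUTHY


def categorize(row):
    """Map one row to 'not_generalizing', 'mentioning', or 'promoting'.

    'promoting' is only consulted when 'generalizing' is positive,
    mirroring the column descriptions above.
    """
    if not is_positive(row["generalizing"]):
        return "not_generalizing"
    return "promoting" if is_positive(row["promoting"]) else "mentioning"


def summarize(csv_file):
    """Count rows per category from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(categorize(row) for row in reader)


# Placeholder rows for illustration only (not taken from GeniL).
SAMPLE = """sentence,identity,attribute,generalizing,promoting
placeholder sentence 1,group_a,attr_x,True,True
placeholder sentence 2,group_a,attr_y,True,False
placeholder sentence 3,group_b,attr_x,False,False
"""

counts = summarize(io.StringIO(SAMPLE))
print(dict(counts))
```

To run the same summary on a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `summarize`, where `path` points to one of the per-language files in the dataset folder.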

Citation

@misc{davani2024genil,
      title={GeniL: A Multilingual Dataset on Generalizing Language},
      author={Aida Mostafazadeh Davani and Sagar Gubbi and Sunipa Dev and Shachi Dave and Vinodkumar Prabhakaran},
      year={2024},
      eprint={2404.05866},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
