boychaboy/KOLD


🥶 KOLD: Korean Offensive Language Dataset

Repository for the KOLD dataset. The paper was accepted at EMNLP 2022 (main conference, long paper).
Authors: Younghoon Jeong, Juhyun Oh, Jongwon Lee, Jaimeen Ahn, Jihyung Moon, Sungjoon Park, and Alice Oh
Institutions: KAIST, Softly AI

Note: This dataset must not be used as training data to automatically generate and publish offensive language online. Because the dataset is publicly released, we cannot prevent all malicious use. We do not condone any malicious use and urge researchers and practitioners to use it in beneficial ways (e.g., to filter out hate speech).

Paper

KOLD: Korean Offensive Language Dataset (arXiv version)
Camera-ready version link TBA

Illustration of Annotation Process

Annotation Process

Examples of KOLD

Target Group Attributes and Target Groups

Target Groups

Data

data/kold_v1.json

[
	{
		"guid": "kold-v1_00000",
		"source": "naver_news",
		"date": "2022-02-16",
		"title": "페미니즘이 범죄가 되는 나라 [삶과 문화]",
		"comment": "남녀평등 주장할 거면 여성징병제에도 동의하라고ㅋㅋㅋ 그리고 내 말에 그냥 시비만 걸지 말고 혜택은 다 쳐받으면서 왜 차별받는다고 말하는지 말해보라고ㅋㅋㅋ",
		"OFF": true,
		"TGT": "group",
		"GRP": "others-feminist",
		"OFF_span": " 쳐받으면서 왜 차별받는다고 말하는지 말해보라고ㅋㅋㅋ",
		"TGT_span": "",
		"raw_labels": [
			{"offensiveness": true,
			 "annotator_id": 191510,
			 "off_start_idx": [57],
			 "off_end_idx": [84],
			 "target": [["group"]],
			 "target_group": [["집단-성 정체성-여성"]],
			 "tgt_start_idx": [],
			 "tgt_end_idx": []},
			{"offensiveness": true,
			 "annotator_id": 192109,
			 "off_start_idx": [56],
			 "off_end_idx": [84],
			 "target": [["not specified", "group"]],
			 "target_group": [["집단-성 정체성-페미니스트", "알 수 없음"]],
			 "tgt_start_idx": [],
			 "tgt_end_idx": []},
			{"offensiveness": true,
			 "annotator_id": 193299,
			 "off_start_idx": [0],
			 "off_end_idx": [84],
			 "target": [["group"]],
			 "target_group": [["집단-성 정체성-페미니스트"]],
			 "tgt_start_idx": [],
			 "tgt_end_idx": []}
		]
	}
	...
]
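
The record layout above can be read with nothing but the standard library. The following is a minimal sketch rather than an official loader: the helper names (`load_kold`, `annotator_spans`, `label_counts`) are hypothetical. It assumes the file parses as standard JSON, that the `*_start_idx` / `*_end_idx` values are 0-based character offsets into `comment`, and that the end index is inclusive.

```python
import json
from collections import Counter
from pathlib import Path


def load_kold(path="data/kold_v1.json"):
    """Read the dataset file and return the list of records."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def annotator_spans(record, prefix="off"):
    """Return the text each annotator marked, one list per annotator.

    Assumption (not stated in the README): indices are 0-based character
    offsets into `comment`, and the end index is inclusive.
    Use prefix="tgt" to read the target spans instead.
    """
    comment = record["comment"]
    spans = []
    for label in record["raw_labels"]:
        starts = label[f"{prefix}_start_idx"]
        ends = label[f"{prefix}_end_idx"]
        spans.append([comment[s:e + 1] for s, e in zip(starts, ends)])
    return spans


def label_counts(records):
    """Count records per (OFF, TGT) pair."""
    return Counter((r["OFF"], r["TGT"]) for r in records)


if __name__ == "__main__":
    records = load_kold()
    print(len(records), "records")
    print(label_counts(records).most_common())
    print(annotator_spans(records[0]))
```

In the sample record, the second annotator's indices (56 and 84) cover 29 characters, which matches the length of `OFF_span` including its leading space. This is what suggests the inclusive-end reading, but it is worth checking against a few more records before relying on it.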
