Repository for the KOLD dataset, paper accepted in EMNLP 2022 (main, long).
Authors: Younghoon Jeong, Juhyun Oh, Jongwon Lee, Jaimeen Ahn, Jihyung Moon, Sungjoon Park, and Alice Oh
Institutions: KAIST, Softly AI
Note: This dataset must not be used as training data to automatically generate and publish offensive language online, but by publicly releasing it, we cannot prevent all malicious use. We do not condone any malicious use and urge researchers and practitioners to use it in beneficial ways (e.g., to filter out hate speech).
KOLD: Korean Offensive Language Dataset (arXiv version)
Camera-ready version link TBA
data/kold_v1.json
[
{
"guid": "kold-v1_00000",
"source": "naver_news",
"date": "2022-02-16",
"title": "ํ๋ฏธ๋์ฆ์ด ๋ฒ์ฃ๊ฐ ๋๋ ๋๋ผ [์ถ๊ณผ ๋ฌธํ]",
"comment": "๋จ๋
ํ๋ฑ ์ฃผ์ฅํ ๊ฑฐ๋ฉด ์ฌ์ฑ์ง๋ณ์ ์๋ ๋์ํ๋ผ๊ณ ใ
ใ
ใ
๊ทธ๋ฆฌ๊ณ ๋ด ๋ง์ ๊ทธ๋ฅ ์๋น๋ง ๊ฑธ์ง ๋ง๊ณ ํํ์ ๋ค ์ณ๋ฐ์ผ๋ฉด์ ์ ์ฐจ๋ณ๋ฐ๋๋ค๊ณ ๋งํ๋์ง ๋งํด๋ณด๋ผ๊ณ ใ
ใ
ใ
",
"OFF": True,
"TGT": "group",
"GRP": "others-feminist",
"OFF_span": " ์ณ๋ฐ์ผ๋ฉด์ ์ ์ฐจ๋ณ๋ฐ๋๋ค๊ณ ๋งํ๋์ง ๋งํด๋ณด๋ผ๊ณ ใ
ใ
ใ
"
"TGT_span": ""
"raw_labels": [
{'offensiveness': True,
'annotator_id': 191510,
'off_start_idx': [57],
'off_end_idx': [84],
'target': [['group']],
'target_group': [['์ง๋จ-์ฑ ์ ์ฒด์ฑ-์ฌ์ฑ']],
'tgt_start_idx': [],
'tgt_end_idx': []},
{'offensiveness': True,
'annotator_id': 192109,
'off_start_idx': [56],
'off_end_idx': [84],
'target': [['not specified', 'group']],
'target_group': [['์ง๋จ-์ฑ ์ ์ฒด์ฑ-ํ๋ฏธ๋์คํธ', '์ ์ ์์']],
'tgt_start_idx': [],
'tgt_end_idx': []},
{'offensiveness': True,
'annotator_id': 193299,
'off_start_idx': [0],
'off_end_idx': [84],
'target': [['group']],
'target_group': [['์ง๋จ-์ฑ ์ ์ฒด์ฑ-ํ๋ฏธ๋์คํธ']],
'tgt_start_idx': [],
'tgt_end_idx': [],}
]
}
...
]