data/
├── dataset_preprocessing
│   └── ...
├── ner
│   │
│   ├── ar
│   │   ├── train.txt.tmp
│   │   ├── dev.txt.tmp
│   │   └── test.txt.tmp
│   │
│   ├── ...
│   │
│   └── zh
│       ├── train.txt.tmp
│       ├── dev.txt.tmp
│       └── test.txt.tmp
├── sa
│   │
│   ├── ar
│   │   ├── train.tsv
│   │   ├── dev.tsv
│   │   └── test.tsv
│   │
│   ├── ...
│   │
│   └── zh
│       ├── train.tsv
│       ├── dev.tsv
│       └── test.tsv
├── qa
│   │
│   ├── ar
│   │   ├── train-v1.1.json
│   │   └── dev-v1.1.json
│   │
│   ├── ...
│   │
│   └── zh
│       ├── train-v1.1.json
│       └── dev-v1.1.json
│
└── udp_pos
    │
    ├── ar
    │   ├── ar_padt-ud-train.conllu
    │   ├── ar_padt-ud-dev.conllu
    │   └── ar_padt-ud-test.conllu
    │
    ├── ...
    │
    └── zh
        ├── zh_gsd-ud-train.conllu
        ├── zh_gsd-ud-dev.conllu
        └── zh_gsd-ud-test.conllu
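Once the data is downloaded and preprocessed, it can help to confirm that the files sit where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, the file names are copied from the tree above, and for `udp_pos` the treebank part of the file name (e.g. `padt`, `gsd`) is matched with a wildcard because it differs per language.

```python
from pathlib import Path

# Fixed file names per task, copied from the directory tree above.
FIXED_FILES = {
    "ner": ["train.txt.tmp", "dev.txt.tmp", "test.txt.tmp"],
    "sa": ["train.tsv", "dev.tsv", "test.tsv"],
    "qa": ["train-v1.1.json", "dev-v1.1.json"],
}
UDP_SPLITS = ["train", "dev", "test"]


def missing_files(data_root, task, lang):
    """Return the expected paths (or glob patterns) that are absent for one task/language."""
    lang_dir = Path(data_root) / task / lang
    missing = []
    if task == "udp_pos":
        # Treebank names vary by language, so match e.g. ar_padt-ud-train.conllu by pattern.
        for split in UDP_SPLITS:
            pattern = f"{lang}_*-ud-{split}.conllu"
            if not any(lang_dir.glob(pattern)):
                missing.append(str(lang_dir / pattern))
    else:
        for name in FIXED_FILES[task]:
            if not (lang_dir / name).is_file():
                missing.append(str(lang_dir / name))
    return missing
```

Calling `missing_files("data", "ner", "ar")` returns an empty list when all three NER files for Arabic are in place. Note that some task/language pairs have no dataset at all (marked `---` in the table below), so missing files are expected for those.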
We provide download links to the fine-tuning datasets we used in the table below. We have preprocessed some of them for our experiments.
Important: Please refer to the preprocessing script for each dataset in dataset_preprocessing. The Python scripts all contain docstrings at the top with information on how to use them. For the NER-related bash scripts, we provide instructions in this README.md file. If there is neither a dedicated preprocessing script nor instructions in the respective README.md on how to preprocess the data, the data can be used as downloaded and requires no further preprocessing.
Also: When using any of these datasets in your own experiments, don't forget to cite their publications! Feel free to refer to our paper's references if you aren't sure which publication a dataset belongs to.
Lang | NER | SA | QA | UDP & POS |
---|---|---|---|---|
Arabic | Wikiann-panx | HARD | TyDiQA-GoldP-v1.1 | Universal Dependencies 2.6 (Arabic-PADT) |
English | CoNLL-2003 | IMDb Movie Reviews | SQuAD-v1.1 (Train, Dev) | Universal Dependencies 2.6 (English-EWT) |
Finnish | FiNER | --- | TyDiQA-GoldP-v1.1 | Universal Dependencies 2.6 (Finnish-FTB) |
Indonesian | Wikiann-panx | Indonesian Prosa | TyDiQA-GoldP-v1.1 | Universal Dependencies 2.6 (Indonesian-GSD) |
Japanese | Wikiann-panx | Yahoo Movie Reviews | --- | Universal Dependencies 2.6 (Japanese-GSD) |
Korean | Corpus-morpheme | Naver Sentiment Movie Corpus (NSMC) | KorQuAD 1.0 | Universal Dependencies 2.6 (Korean-GSD) |
Russian | Wikiann-panx | RuReviews | SberQuAD | Universal Dependencies 2.6 (Russian-GSD) |
Turkish | Wikiann-panx | Turkish Movie and Product Reviews | TQuAD-v0.1 | Universal Dependencies 2.6 (Turkish-IMST) |
Chinese | Chinese literature | ChnSentiCorp | Delta Reading Comprehension Dataset (DRCD) | Universal Dependencies 2.6 (Chinese-GSD) |
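
The Universal Dependencies treebanks in the last column are distributed as CoNLL-U files (the `.conllu` files under `udp_pos`). As a rough illustration of that format, here is a minimal sketch of a reader that extracts word forms and universal POS tags. The function name `read_conllu` is hypothetical and this is not the loader used in our experiments; it relies only on the standard CoNLL-U conventions (ten tab-separated columns, `#` comment lines, blank lines between sentences).

```python
def read_conllu(lines):
    """Yield each sentence as a list of (form, upos) pairs from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." carry no tokens.
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "3.1").
        if "-" in token_id or "." in token_id:
            continue
        # Column 2 is FORM, column 4 is UPOS (1-based), i.e. indices 1 and 3.
        sentence.append((cols[1], cols[3]))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `list(read_conllu(open(path, encoding="utf-8")))`.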