Fine-Tuning Data

Suggested directory structure

data/
├── dataset_preprocessing
│    └── ...
├── ner
│    │
│    ├── ar
│    │    ├── train.txt.tmp
│    │    ├── dev.txt.tmp
│    │    └── test.txt.tmp
│    │
│    ├── ...
│    │
│    └── zh
│         ├── train.txt.tmp
│         ├── dev.txt.tmp
│         └── test.txt.tmp
├── sa
│    │
│    ├── ar
│    │    ├── train.tsv
│    │    ├── dev.tsv
│    │    └── test.tsv
│    │
│    ├── ...
│    │
│    └── zh
│         ├── train.tsv
│         ├── dev.tsv
│         └── test.tsv
├── qa
│    │
│    ├── ar
│    │    ├── train-v1.1.json
│    │    └── dev-v1.1.json
│    │
│    ├── ...
│    │
│    └── zh
│         ├── train-v1.1.json
│         └── dev-v1.1.json
│
└── udp_pos
     │
     ├── ar
     │    ├── ar_padt-ud-train.conllu
     │    ├── ar_padt-ud-dev.conllu
     │    └── ar_padt-ud-test.conllu
     │
     ├── ...
     │
     └── zh
          ├── zh_gsd-ud-train.conllu
          ├── zh_gsd-ud-dev.conllu
          └── zh_gsd-ud-test.conllu
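
To check that a local copy matches the suggested layout, a small helper along the following lines may be useful. This is only a sketch: the function name `missing_files` and the `data_root` argument are hypothetical and not part of this repository, and it assumes the UD file names follow the `<lang>_<treebank>-ud-<split>.conllu` pattern seen in the tree above.

```python
from pathlib import Path

# File names per task, taken from the suggested directory structure.
FIXED_FILES = {
    "ner": ["train.txt.tmp", "dev.txt.tmp", "test.txt.tmp"],
    "sa": ["train.tsv", "dev.tsv", "test.tsv"],
    "qa": ["train-v1.1.json", "dev-v1.1.json"],
}
UD_SPLITS = ["train", "dev", "test"]


def missing_files(data_root, task, lang):
    """Return the expected files (or patterns) that are missing for one task/language pair."""
    lang_dir = Path(data_root) / task / lang
    if task == "udp_pos":
        # Treebank names differ per language (e.g. ar_padt, zh_gsd),
        # so match by pattern instead of by exact file name (assumption).
        patterns = [f"{lang}_*-ud-{split}.conllu" for split in UD_SPLITS]
        return [p for p in patterns if not any(lang_dir.glob(p))]
    return [name for name in FIXED_FILES[task] if not (lang_dir / name).is_file()]
```

For example, `missing_files("data", "qa", "ar")` returns an empty list when both `train-v1.1.json` and `dev-v1.1.json` are present under `data/qa/ar`.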

Dataset download links

The table below provides download links to the fine-tuning datasets we used. We preprocessed some of them for our experiments.

Important: Please refer to the preprocessing script for each dataset in dataset_preprocessing. The Python scripts all contain docstrings at the top that explain how to use them. For the NER-related bash scripts, instructions are provided in the respective README.md file. If a dataset has neither a dedicated preprocessing script nor preprocessing instructions in the respective README.md, the data can be used as downloaded and requires no further preprocessing.

Also: When using any of these datasets in your own experiments, don't forget to cite their publications! Feel free to refer to our paper's references if you aren't sure which publication a dataset belongs to.

 

| Lang | NER | SA | QA | UDP & POS |
| --- | --- | --- | --- | --- |
| Arabic | Wikiann-panx | HARD | TyDiQA-GoldP-v1.1 | Universal Dependencies 2.6 (Arabic-PADT) |
| English | CoNLL-2003 | IMDb Movie Reviews | SQuAD-v1.1 (Train, Dev) | Universal Dependencies 2.6 (English-EWT) |
| Finnish | FiNER | --- | TyDiQA-GoldP-v1.1 | Universal Dependencies 2.6 (Finnish-FTB) |
| Indonesian | Wikiann-panx | Indonesian Prosa | TyDiQA-GoldP-v1.1 | Universal Dependencies 2.6 (Indonesian-GSD) |
| Japanese | Wikiann-panx | Yahoo Movie Reviews | --- | Universal Dependencies 2.6 (Japanese-GSD) |
| Korean | Corpus-morpheme | Naver Sentiment Movie Corpus (NSMC) | KorQuAD 1.0 | Universal Dependencies 2.6 (Korean-GSD) |
| Russian | Wikiann-panx | RuReviews | SberQuAD | Universal Dependencies 2.6 (Russian-GSD) |
| Turkish | Wikiann-panx | Turkish Movie and Product Reviews | TQuAD-v0.1 | Universal Dependencies 2.6 (Turkish-IMST) |
| Chinese | Chinese literature | ChnSentiCorp | Delta Reading Comprehension Dataset (DRCD) | Universal Dependencies 2.6 (Chinese-GSD) |