Latest commit: c05c50e by zhurunchuan, Dec 28, 2024
To preprocess these datasets yourself (all except ARC), run the commands below.

## Download data
### TriviaQA
cd dataset/download_dataset
wget https://nlp.cs.washington.edu/triviaqa/data/triviaqa-unfiltered.tar.gz
tar -zxvf triviaqa-unfiltered.tar.gz
rm triviaqa-unfiltered.tar.gz

### NQ
git clone https://github.com/google-research-datasets/natural-questions.git

### MMLU
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar -xvf data.tar
mv data mmlu
rm data.tar
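
Once all three downloads have finished, a quick check that the expected directories exist can save a failed preprocessing run. The sketch below is a minimal example, not part of the repository: `missing_datasets` is a hypothetical helper, and the directory name `triviaqa-unfiltered` is an assumption about what the TriviaQA archive extracts to (`natural-questions` and `mmlu` follow from the `git clone` and `mv` commands above).

```python
from pathlib import Path

# Assumed directory names inside dataset/download_dataset.
# "triviaqa-unfiltered" is a guess at the archive's top-level folder;
# the other two follow from the git clone and mv commands.
EXPECTED_DIRS = ["triviaqa-unfiltered", "natural-questions", "mmlu"]


def missing_datasets(root, names=EXPECTED_DIRS):
    """Return the entries of `names` that are not directories under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]
```

Calling `missing_datasets(".")` from inside `dataset/download_dataset` returns an empty list when every expected directory is present, and the names of the missing ones otherwise.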

## Preprocess data
The commands below assume you are still in dataset/download_dataset from the download step, so the first one returns to the repository root.

cd ../..
PYTHONPATH=. python data_process/preprocess/proc_triviaqa.py
PYTHONPATH=. python data_process/preprocess/proc_nq.py
PYTHONPATH=. python data_process/preprocess/proc_mmlu.py
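
If you would rather launch the three scripts from Python, for example so that the run stops at the first failure, the following is one possible sketch. It only mirrors the shell commands above: `build_env` and `run_all` are hypothetical helpers, not part of the repository, and the code assumes it is called from the repository root.

```python
import os
import subprocess
import sys

# Script paths taken from the commands above, relative to the repository root.
SCRIPTS = [
    "data_process/preprocess/proc_triviaqa.py",
    "data_process/preprocess/proc_nq.py",
    "data_process/preprocess/proc_mmlu.py",
]


def build_env():
    """Copy the current environment and set PYTHONPATH=. as in the shell commands."""
    env = dict(os.environ)
    env["PYTHONPATH"] = "."
    return env


def run_all(repo_root=".", scripts=SCRIPTS):
    """Run each script in order from repo_root; check=True raises on the first failure."""
    env = build_env()
    for script in scripts:
        subprocess.run([sys.executable, script], cwd=repo_root, env=env, check=True)
```

Calling `run_all()` from the repository root runs the scripts in the same order as the shell commands. Using `sys.executable` keeps the scripts on the same interpreter that launched the wrapper.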