README to keep notes about attempts to run the TSAR code without changing the code base too much.

Everything is done with Python 3.10.

The original paper used the 2020 version of the IBM parser (stack-Transformer), but running that version on modern CUDA versions wasn't easy. So I just decided to run the latest version of the IBM parser (in the main branch).
- `transition-neural-parser==0.5.4` (comes with `torch==1.13`)
- `torch-scatter`
- `pytorch==1.11.0`
  - torch 1.9 prebuilt packages do not support recent CUDA versions
  - building torch from the source code is too much hassle
  - ended up running 1.11 with CUDA 11.3
- `transformers==4.8.1`
- `datasets==2.13.0`
- `dgl-cu111==1.1.0+cu113`
- `tqdm==4.65.0`
- `spacy==3.2.4`
- (`accelerate==0.19.0`) - Multi-CPU/GPU/TPU training
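Collected as a pip requirements sketch, for convenience only: the exact wheel names for the CUDA-specific torch/dgl builds (and the extra index URLs they need) depend on your machine, and given the conflicts noted below, the parser pin likely belongs in its own environment.

```text
# parsing environment
transition-neural-parser==0.5.4   # pulls in torch==1.13

# training/evaluation environment
torch==1.11.0+cu113
torch-scatter
transformers==4.8.1
datasets==2.13.0
dgl-cu111==1.1.0+cu113
tqdm==4.65.0
spacy==3.2.4
accelerate==0.19.0   # optional: Multi-CPU/GPU/TPU training
```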
Dependency conflicts may arise when running the full pipeline in one conda/venv environment. If this occurs, we recommend running the training (requiring all packages except spacy) and evaluation (requiring spacy) in separate environments.
You can first download the datasets and some scripts here. You only need to unzip the `data.zip`.

Data comes in two directories: `wikievents` and `rams`. I think `wikievents` is not relevant.
The RAMS data included in the zip file is actually identical to RAMS 1.0c from the official website, except for the `meta.json` and `result.json` files.
(TODO: RAMS 1.0c contains scorer scripts that fix a bug in the previous version. We might need to check whether TSAR is using those scorer scripts.)
- `meta.json`: The wikievents data in the zip file also has a `meta.json` file, and there's a `make_meta.py` script that takes the wikievents dataset and generates that `meta.json` file. For RAMS, there's a `meta.json` but no `make_meta` script, so I re-created `make_meta.py` for RAMS based on the contents of the `meta.json` file.
- `result.json`: Not sure what it is. Searching the original codebase (GitHub repo and local clone) didn't give any clue.
As mentioned above, the original parser wasn't easy to install, so I used the latest IBM parser to generate PENMAN + JAMR formatted `.amr.txt` files.
We also directly provide the data (used in the original paper) here. This way, you can just skip the AMR and DGL graph preprocessing steps. If you want to run this model with GL events or remake any of the DGL graphs based on different edge clusters, you will need to run the whole preprocessing pipeline starting from the AMR `.txt` or `.pkl` files. If you create or change the edge clusters, be sure to make a corresponding change in lines 172 and 497 of `model.py` for the base and the large model, respectively (the 13 should be replaced with the number of unique edge clusters you are using).
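As a concrete illustration of where that number comes from (the cluster labels below are made up, not the paper's actual clusters), the value you substitute for 13 is just the count of unique cluster ids in your edge-cluster mapping:

```python
# hypothetical mapping from AMR edge labels to cluster ids; the actual
# clusters used by TSAR differ -- the point is only how the count is derived
EDGE_CLUSTERS = {
    ":ARG0": 0, ":ARG1": 0, ":ARG2": 0,
    ":time": 1, ":location": 2, ":mod": 3,
}

# this count is what must replace the hard-coded 13 at model.py
# lines 172 (base) and 497 (large)
num_edge_clusters = len(set(EDGE_CLUSTERS.values()))
print(num_edge_clusters)  # 4 for this toy mapping
```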
The data in the link was supposedly saved with torch 1.9 + dgl 0.6, which doesn't load with torch 1.13 + dgl 1.1.0. So I had to slightly edit the amr2dgl script to load the newly generated `.amr.txt` files.
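The edited amr2dgl script isn't reproduced here, but reading the JAMR-style metadata out of the new `.amr.txt` files can be sketched as follows. This assumes the usual tab-separated `# ::node` / `# ::edge` comment lines that the JAMR format uses; the real script's field handling may differ.

```python
def parse_jamr_graph(lines):
    """Extract (nodes, edges) from JAMR-style AMR metadata lines.

    Assumes tab-separated fields:
      # ::node <id> <concept> [<alignment>]
      # ::edge <src_concept> <relation> <tgt_concept> <src_id> <tgt_id>
    """
    nodes, edges = {}, []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "# ::node":
            nodes[fields[1]] = fields[2]
        elif fields[0] == "# ::edge":
            _, _src, relation, _tgt, src_id, tgt_id = fields[:6]
            edges.append((src_id, relation, tgt_id))
    return nodes, edges
```

The resulting `(src_id, relation, tgt_id)` triples can then be handed to the current dgl version's graph constructors, instead of unpickling graphs that were saved by dgl 0.6.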
The training scripts are provided:

```shell
bash run_rams_base.sh <data-directory>
bash run_rams_large.sh <data-directory>
bash run_wikievents_base.sh <data-directory>
bash run_wikievents_large.sh <data-directory>
```
If you want to train the model on more than one device, run the following command and follow the prompts according to your setup (the number of gradient accumulation steps is already set in the bash scripts):

```shell
accelerate config
```

Then, run the training scripts.
Running `run_rams_base.sh` originally hit an out-of-index error at a place where the original authors acknowledge a possible problem:

https://github.com/RunxinXu/TSAR/blob/9806edfb5a7f90b9ae85ff06f435c20e4222be59/code/run.py#L443-L444
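I haven't confirmed the root cause. Purely as an illustrative direction (not the actual fix in run.py), out-of-index errors at span lookups can be sidestepped by clamping each span to the encoded sequence length before indexing:

```python
def clamp_span(start, end, seq_len):
    """Clamp a (start, end) token span into [0, seq_len - 1].

    Hypothetical guard illustrating the kind of bounds check that would
    avoid an out-of-index error; the real fix for run.py may differ.
    """
    start = max(0, min(start, seq_len - 1))
    end = max(start, min(end, seq_len - 1))
    return start, end
```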