The official implementation of the paper "Syntax-guided Localized Self-attention by Constituency Syntactic Distance" (Findings of EMNLP 2022).
We provide the code for data preprocessing and for reproducing our model's results.
You can directly download this code and install the requirements in your conda environment (our Python version is 3.6):
pip install -r requirements.txt
Then run the setup step:
pip install --editable ./
After that, you are ready to run the shell scripts in the run directory.
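For a quick check that the editable install is importable, the following sketch (assuming the usual fairseq and PyTorch version attributes) prints the installed versions:
# optional sanity check that the editable install succeeded
import fairseq
import torch
print("fairseq", fairseq.__version__, "| torch", torch.__version__)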
The pre-processing scripts for every dataset can be found in the preprocess/ folder and rely on our code in scripts/. Note that scripts with the prepare prefix download, split, and clean a dataset, while scripts with the preprocess prefix binarize the indexed dataset. Third-party software toolkits are downloaded automatically by the scripts.
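If you want to inspect what the binarized output looks like, the following sketch uses fairseq's standard Dictionary and load_indexed_dataset utilities; the data-bin path and file prefix below are illustrative and depend on the directory names written by the preprocessing script:
# Sketch: load a memory-mapped binarized split and print the first source sentence.
from fairseq.data import Dictionary, data_utils

data_dir = "data-bin/iwslt14.tokenized.de-en"   # hypothetical path, set by the preprocess script
src_dict = Dictionary.load(f"{data_dir}/dict.de.txt")
dataset = data_utils.load_indexed_dataset(f"{data_dir}/train.de-en.de", src_dict, dataset_impl="mmap")
print(len(dataset), "source sentences")
print(src_dict.string(dataset[0]))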
In total, six machine translation datasets are covered:
IWSLT14 German to English
IWSLT14 English to German
NC11 German to English
NC11 English to German
ASPEC Chinese to Japanese
WMT14 English to German
For instance, if you want to prepare the iwslt14de2en/en2de dataset, run the corresponding data preparation script:
cd preprocess
bash prepare-iwslt14.sh
To run our syntax-based model, the syntactic distances of the source-language sentences must be generated first; the scripts here can be run directly.
For instance, if you want to generate the syntactic distances for the source language of iwslt14de2en (German), run the corresponding distance preparation script (data preparation must be completed first):
bash distance_iwslt_de2en.sh
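To give a sense of what these scripts produce, here is a toy sketch of constituency syntactic distance for a binarized parse (nested tuples with string leaves): the distance between two adjacent words is the height of the node that separates them, so larger values mean the split happens closer to the root. This is only an illustration; the distance_*.sh scripts compute the actual values (the paper's exact definition and file layout may differ), and the output file name below is hypothetical.
import numpy as np

def syntactic_distance(tree):
    # Returns (leaf tokens, adjacent-pair distances, height of this subtree).
    if isinstance(tree, str):                         # leaf token
        return [tree], [], 0
    left, left_d, lh = syntactic_distance(tree[0])
    right, right_d, rh = syntactic_distance(tree[1])
    height = max(lh, rh) + 1                          # node governing the left/right boundary
    return left + right, left_d + [height] + right_d, height

tokens, dist, _ = syntactic_distance((("the", "cat"), ("sat", "down")))
print(tokens, dist)                                   # ['the', 'cat', 'sat', 'down'] [1, 2, 1]
np.save("0.npy", np.asarray(dist, dtype=np.float32))  # hypothetical per-sentence file name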
Scripts for training each model are provided in the run/ folder. Each script is suffixed with the corresponding task name, including the source and target languages, and covers both the training and testing process. To run a script, enter the run folder first and invoke it with bash. For example,
cd run
bash train_iwslt_de2en.sh
The final BLEU score for the test set will be logged into a .txt file.
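The exact summary line depends on the fairseq generation step, but it typically looks like "Generate test with beam=5: BLEU4 = ...". A small, hedged helper for pulling the last such score out of a log (the log path below is hypothetical; use whatever .txt file your train_*.sh script writes):
import re

def last_bleu(log_path):
    # Return the last "BLEU4 = xx.xx" value found in the log, or None.
    scores = []
    with open(log_path) as f:
        for line in f:
            m = re.search(r"BLEU4?\s*=\s*([\d.]+)", line)
            if m:
                scores.append(float(m.group(1)))
    return scores[-1] if scores else None

print(last_bleu("log/iwslt_de2en_test.txt"))          # hypothetical log file name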
- data-bin/ : Contains the binarized datasets for the fairseq toolkit to read in. Dataset implementations are set to memory-mapped.
- distance_prior/ : Contains the calculated syntactic distances for every task. Only the source-language sentences are processed; the syntactic distance is stored as a numpy file for each sentence pair.
- run/ : Shell scripts for training, validation, and test.
- log/ : Directory for storing the running results, including the tensorboard log directory, saved checkpoints, and training/test logs.
- preprocess/ : Shell scripts for dataset preparation and preprocessing.
- fairseq/ : Model definition folder. The crucial files are (see the attention sketch after this list):
  - models/distance_transformer.py : defines the overall architecture of our Transformer.
  - modules/transformer_layer_distance.py : defines the encoder layer and decoder layer, respectively.
  - modules/multihead_attention_distance.py : defines the multi-head attention guided by constituency-based syntactic distance.
  - models/transformer.py : Transformer baseline from Vaswani et al. (NIPS'17).
  - modules/transformer_layer.py : Transformer baseline encoder and decoder layers.
  - modules/multihead_attention.py : Transformer baseline multi-head self-attention.
  - criterions/label_smoothed_cross_entropy.py : label-smoothed cross-entropy loss criterion.
- scripts/, tests/, helpers/, docs/, build/ : Other auxiliary folders for compilation and running.
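For intuition about how the syntactic distances feed into the attention module, here is a conceptual sketch of one way a distance vector could localize self-attention with a hard constituent mask. It is not the paper's exact formulation; the real mechanism lives in modules/multihead_attention_distance.py and may, for example, use soft weighting instead of a hard mask. The threshold and the masking rule below are assumptions for illustration only.
import torch

def constituent_local_mask(dist, threshold):
    # dist: (n-1,) syntactic distances between adjacent tokens.
    # Token i may attend to token j only if every boundary between them has
    # distance <= threshold, i.e. both tokens lie in the same low-level constituent.
    n = dist.numel() + 1
    keep = torch.eye(n, dtype=torch.bool)
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i:j].max() <= threshold:
                keep[i, j] = keep[j, i] = True
    return keep

dist = torch.tensor([1.0, 2.0, 1.0])                  # e.g. from the distance sketch above
mask = constituent_local_mask(dist, threshold=1.0)
# additive bias for scaled dot-product attention: 0 where allowed, -inf elsewhere
attn_bias = torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))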