wammar-utils

this repository is designed to be included as a submodule in other repositories

description of utilities:

create-vocab.py

a python script that extracts the types (distinct tokens) in a text file and gives each one an integer id.
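
The idea can be sketched as follows. This is a minimal illustration, not the script itself: the function name `build_vocab`, whitespace tokenization, and ids starting at 0 in order of first appearance are all assumptions made for the example.

```python
def build_vocab(lines):
    """Map each distinct whitespace-separated token (type) to an integer id.

    Assumptions for this sketch: tokens are split on whitespace, and ids are
    assigned in order of first appearance, starting at 0.
    """
    vocab = {}
    for line in lines:
        for token in line.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


vocab = build_vocab(["the cat sat", "the dog sat"])
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
```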

encode-corpus.py

a python script that replaces each type in the input file with a unique integer id in the target file. a second output file contains the id:type mappings.
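
A rough sketch of this kind of encoding is shown below. The function name `encode_corpus`, whitespace tokenization, and the id numbering are hypothetical choices for illustration; the actual script works on files and may differ in its details.

```python
def encode_corpus(lines):
    """Replace each token with an integer id and return the id:type mapping.

    Assumptions for this sketch: whitespace tokenization, ids assigned from 0
    in order of first appearance.
    """
    type_to_id = {}
    encoded = []
    for line in lines:
        ids = []
        for token in line.split():
            if token not in type_to_id:
                type_to_id[token] = len(type_to_id)
            ids.append(str(type_to_id[token]))
        encoded.append(" ".join(ids))
    # Invert the dictionary to get the id:type mapping described above.
    id_to_type = {i: t for t, i in type_to_id.items()}
    return encoded, id_to_type


encoded, id_to_type = encode_corpus(["the cat sat", "the dog sat"])
print(encoded)      # ['0 1 2', '0 3 2']
print(id_to_type)   # {0: 'the', 1: 'cat', 2: 'sat', 3: 'dog'}
```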

decode-corpus.py

inverse of encode-corpus.py: maps each integer id back to its type.
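
Decoding can be sketched like this, assuming the id:type mapping is available as a dictionary keyed by integer id. The function name `decode_corpus` and the in-memory representation are assumptions for the example only.

```python
def decode_corpus(encoded_lines, id_to_type):
    """Replace each integer id in the encoded lines with its original type.

    Assumption for this sketch: ids are whitespace-separated decimal integers
    and every id is present in id_to_type.
    """
    decoded = []
    for line in encoded_lines:
        tokens = [id_to_type[int(field)] for field in line.split()]
        decoded.append(" ".join(tokens))
    return decoded


id_to_type = {0: "the", 1: "cat", 2: "sat", 3: "dog"}
print(decode_corpus(["0 1 2", "0 3 2"], id_to_type))
# ['the cat sat', 'the dog sat']
```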

filter-long-sent-pairs.py

a python script that filters out parallel sentence pairs based on their number of tokens (per the script name, pairs that are too long).
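
One plausible form of such a filter is sketched below. The function name `filter_long_pairs`, the `max_tokens` parameter, and the rule that a pair is dropped when either side exceeds the limit are assumptions; the script's actual criterion is not specified here.

```python
def filter_long_pairs(src_lines, tgt_lines, max_tokens):
    """Keep only sentence pairs where both sides have at most max_tokens tokens.

    Assumptions for this sketch: whitespace tokenization, and a pair is dropped
    if either the source or the target sentence is too long.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
            kept.append((src, tgt))
    return kept


src = ["a b c", "a b c d e f", "x y"]
tgt = ["1 2", "1 2 3", "1 2 3 4 5 6 7"]
print(filter_long_pairs(src, tgt, max_tokens=5))
# [('a b c', '1 2')]
```

Filtering both files together keeps the two sides aligned line by line, which is the property a parallel corpus depends on.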

split-parallel-corpus.py

a python script that splits a parallel corpus into train/dev/test sets.
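
A minimal sketch of such a split follows. The function name `split_parallel`, the `dev_size`, `test_size`, and `seed` parameters, and the use of a seeded shuffle are all assumptions made for illustration; the script may select the sets differently.

```python
import random


def split_parallel(pairs, dev_size, test_size, seed=0):
    """Split aligned (source, target) pairs into train, dev and test sets.

    Assumptions for this sketch: pairs are shuffled with a fixed seed, then the
    first dev_size pairs go to dev, the next test_size to test, the rest to train.
    """
    pairs = list(pairs)
    if dev_size + test_size > len(pairs):
        raise ValueError("dev_size + test_size exceeds the number of sentence pairs")
    # Shuffling whole pairs keeps each source sentence with its translation.
    random.Random(seed).shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test


pairs = [(f"src {i}", f"tgt {i}") for i in range(10)]
train, dev, test = split_parallel(pairs, dev_size=2, test_size=3)
print(len(train), len(dev), len(test))  # 5 2 3
```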

american-english.txt, british-english.txt

American vs. British English vocabulary collected from http://www.tysto.com/uk-us-spelling-list.html
