BasRizk/DependencyParsingNN

DependencyParsingNN

A partial replication of 'A Fast and Accurate Dependency Parser Using Neural Networks' by Danqi Chen and Chris Manning, along with a few experiments.

prepare_data.py

Converts CoNLL data (train and dev) into features of the parser configuration paired with parser decisions. It takes in a dependency tree and, using shift-reduce parsing, determines the parser actions; each action alters the parser configuration, from which the feature set is extracted.

Parameters:

  • -f data files (default: train.orig.conll dev.orig.conll)
  • -trans transition system (default: std for arc-standard; other option: eager)
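The shift-reduce process above can be sketched as follows. This is a minimal illustration of the arc-standard transitions under assumed names, not the repo's actual API:

```python
# Arc-standard transitions over a configuration of (stack, buffer, arcs).
# All function and variable names here are illustrative.

def shift(stack, buffer, arcs):
    # Move the front of the buffer onto the stack.
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    # Second-from-top of the stack becomes a dependent of the top.
    head, dep = stack[-1], stack.pop(-2)
    arcs.append((head, label, dep))

def right_arc(stack, buffer, arcs, label):
    # Top of the stack becomes a dependent of the second-from-top.
    dep = stack.pop()
    arcs.append((stack[-1], label, dep))

# Parsing "He ate" with a ROOT token: indices 0=ROOT, 1=He, 2=ate.
stack, buffer, arcs = [0], [1, 2], []
shift(stack, buffer, arcs)                # stack=[0,1], buffer=[2]
shift(stack, buffer, arcs)                # stack=[0,1,2], buffer=[]
left_arc(stack, buffer, arcs, "nsubj")    # attach He as dependent of ate
right_arc(stack, buffer, arcs, "root")    # attach ate as dependent of ROOT
print(arcs)  # [(2, 'nsubj', 1), (0, 'root', 2)]
```

Each (configuration, action) pair produced along such a sequence becomes one training example in the converted file.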

Format of generated files

(filename format: WORD_BEFORE_DOT.converted)

prepare_data.py writes the data into a CSV file, WORD_BEFORE_DOT.converted, with 49 columns of information based on the following tokens:

[
  's_1', 's_2', 's_3',
  'b_1', 'b_2', 'b_3',
  'lc_1(s_1)', 'rc_1(s_1)', 'lc_2(s_1)', 'rc_2(s_1)',
  'lc_1(s_2)', 'rc_1(s_2)', 'lc_2(s_2)', 'rc_2(s_2)',
  'lc_1(lc_1(s_1))', 'rc_1(rc_1(s_1))',
  'lc_1(lc_1(s_2))', 'rc_1(rc_1(s_2))'
]

where given a sentence:

  • s_i corresponds to element (token) i on its stack,
  • b_i corresponds to element (token) i on its buffer,
  • lc_i(x) corresponds to ith left child of element x
  • rc_i(x) corresponds to ith right child of element x
  • if any of these tokens is missing, a NULL token is placed instead

The 49 columns consist, accordingly, of:

  • 18 columns, titled exactly as above, containing the tokens' words themselves,
  • 18 columns, titled similarly but prefixed with pos, containing the POS tags of those tokens,
  • 12 columns containing the arc-labels of the selected tokens, excluding the first 6 parent tokens (those on top of the stack and the buffer),
  • 1 column containing the label of the configuration, formatted as TRANSITION_TYPE(ARC_DEPENDENCY).
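The column layout can be reconstructed programmatically. The sketch below is illustrative; the actual header names in the .converted files may differ:

```python
# Illustrative reconstruction of the 49-column layout described above.
tokens = [
    's_1', 's_2', 's_3', 'b_1', 'b_2', 'b_3',
    'lc_1(s_1)', 'rc_1(s_1)', 'lc_2(s_1)', 'rc_2(s_1)',
    'lc_1(s_2)', 'rc_1(s_2)', 'lc_2(s_2)', 'rc_2(s_2)',
    'lc_1(lc_1(s_1))', 'rc_1(rc_1(s_1))',
    'lc_1(lc_1(s_2))', 'rc_1(rc_1(s_2))',
]
word_cols = tokens                                    # 18 word columns
pos_cols = ['pos(%s)' % t for t in tokens]            # 18 POS columns
label_cols = ['label(%s)' % t for t in tokens[6:]]    # 12 arc-label columns
columns = word_cols + pos_cols + label_cols + ['target']
print(len(columns))  # 49
```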

train.py

train.py trains a model on data preprocessed by prepare_data.py and writes a model file, train.model, which includes the vocabulary data.

Parameters:

  • -t training file (default: train.converted)
  • -d validation (dev) file (default: dev.converted)
  • -E word embedding dimension (default: 50)
  • -e number of epochs (default: 10)
  • -u number of hidden units (default: 200)
  • -lr learning rate (default: 0.01)
  • -reg regularization amount (default: 1e-5)
  • -batch mini-batch size (default: 256)
  • -o model filepath to be written (default: train.model)
  • -emb_w_init embedding weights random normal scaling (default: 0.01)
  • -gpu use gpu (default: True)
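The Chen and Manning architecture scores transitions with a single hidden layer using a cube activation over concatenated feature embeddings. The forward pass can be sketched as below, using the defaults above (50-dim embeddings, 200 hidden units); the vocabulary size, class count, and all variable names are assumptions for illustration, not the repo's actual values:

```python
import numpy as np

# Sketch of the Chen & Manning forward pass with the paper's cube activation.
rng = np.random.default_rng(0)
n_inputs, emb_dim, hidden = 48, 50, 200   # 18 word + 18 POS + 12 label features
n_classes = 79                            # placeholder transition-class count
vocab = 5000                              # placeholder vocabulary size

E = rng.normal(scale=0.01, size=(vocab, emb_dim))          # embedding table
W1 = rng.normal(scale=0.01, size=(hidden, n_inputs * emb_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(n_classes, hidden))

ids = rng.integers(0, vocab, size=n_inputs)  # the 48 feature ids of one example
x = E[ids].reshape(-1)                       # concatenated embeddings
h = (W1 @ x + b1) ** 3                       # cube activation
scores = W2 @ h
probs = np.exp(scores - scores.max())        # softmax over transitions
probs /= probs.sum()
print(probs.shape)  # (79,)
```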

parse.py

Given a trained model file (and possibly a vocabulary file), reads in CoNLL data and writes CoNLL data in which fields 7 and 8 contain the dependency tree information.

Parameters:

  • -m model filepath (default: train.model)
  • -i input CoNLL filepath (default: parse.in)
  • -o output CoNLL filepath (default: parse.out)
  • -verbose show progress bar (default: False)
  • -dropb whether to drop blocking elements while transiting (default: True)
  • -trans transition system (default: std for arc-standard; other option: eager)
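In CoNLL-X format, field 7 holds the predicted HEAD index and field 8 the dependency relation (DEPREL). A minimal sketch of writing those fields from predicted arcs, with hypothetical helper names (the repo's actual I/O code may differ):

```python
# Fill CoNLL fields 7 (HEAD) and 8 (DEPREL) from predicted arcs.
# tokens: list of (form, pos); arcs: list of (head, label, dependent).

def write_conll(tokens, arcs):
    head = {dep: (h, label) for h, label, dep in arcs}
    lines = []
    for i, (form, pos) in enumerate(tokens, start=1):
        h, label = head.get(i, (0, 'root'))
        # Fields: ID FORM LEMMA CPOS POS FEATS HEAD DEPREL PHEAD PDEPREL
        lines.append('\t'.join([str(i), form, '_', pos, pos, '_',
                                str(h), label, '_', '_']))
    return '\n'.join(lines)

print(write_conll([('He', 'PRP'), ('ate', 'VBD')],
                  [(2, 'nsubj', 1), (0, 'root', 2)]))
```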

Example

EXEC_FILE=train.py or EXEC_FILE=train-torch.py

python $EXEC_FILE -u $HIDDEN_UNITS -l $LEARNING_RATE -f $MAX_SEQUENCE_LENGTH -b $MINI_BATCH_SIZE -e $NUM_EPOCHS -E $EMBEDDING_FILE -i $DATASET -o $OUT_MODEL_FILE -w $WEIGHTS_INIT -d $DEBUG_FILE

Instructions for Classifying

Parameters:

  • -m model filename (the filename may or may not start with pytorch)
  • -i test data-set relative filepath
  • -o output (inference) desired relative filepath

Example

EXEC_FILE=train.py or EXEC_FILE=train-torch.py

python $EXEC_FILE -m nb.4dim.model -i 4dim.sample.txt -o 4dim.out.txt
