This folder contains recipes for self-training on pseudo phone transcripts and decoding into phones or words with kaldi.
To start, download and install kaldi following its instructions, and place this
folder in `path/to/kaldi/egs`.
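
For example, the setup might look like the following; the clone location and the link name are placeholders:

```bash
# Fetch kaldi and build it per its INSTALL instructions.
git clone https://github.com/kaldi-asr/kaldi.git
# ... follow kaldi/INSTALL to build tools/ and src/ ...

# Place (or symlink) this folder under kaldi/egs.
ln -s "$(pwd)/this_folder" kaldi/egs/this_folder
```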
Assuming the following has been prepared:
- `w2v_dir`: contains features `{train,valid}.{npy,lengths}`, real transcripts `{train,valid}.${label}`, and dict `dict.${label}.txt`
- `lab_dir`: contains pseudo labels `{train,valid}.txt`
- `arpa_lm`: Arpa-format n-gram phone LM for decoding
- `arpa_lm_bin`: Arpa-format n-gram phone LM for unsupervised model selection, to be used with KenLM
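
Concretely, with `phn` as an illustrative value of `${label}`, the expected files would be:

```bash
# Illustrative layout (label=phn):
ls $w2v_dir  # train.npy train.lengths valid.npy valid.lengths train.phn valid.phn dict.phn.txt
ls $lab_dir  # train.txt valid.txt
```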
Set these variables in `train.sh`, as well as `out_dir`, the output directory,
and then run it.
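
For example, the variable block in `train.sh` might be edited as follows before running (all paths are illustrative):

```bash
# In train.sh:
w2v_dir=/path/to/w2v_dir          # features, real transcripts, and dict
lab_dir=/path/to/lab_dir          # pseudo labels {train,valid}.txt
arpa_lm=/path/to/phone_lm.arpa    # phone LM for decoding
arpa_lm_bin=/path/to/phone_lm.bin # KenLM phone LM for model selection
out_dir=./out                     # output directory

# then, from this folder:
bash train.sh
```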
The output will be:
```
==== WER w.r.t. real transcript (select based on unsupervised metric)
INFO:root:./out/exp/mono/decode_valid/scoring/14.0.0.tra.txt: score 0.9178 wer 28.71% lm_ppl 24.4500 gt_wer 25.57%
INFO:root:./out/exp/tri1/decode_valid/scoring/17.1.0.tra.txt: score 0.9257 wer 26.99% lm_ppl 30.8494 gt_wer 21.90%
INFO:root:./out/exp/tri2b/decode_valid/scoring/8.0.0.tra.txt: score 0.7506 wer 23.15% lm_ppl 25.5944 gt_wer 15.78%
```
where `wer` is the word error rate with respect to the pseudo labels, `gt_wer`
the error rate with respect to the ground truth labels, `lm_ppl` the language
model perplexity of the HMM-predicted transcripts, and `score` is the
unsupervised metric for model selection. We choose the model and LM parameter
of the entry with the lowest score. In the example above, it is `tri2b`,
`8.0.0`.
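
Since selection just picks the entry with the lowest `score`, it can also be done with a one-liner; a minimal sketch, assuming the log lines above have been saved to a file (here named `selection.log`, a placeholder):

```bash
# The score is the 3rd whitespace-separated field; sort numerically
# and keep the best (lowest-score) line.
grep ' score ' selection.log | sort -k3,3 -g | head -n 1
```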
In `decode_phone.sh`, set `out_dir` to the same value used in `train.sh`, and
set `dec_exp` and `dec_lmparam` to the selected model and LM parameter (e.g.
`tri2b` and `8.0.0` in the above example). `dec_script` needs to be set
according to `dec_exp`: for mono/tri1/tri2b, use `decode.sh`; for tri3b, use
`decode_fmllr.sh`.

The output will be saved at `out_dir/dec_data`.
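
Continuing the example, the relevant lines in `decode_phone.sh` would look like this (the exact form of the assignments may differ in the script):

```bash
# In decode_phone.sh:
out_dir=./out          # same as in train.sh
dec_exp=tri2b          # selected model
dec_lmparam=8.0.0      # selected LM parameter
dec_script=decode.sh   # decode_fmllr.sh if dec_exp were tri3b
```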
`decode_word_step1.sh` prepares WFSTs for word decoding. Besides the variables
mentioned above, set
- `wrd_arpa_lm`: Arpa-format n-gram word LM for decoding
- `wrd_arpa_lm_bin`: Arpa-format n-gram word LM for unsupervised model selection
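
For example (paths are illustrative):

```bash
# In decode_word_step1.sh:
wrd_arpa_lm=/path/to/word_lm.arpa     # word LM for decoding
wrd_arpa_lm_bin=/path/to/word_lm.bin  # KenLM word LM for model selection
```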
`decode_word_step1.sh` decodes the `train` and `valid` splits into words and
runs unsupervised model selection using the `valid` split. The output is like:
```
INFO:root:./out/exp/tri2b/decodeword_valid/scoring/17.0.0.tra.txt: score 1.8693 wer 24.97% lm_ppl 1785.5333 gt_wer 31.45%
```
After determining the LM parameter (`17.0.0` in the example above), set it in
`decode_word_step2.sh` and run it. The output will be saved at
`out_dir/dec_data_word`.
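
A minimal sketch of that last step, assuming `decode_word_step2.sh` exposes the parameter through a `dec_lmparam`-style variable (the variable name is an assumption):

```bash
# In decode_word_step2.sh (variable name is an assumption):
dec_lmparam=17.0.0

bash decode_word_step2.sh
# word-level decodes land in $out_dir/dec_data_word
```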