This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

WIP: huggingface tokenizer and Neural LM training pipeline. #139

Open · wants to merge 28 commits into base: master

Conversation

glynpu
Contributor

@glynpu glynpu commented Mar 25, 2021

Fixes #132
2021-04-23: use an AM model trained with the full LibriSpeech data

| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
|---|---|---|---|---|---|---|
| baseline: no rescore (Piotr's AM with full LibriSpeech) | * | * | * | * | 4.71 | 9.66 |
| 4-gram LM n-best rescore (Piotr's AM with full LibriSpeech) | * | 100 | * | * | 4.38 | 9.18 |
| 4-gram LM lattice rescore | * | * | * | * | 4.18 | 8.54 |
| transformer LM, 16 layers (model size: 72M), max_norm=5 | 9 | 100 | 45.02 | 115.24 | 3.61 | 8.29 |

2021-04-21
max_norm=5 is better than max_norm=0.25. The training is ongoing.
A 16-layer model trained with the Noam optimizer got a better WER than the previous 8-layer transformers.
However, max_norm=0.25 in clip_grad_norm_ seems too small, which may explain why epoch 19 obtains only a small gain compared to epoch 3.
Now max_norm=5 is used, following the ESPnet transformer LM, and results are coming soon.
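For reference, a minimal sketch of where this threshold sits in a training step (the tiny model and optimizer below are placeholders, not the PR's actual LM):

import torch

# Placeholder model/optimizer; only the clipping call matters here.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).sum()
optimizer.zero_grad()
loss.backward()
# max_norm=5 only rescales the gradient when its global norm exceeds 5,
# whereas max_norm=0.25 would shrink nearly every update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()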

| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
|---|---|---|---|---|---|---|
| baseline: no rescore (from fangjun) | * | * | * | * | 6.80 | 18.03 |
| 4-gram LM (from fangjun) | * | 100 | * | * | 6.28 | 16.94 |
| transformer LM, 8 layers (model size: 42M) | 10 | 100 | 55.04 | 148.07 | 5.66 | 16.09 |
| | 30 | 100 | 53.16 | 141.77 | 5.60 | 16.09 |
| transformer LM, 16 layers (model size: 72M) | 2 | 100 | 51.86 | 139.35 | 5.51 | 16.00 |
| | 3 | 100 | 51.20 | 135.37 | 5.47 | 15.90 |
| | 19 | 100 | 48.58 | 126.71 | 5.37 | 15.77 |
| transformer LM, 16 layers (model size: 72M), max_norm=5 | 1 | 100 | 46.94 | 121.41 | 5.39 | 15.73 |
| | 4 | 100 | 45.88 | 118 | 5.27 | 15.73 |

--------- previous comments------
This commit is mainly about the huggingface tokenizer and a draft transformer/RNN-based LM training pipeline.

They are implemented mainly by referencing the following tutorials: tokenizer and neural LM (the latter is also referenced by ESPnet).

Current (tokenizer + transformer LM) experiments show that the PPL decreases from around 1000 to around 110 within 10 epochs, as shown by the following screenshots.

[Screenshots of the training/validation PPL curves omitted.]

TODOs:
1. Extend this training pipeline with advanced utilities, such as a multi-thread prefetching DataLoader with a proper collate_fn, and a TensorBoard summary writer.
2. Add evaluation/test parts.
3. Run experiments with the full LibriSpeech data. Currently only 50 MB of training text is used out of around 4 GB.
4. Find a proper way to integrate the NNLM into the previous ASR decoding pipeline, i.e. the aim of issue #132.
5. Try other network structures.

@@ -0,0 +1,154 @@
import math
Contributor

We should get in the habit of acknowledging where we got files from, if they were copied from elsewhere...

Contributor Author

No problem. I will add a reference to every file. For now, all references are collected together in run.sh.

@danpovey
Contributor

These perplexities, are they per word or per token?

@glynpu
Contributor Author

glynpu commented Mar 28, 2021

These perplexities, are they per word or per token?

per token.

lm_train=data/lm_train/
full_text=$lm_train/librispeech_train_960_text
tokenizer=$lm_train/tokenizer-librispeech_train_960.json
if [ $stage -eq 1 ]; then
Collaborator

Should it be $stage -le 1?
And also for the following if statements.

Contributor Author

@glynpu glynpu Mar 28, 2021

Yes, "-le" is better. "-eq" is used temporarily because it is easier for me to debug stage by stage.

import os
import shutil
from pathlib import Path
from tokenizers import Tokenizer
Collaborator

Could you add some documentation describing how the environment is set up?
I assume that you have run pip install tokenizers beforehand.

Contributor Author

No problem. A README.md will be added.

# Save the model if the validation loss is the best we've seen so far.
if not best_val_loss or val_loss < best_val_loss:
    with open(args.save, 'wb') as f:
        torch.save(model, f)
Collaborator

From https://pytorch.org/tutorials/beginner/saving_loading_models.html

The disadvantage of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model is saved.

Could you save only the state dict of the model?

Contributor Author

Solved as follows:

def save_checkpoint(filename: Pathlike,
                    model: torch.nn.Module,
                    info: Info = None) -> None:
    if not os.path.exists(os.path.dirname(filename)):
        Path(os.path.dirname(filename)).mkdir(parents=True, exist_ok=True)
    logging.info(f'Save checkpoint to {filename}')
    checkpoint = {
        'state_dict': model.state_dict(),
    }
    if info is not None:
        checkpoint.update(info)

    torch.save(checkpoint, filename)
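A matching load helper (a hypothetical sketch, reusing the Pathlike type and imports from above; not part of this commit) would then restore only the state dict:

def load_checkpoint(filename: Pathlike,
                    model: torch.nn.Module) -> dict:
    logging.info(f'Load checkpoint from {filename}')
    checkpoint = torch.load(filename, map_location='cpu')
    # Restore only the parameters; the model object is constructed by the caller.
    model.load_state_dict(checkpoint['state_dict'])
    # Return any remaining bookkeeping info (epoch, loss, ...).
    return {k: v for k, v in checkpoint.items() if k != 'state_dict'}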

epoch, batch_idx,
len(train_data) // batch_size, lr,
elapsed * 1000 / args.log_interval, cur_loss,
math.exp(cur_loss)))
Collaborator

These perplexities, are they per word or per token?

@danpovey
The perplexities are computed as exp(NLL), and the modelling units are tokens, so the PPL is computed with respect to tokens.
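To make the distinction concrete, here is a small sketch; the numbers are purely illustrative, not from these experiments:

import math

def perplexity(total_nll: float, num_units: int) -> float:
    # PPL = exp(average negative log-likelihood per unit).
    return math.exp(total_nll / num_units)

total_nll = 1.25e6    # summed NLL over a corpus (illustrative)
num_tokens = 3.2e5    # BPE tokens: more units than words
num_words = 2.4e5
token_ppl = perplexity(total_nll, num_tokens)  # ~49, the smaller number
word_ppl = perplexity(total_nll, num_words)    # ~183, the larger number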

@csukuangfj
Collaborator

the PPL decreases from around 1000 to around 110 within 10 epochs,

@glynpu Do you know what the normal PPL for the LibriSpeech corpus is, in terms of tokens?

@danpovey
Contributor

danpovey commented Mar 28, 2021 via email

@danpovey
Contributor

danpovey commented Mar 28, 2021 via email

@glynpu
Contributor Author

glynpu commented Mar 28, 2021

As shown by the RNN-LM experiment in Kaldi with LibriSpeech data:

# rnnlm/train_rnnlm.sh: train/dev perplexity was 109.2 / 110.7.

I am studying its configuration and hope to get a comparable PPL with the same data this week.

num_utts_total=$(wc -l <$full_tokens )
num_valid_test=$(($num_utts_total/${valid_test_fraction}))
set +x
shuf -n $num_valid_test $full_tokens > $valid_test_tokens
Collaborator

Shall we fix the seed for shuf so that the split is reproducible?
I think a Python script can do this task equally well and is more maintainable.

Contributor Author

Reproducibility is important. Maybe the data separation method of the Kaldi RNNLM recipe can be used in the following experiments:
gunzip -c $text | cut -d ' ' -f2- | awk -v text_dir=$text_dir '{if(NR%2000 == 0) { print >text_dir"/dev.txt"; } else {print;}}' >$text_dir/librispeech.txt
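For example, a small Python sketch of the same every-Nth-line split (output file names are placeholders); it is deterministic, so no seed is needed:

# Every 2000th line goes to dev, the rest to train, mirroring the Kaldi recipe.
def split_text(full_text: str, train_out: str, dev_out: str, nth: int = 2000):
    with open(full_text) as fin, \
         open(train_out, 'w') as ftrain, \
         open(dev_out, 'w') as fdev:
        for lineno, line in enumerate(fin, start=1):
            (fdev if lineno % nth == 0 else ftrain).write(line)

split_text('data/lm_train/librispeech_train_960_text',
           'data/lm_train/train.txt', 'data/lm_train/dev.txt')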

Collaborator

+1 for dropping bash/perl entirely for these sorts of tasks in snowfall.

@danpovey
Contributor

danpovey commented Mar 29, 2021 via email

# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
Collaborator

That could be overcome by a data sampling and batching strategy where you iterate over the training text with overlapping windows (50% overlap being the obvious setting, but for larger data a smaller value like 20% would probably work just as well and train faster).
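A minimal sketch of that overlapping-window sampling (the window length and overlap are assumptions, not values from this PR):

from typing import List

def overlapping_windows(token_ids: List[int],
                        window: int = 128,
                        overlap: float = 0.2) -> List[List[int]]:
    # Consecutive chunks share `overlap * window` tokens, so dependencies
    # that cross a chunk boundary (e.g. 'g' on 'f') are still seen in training.
    stride = max(1, int(window * (1.0 - overlap)))
    return [token_ids[i:i + window]
            for i in range(0, max(1, len(token_ids) - window + 1), stride)]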

Contributor

So is the data treated as one long sequence, rather than a bunch of independent sentences?
I would have thought for ASR applications, the independent-sentences approach might make more sense.

Contributor Author

@glynpu glynpu Mar 30, 2021

No, the training text is not treated as one long sequence. I have modified the data preparation method so that each piece of text is treated independently. Sorry, I forgot to delete these unrelated original comments.
By the way, I am refactoring the training pipeline according to these reviews. For now, a new dataset class is located here, which handles the training text line by line and then batchifies the lines independently in CollateFunc.

        with open(text_file, 'r') as f:
            # a line represent a piece of text, e.g.
            # DELAWARE IS NOT AFRAID OF DOGS
            for line in f:
                text = line.strip().split()
                assert len(text) > 0
                text_id = self.text2id(text)
                # token_id format:
                # <bos_id> token_id token_id token_id *** <eos_id>
                token_id = self.text_id2token_id(text_id)
                self.data.append(token_id)

args = get_args()
if args.train_file is not None:
    train_files = [args.train_file]
    train_tokenizer(train_files, args.tokenizer_path, args.vocab_size)
Collaborator

methods like these (train_tokenizer, tokenize_text) would be good candidates to put into the "library" part of snowfall so anybody can import them easily for all the recipes.

Candidate for future work in snowfall: actually this whole script could be easily re-used across recipes had we added a mechanism for auto-registering scripts in PATH (can be done via setup.py)
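One way to do that auto-registration, sketched here with hypothetical package and module names, is a console_scripts entry point in setup.py:

# setup.py sketch; the package and module paths are hypothetical.
from setuptools import setup, find_packages

setup(
    name='snowfall',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            # Installs a `train-tokenizer` command on PATH that calls main()
            # in a shared library module, so every recipe can reuse it.
            'train-tokenizer=snowfall.tools.train_tokenizer:main',
        ],
    },
)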


@danpovey
Contributor

danpovey commented Mar 30, 2021 via email

def __getitem__(self, idx):
    return self.data[idx]

def text2id(self, text: List[str]) -> List[int]:
Collaborator

The following two methods can be removed.

Contributor Author

fixed

    nn.init.uniform_(self.decoder.weight, -initrange, initrange)

def forward(self, input, hidden):
    # import pdb; pdb.set_trace()
Contributor

would be nice to have the dimensions commented here, e.g. is it (batch_size, num_steps)?

Contributor Author

fixed

@danpovey
Contributor

danpovey commented Apr 1, 2021

Something is not installed...

 ./run.sh 2&
[1] 73056
de-74279-k2-dev-2-0331181900-7b69767657-72fhf:nnlm: training tokenizer
Traceback (most recent call last):
  File "local/huggingface_tokenizer.py", line 12, in <module>
    from tokenizers import Tokenizer
ModuleNotFoundError: No module named 'tokenizers'

I don't know how easy it is to set things up so that dependencies get installed automatically, or at least so that the user is told what to install.

@glynpu
Contributor Author

glynpu commented Apr 1, 2021

Something is not installed...

 ./run.sh 2&
[1] 73056
de-74279-k2-dev-2-0331181900-7b69767657-72fhf:nnlm: training tokenizer
Traceback (most recent call last):
  File "local/huggingface_tokenizer.py", line 12, in <module>
    from tokenizers import Tokenizer
ModuleNotFoundError: No module named 'tokenizers'

I don't know how easy it is to set things up so that dependencies get installed automatically, or at least so that the user is told what to install.

A commit to handle this together with other known bugs will be submitted this afternoon.

scripts to install tokenizers
fix training bugs
port online tokenization to offline tokenization
load/save checkpoint
@glynpu
Contributor Author

glynpu commented Apr 1, 2021


@danpovey I added a statement to automatically install the dependencies in run.sh:

if [ $stage -eq -1 ]; then
  # env for experiment ../simple_v1 is expected to have been built.
  echo "Install extra dependencies"
  pip install -r requirements.txt
fi

Now I am still facing some convergence issues. After several epochs, the PPL is stuck around 1000.
I am not sure whether there are some critical unknown bugs or it is just an inappropriate hyper-parameter configuration.

from typing import Any, Dict, Iterable, List, Optional, Tuple, Union

Pathlike = Union[str, Path]
Info = Union[dict, None]
Collaborator

This is equivalent to Info = Optional[dict]

# token_id format:
# <bos_id> token_id token_id token_id *** <eos_id>
if len(token_id) >= 2:
for idx, line in enumerate(f):
Collaborator

idx is never used.

Contributor Author

fixed

@@ -37,35 +37,22 @@ def __call__(self, batch: List[List[int]]):

class LMDataset(Dataset):

def __init__(self, text_file: str, lexicon):
def __init__(self, text_file: str):
Collaborator

Can you describe the format of text_file?

Contributor Author

fixed

@@ -29,17 +30,41 @@ def get_args():


def generate_tokens(args):
''' Extract symbols and there corresponding ids from a tokenizer,
Collaborator

typo: the corresponding.

Contributor Author

Fixed.

tokenizer = Tokenizer.from_file(args.tokenizer_path)
symbols = tokenizer.get_vocab()
tokens_file = '{}/tokens.txt'.format(args.lexicon_path)
tokens_f = open(tokens_file, 'w')
for idx, sym in enumerate(symbols):
    tokens_f.write('{} {}\n'.format(sym.lower(), idx))
id2sym = dict((v, k.lower()) for k, v in symbols.items())
Collaborator

id2sym = {idx: sym.lower() for sym, idx in symbols.items()}

is much clearer.

Contributor Author

fixed

for idx, sym in enumerate(symbols):
    tokens_f.write('{} {}\n'.format(sym.lower(), idx))
id2sym = dict((v, k.lower()) for k, v in symbols.items())
for idx in range(len(symbols)):
Collaborator

Is it required that the resulting file has its second column listed in increasing order?
Otherwise, it does not need to create another intermediate variable id2sym.
We can iterate over symbols directly.

Contributor Author

Just to ensure that the ids are continuous, and an ordered tokens.list looks nice.
The result is not sorted if we iterate over symbols directly; the output of:

    for k, v in symbols.items():
        print(k.lower(), v)

looks like the following (quite disordered):
'''
##ark 335
##umes 3822
vain 3593
eastern 4515
next 1372
knowing 4454
##jo 2789
western 3987
garden 1387
tree 1348
'''
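For reference, a small sketch that keeps the ids in increasing order without the intermediate id2sym dict, by sorting the vocab returned by tokenizer.get_vocab() (reusing tokens_file and symbols from the excerpt above):

# Write tokens.txt with the second column in increasing id order.
with open(tokens_file, 'w') as tokens_f:
    for sym, idx in sorted(symbols.items(), key=lambda kv: kv[1]):
        tokens_f.write('{} {}\n'.format(sym.lower(), idx))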

    output = tokenizer.encode(word)
    tokens = ' '.join(output.tokens)
else:
    tokens = '[unk]'
Collaborator

Is there a difference between [unk] and <UNK>?
I find that you're using <UNK> in the above special_words, but [unk] here.

BTW: what are special_words for?

Contributor Author

The special tokens are a heritage of words.txt (simple_v1/data/lang_nosp/words.txt), whose head is:

<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
A 4
...
#0 200004
<s> 200005
</s> 200006

I just want to make sure that every word in words.txt can be tokenized. As those special words are not "real" words, I think mapping them to [unk] is better than tokenizing them with the trained tokenizer.

In short, [UNK], together with the other special words, is a heritage from the upstream ASR pipeline, and [unk] is a token produced by the huggingface tokenizer.


train_data_loader = DataLoader(train_dataset,
                               batch_size=args.batch_size,
                               shuffle=False,
Collaborator

Do we need to set shuffle to True for training?

Contributor Author

Fixed; shuffle=True is now used. Shuffling was disabled only for debugging, to easily trace whether the DataLoader and collate function work as expected.

batch_input, batch_target = batch
batch_input = batch_input.to(self.device)
batch_target = batch_target.to(self.device)
self.model.to(self.device)
Collaborator

Would be great if this to(self.device) is moved out of the loop. It needs to be done
only once, e.g., inside the constructor self.__init__.
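A sketch of the suggested structure (the class and attribute names are illustrative, not the PR's actual trainer):

import torch

class Trainer:
    def __init__(self, model: torch.nn.Module, device: torch.device):
        self.device = device
        # Move the model to the device once, instead of once per batch.
        self.model = model.to(device)

    def train_step(self, batch):
        batch_input, batch_target = batch
        batch_input = batch_input.to(self.device)
        batch_target = batch_target.to(self.device)
        return self.model(batch_input), batch_target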

Contributor Author

fixed.

Args:
x: the sequence fed to the positional encoder model (required).
Shape:
x: [sequence length, batch size, embed dim]
Collaborator

It would be great if you got into the habit of writing more documentation.

You're saying that the input is of shape [seq_len, batch_size, embedding_dim],
but you are using batch first when invoking pad_sequence in dataset.py. This may explain why the training is not converging.
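A minimal sketch of the mismatch (the sequences are illustrative): pad_sequence with batch_first=True yields [batch, seq_len], which must be transposed before a module that expects [seq_len, batch, embed_dim]:

import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 5, 7, 2]),   # three token-id sequences of
        torch.tensor([1, 9, 2]),      # different lengths (illustrative)
        torch.tensor([1, 2])]

padded = pad_sequence(seqs, batch_first=True)  # shape [batch, seq_len] = [3, 4]
# A module documented as [seq_len, batch, ...] needs the time axis first:
time_major = padded.transpose(0, 1)            # shape [seq_len, batch] = [4, 3]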

Contributor Author

fixed.

@glynpu glynpu changed the title from "WIP: hugginface tokenizer and Neural LM training pipeline." to "WIP: huggingface tokenizer and Neural LM training pipeline." on Apr 15, 2021
Successfully merging this pull request may close these issues.

Low hanging fruit: neural language model