Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fix transformers vocab not being saved to files #4023

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Fix transformers vocab not being saved to files #4023

wants to merge 3 commits into from

Conversation

MaksymDel
Copy link
Contributor

@MaksymDel MaksymDel commented Apr 4, 2020

Fixes #3097. Here we add a method to vocab to populate its namespace from transformers encoding (which is fetched in indexer) with an option to resave vocab to files (which is the main problem of #3097).

I reserved another non_padded_namespace family: *from_transformers to use it by default for transformers instead of tags.

Not saving transformers vocab to disc prevents from_archive model loading (because vocab folder can be empty) with all the implications.

I know this fix is not ideal, but I think it is an improvement over what we have currently. We can also open a separate issue dedicated to figuring out transformers <-> allennlp relationships.

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maksym-del! I think this is a reasonable minimal fix to #3097. @dirkgr, you've been thinking about this a bunch too; what do you think?

allennlp/data/vocabulary.py Outdated Show resolved Hide resolved
allennlp/data/vocabulary.py Outdated Show resolved Hide resolved
@MaksymDel MaksymDel requested a review from matt-gardner April 5, 2020 08:40
@@ -28,7 +28,7 @@ class PretrainedTransformerIndexer(TokenIndexer):

model_name : `str`
The name of the `transformers` model to use.
namespace : `str`, optional (default=`tags`)
namespace : `str`, optional (default=`from_transformers`)
We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace.
We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
We use `from_transformers` as the default namespace, which gets special treatment in our
`Vocabulary` class to not get padding or UNK tokens.

Actually, how are UNK and padding handled here? Is it possible to set them using the transformer? We probably do want UNK handling and padding tokens, we just need them set in the right way, both here and during loading. This might need some additional changes to how we save and load vocabularies.

And, while we're at it, we probably want the padding and unk tokens to be namespace-dependent.

self,
encoding_dictionary: Dict[str, int],
namespace: str = "from_transformers",
resave_to_files: bool = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting to be a bigger change, but it's probably necessary to really fix this the right way. But I think this function should take arguments for setting padding and oov tokens, and those should set them in a namespace-specific way. And the padding and oov tokens then would need to be serialized with the vocabulary and read from the serialized files. Then we wouldn't need the from_transformers namespace added to the non-padded namespaces.

I would wait to do any work on this, though, until @dirkgr has had a chance to glance at it and see if he agrees. I think this is the right approach, but I could be missing something.

@dirkgr
Copy link
Member

dirkgr commented Apr 6, 2020

It's a minimal fix, but I am very unhappy about the amount of magic going on. The way we do vocabularies already involves an untenable amount of magic, and this adds more. Saving state like this, especially the name of a file that gets overwritten, is just asking for trouble the next time this is used in a scenario we didn't think of.

I sketched an alternative here: #4034. I'll add some discussion over there. It is not as minimal as this is, but it has some other advantages.

@schmmd schmmd changed the base branch from master to main December 23, 2020 18:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rethink vocabulary API
3 participants