Need help in understanding spacy debug data output #13089

vignesh-spericorn · 2023-10-27T08:24:44Z

Hello,

In the following output of the spacy debug data command, there is a line: "⚠ 20912 words in training data without vectors".
What does this line mean, and how does it affect training and model performance?

I would appreciate the community's help.
Thanks in advance.

============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable

=============================== Training stats ===============================
Language: en
Training pipeline: tok2vec, ner
6795 training docs
500 evaluation docs
⚠ 105 training examples also in evaluation data

============================== Vocab & Vectors ==============================
ℹ 83786 total word(s) in the data (12761 unique)
ℹ 514157 vectors (514157 unique keys, 300 dimensions)
⚠ 20912 words in training data without vectors (25%)

========================== Named Entity Recognition ==========================
ℹ 3 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries

================================== Summary ==================================
✔ 6 checks passed
⚠ 2 warnings

rmitsch · 2023-10-30T14:47:00Z

rmitsch
Oct 30, 2023
Maintainer

Hi @vignesh-spericorn, this means that 20912 words in your training data do not have vector representations in the set of vectors you are using. This can happen for specialized or newer terms that are not part of the model's vocabulary.
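
As a quick sanity check on how to read that number, the arithmetic below uses only the figures from the debug output above. It suggests that the 25% is taken over all 83786 word occurrences rather than over the 12761 unique words, since 20912 is larger than the unique count. The variable names are made up for illustration.

```python
# Figures copied from the "Vocab & Vectors" section of the debug output.
total_words = 83786            # all word occurrences in the training data
unique_words = 12761           # distinct words
words_without_vectors = 20912  # the number from the warning

# The warning count exceeds the number of unique words, so it cannot be a
# count of distinct words; it has to be counting occurrences.
counts_occurrences = words_without_vectors > unique_words

# Share of occurrences that have no vector.
share = words_without_vectors / total_words
print(f"counts occurrences: {counts_occurrences}")
print(f"share without vectors: {share:.0%}")
```

In other words, roughly one in four tokens seen during training has no entry in the vectors table.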

Have a look at the docs for more info on this.

6 replies

rmitsch Oct 31, 2023
Maintainer

Out-of-vocabulary tokens will be assigned a zero-vector, which will impact downstream accuracy negatively (as these tokens' embeddings won't contain meaningful information).
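
To illustrate why a zero vector carries no meaningful information, here is a minimal sketch in plain Python. It is not spaCy's actual implementation: the tiny table, the 3-dimensional width, and the `lookup` helper are all hypothetical.

```python
# Toy vector table; a real table (like the 300-dimensional one above) is far larger.
WIDTH = 3
table = {
    "bank": [0.2, -0.1, 0.7],
    "river": [0.3, 0.0, 0.6],
}


def lookup(word):
    """Return the stored vector, or an all-zero vector for unknown words."""
    return table.get(word, [0.0] * WIDTH)


# Known words keep distinct vectors.
known_differ = lookup("bank") != lookup("river")

# Two unrelated out-of-vocabulary words end up with the very same vector,
# so anything that reads only these vectors cannot tell them apart.
oov_identical = lookup("tokenizer") == lookup("kubernetes")

print(known_differ, oov_identical)
```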

vignesh-spericorn Oct 31, 2023
Author

Thank you for the reply. Is it possible to identify these 20912 words and create vectors for them?

rmitsch Oct 31, 2023
Maintainer

You can use Token.is_oov to identify them.
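
A minimal sketch of how `Token.is_oov` could be used to list the affected words. The helper names `count_oov` and `report_oov` are hypothetical, and the pipeline name `en_core_web_lg` is only an assumption; substitute whichever pipeline or vectors your config actually uses.

```python
from collections import Counter


def count_oov(docs):
    """Count the surface forms of all tokens flagged as out-of-vocabulary.

    `docs` is any iterable of token sequences whose tokens expose
    `.text` and `.is_oov`, for example spaCy Doc objects.
    """
    counts = Counter()
    for doc in docs:
        for token in doc:
            if token.is_oov:
                counts[token.text] += 1
    return counts


def report_oov(texts, pipeline_name="en_core_web_lg", top_n=20):
    """Run texts through a spaCy pipeline and return the most frequent OOV words.

    The default pipeline name is an assumption for illustration only.
    """
    import spacy  # imported lazily so count_oov stays usable without spaCy

    nlp = spacy.load(pipeline_name)
    return count_oov(nlp.pipe(texts)).most_common(top_n)
```

Calling `report_oov(my_texts)` on the raw training texts should give a frequency-sorted list, which makes it easier to see whether the missing words are typos, domain terms, or tokenization artifacts.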

vignesh-spericorn Nov 1, 2023
Author

Is it possible to create vectors for these 20912 words?

rmitsch Nov 2, 2023
Maintainer

You would need to train a new set of word embeddings from scratch (see also this discussion) or realign the word embeddings trained on the 20912 OOV words with the old embedding, which might be tricky.
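
If you do train new embeddings with an external tool, one common hand-off format is the plain-text word2vec layout (a header line with the number of words and the vector width, then one word per line followed by its values), which the spacy init vectors command can convert into a vectors table. The sketch below only shows that export step. The helper `write_word2vec_text` and the toy values are hypothetical, and how you obtain the vectors themselves is outside its scope.

```python
import io


def write_word2vec_text(vectors, out):
    """Write a {word: [floats]} mapping in the plain-text word2vec layout."""
    if not vectors:
        raise ValueError("no vectors to write")
    widths = {len(values) for values in vectors.values()}
    if len(widths) != 1:
        raise ValueError("all vectors must have the same width")
    width = widths.pop()
    # Header line: number of words, then vector width.
    out.write(f"{len(vectors)} {width}\n")
    for word, values in vectors.items():
        out.write(word + " " + " ".join(f"{x:.6f}" for x in values) + "\n")


# Toy data standing in for embeddings trained elsewhere.
toy_vectors = {"spacy": [0.1, 0.2, 0.3], "tok2vec": [0.4, 0.5, 0.6]}
buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
print(buffer.getvalue())
```

In practice you would write to a file instead of a StringIO buffer and then point spacy init vectors at that file.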
