Need help in understanding spacy debug data output #13089
vignesh-spericorn
started this conversation in
Help: Best practices
-
Hi @vignesh-spericorn, this means that 20912 words in your training data do not have vector representations in the set of vectors you are using. This can happen for specialized or newer terms that are not part of the model's vocabulary. Have a look at the docs for more info on this.
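To make the warning concrete, here is a minimal sketch of the kind of count it reports. It is plain Python, not spaCy's actual implementation: the function name `words_without_vectors`, the `vector_keys` set, and the toy documents are all hypothetical. In spaCy itself, the per-token check would be `token.has_vector`. Note that in the posted output 20912 / 83786 is roughly 25%, which suggests the figure counts word occurrences rather than unique words (there are only 12761 unique words).

```python
from collections import Counter


def words_without_vectors(docs, vector_keys):
    """Count token occurrences whose text has no entry in vector_keys.

    docs: iterable of token lists (hypothetical stand-in for training docs).
    vector_keys: set of words that have a vector (stand-in for the vector table).
    """
    missing = Counter()
    total = 0
    for tokens in docs:
        for tok in tokens:
            total += 1
            if tok not in vector_keys:
                missing[tok] += 1
    return missing, total


# Hypothetical toy data: a small vector vocabulary and two tokenized docs.
vector_keys = {"the", "patient", "took", "a", "dose", "of"}
docs = [
    ["the", "patient", "took", "a", "dose", "of", "zolpidem"],
    ["zolpidem", "dose", "of", "10mg"],
]

missing, total = words_without_vectors(docs, vector_keys)
n_missing = sum(missing.values())  # occurrences, not unique words

print(f"{n_missing} of {total} words without vectors ({n_missing / total:.0%})")
print(missing.most_common(2))  # the most frequent out-of-vocabulary words
```

Listing the most frequent missing words is a quick way to see whether they are domain terms, typos, or tokenization artifacts, which in turn hints at whether the missing vectors are likely to matter for your entities.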
-
Hello,
In the following output of the spacy debug data command, there is a line "⚠ 20912 words in training data without vectors".
What does this line mean, and how does it affect training and model performance?
I would appreciate the community's help.
Thanks in advance.
============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable
=============================== Training stats ===============================
Language: en
Training pipeline: tok2vec, ner
6795 training docs
500 evaluation docs
⚠ 105 training examples also in evaluation data
============================== Vocab & Vectors ==============================
ℹ 83786 total word(s) in the data (12761 unique)
ℹ 514157 vectors (514157 unique keys, 300 dimensions)
⚠ 20912 words in training data without vectors (25%)
========================== Named Entity Recognition ==========================
ℹ 3 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries
================================== Summary ==================================
✔ 6 checks passed
⚠ 2 warnings