Missing attributes start_char and end_char for certain tokens in Stanza pipeline output. #1436
… object has start_char and end_char. Will accommodate MWT Tokens which were detected by the tokenizer but not expanded by the MWT model, which can happen with typos such as it"s #1436
Thanks, that's interesting. There are two models for splitting words like it"s. That much is fixed on dev, but I wonder if it would be worth training the seq2seq model with modified training data.
Thank you for your response and explanation. It's good to know that part of the problem is already addressed on the dev branch. Regarding retraining the seq2seq model with modified training data: I agree that adding such variations (e.g., replacing ' with " in contractions) might seem infeasible, given the relatively low likelihood of encountering such specially crafted words in standard text. However, while these cases might be rare, they do occur in real-world data, especially when processing text from diverse sources. I can attest to this from my own experience: I encountered exactly this case while processing approximately 500MB of text from the internet. I'm looking forward to new updates, and I'll leave the issue open for now.
It's actually pretty easy to compensate for, and I can add a couple of other unknown apostrophes to the model at the same time, so I think this can be fixed in a generalized manner.
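Until such a fix ships, one user-side workaround (a sketch, not Stanza's actual fix) is to normalize the double-quote typo and common apostrophe look-alikes to a plain ASCII apostrophe before tokenization, so contractions match what the MWT model was trained on:

```python
import re

def normalize_apostrophes(text: str) -> str:
    # A double quote sandwiched between word characters is the typo pattern
    # from this issue (e.g. It"s); real quotation marks are left alone.
    text = re.sub(r'(?<=\w)"(?=\w)', "'", text)
    # Also fold the right single quotation mark and the modifier letter
    # apostrophe into the ASCII apostrophe.
    return text.replace('\u2019', "'").replace('\u02bc', "'")

print(normalize_apostrophes('It"s an "example" sentence.'))  # → It's an "example" sentence.
```

Note that this changes character positions only when a multi-character replacement is used; the substitutions above are one-to-one, so start_char/end_char computed on the normalized text still line up with the original.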
Describe the bug:
The output of the Stanza pipeline is missing start_char and end_char values for certain tokens. This issue can be observed in the following example, where the token 'It"s' lacks start_char and end_char values, even though these fields are present for other tokens in the output.
Steps to reproduce:
Import Stanza and initialize a pipeline:
import stanza
pipeline = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,ner,depparse', download_method=None, verbose=0)
Pass a simple sentence through the pipeline:
pipeline('It"s an example sentence.')
Check the output. The token with id=1 does not have start_char and end_char values.
Expected behavior:
All tokens in the output should include start_char and end_char values.
Actual behavior:
The token with id=1 ('It"s') lacks start_char and end_char values, while all other tokens include these fields.
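As a stopgap, the missing offsets can be recovered after the fact by locating each token's surface text in the original string. The token layout below (dicts with 'text', 'start_char', 'end_char') is a simplified stand-in for Stanza's Token objects, which expose the same fields:

```python
def recover_offsets(text, tokens):
    # Recover missing start_char/end_char by searching for the token's
    # surface text in the raw input, scanning forward from the end of the
    # previous located token so repeated words resolve to the right spot.
    cursor = 0
    for tok in tokens:
        if tok.get('start_char') is None:
            start = text.find(tok['text'], cursor)
            if start != -1:
                tok['start_char'] = start
                tok['end_char'] = start + len(tok['text'])
        if tok.get('end_char') is not None:
            cursor = tok['end_char']
    return tokens

tokens = [
    {'text': 'It"s', 'start_char': None, 'end_char': None},  # offsets missing
    {'text': 'an', 'start_char': 5, 'end_char': 7},
]
recover_offsets('It"s an example sentence.', tokens)
print(tokens[0]['start_char'], tokens[0]['end_char'])  # → 0 4
```

This only works when the token text appears verbatim in the input, which holds for unexpanded MWT tokens like the one in this report.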