
Missing attributes start_char and end_char for certain tokens in Stanza pipeline output. #1436

Open
al3xkras opened this issue Nov 26, 2024 · 3 comments

@al3xkras

Describe the bug:

The output of the Stanza pipeline is missing start_char and end_char values for certain tokens. This issue can be observed in the following example, where the token 'It"s' lacks start_char and end_char values, even though these fields are present for other tokens in the output.

Steps to reproduce:

  1. Import Stanza and initialize a pipeline:

    import stanza
    pipeline = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,ner,depparse', download_method=None, verbose=0)

  2. Pass a simple sentence through the pipeline:

    pipeline('It"s an example sentence.')

  3. Check the output. The token with id=1 does not have start_char and end_char values.

Expected behavior:

All tokens in the output should include start_char and end_char values.

Actual behavior:

The token with id=1 ('It"s') lacks start_char and end_char values, while all other tokens include these fields.

[
  [
    {
      "id": 1,
      "text": "It\"s",
      "lemma": "irc",
      "upos": "VERB",
      "xpos": "VBZ",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
      "head": 0,
      "deprel": "root",
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 2,
      "text": "an",
      "lemma": "a",
      "upos": "DET",
      "xpos": "DT",
      "feats": "Definite=Ind|PronType=Art",
      "head": 4,
      "deprel": "det",
      "start_char": 5,
      "end_char": 7,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
...
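One way to confirm which tokens are affected is to scan the dictionary output for the missing offset keys. The following is a standalone sketch over `to_dict()`-style data mirroring the abridged structure shown above; it runs without a Stanza install:

```python
# Sketch: find words in Stanza's to_dict()-style output that lack
# character offsets. `doc` mirrors the (abridged) output above.
def missing_offsets(doc):
    """Yield (sentence_index, word_id, text) for words missing offsets."""
    for s_idx, sentence in enumerate(doc):
        for word in sentence:
            if "start_char" not in word or "end_char" not in word:
                yield s_idx, word["id"], word["text"]

doc = [
    [
        {"id": 1, "text": 'It"s', "upos": "VERB"},
        {"id": 2, "text": "an", "upos": "DET", "start_char": 5, "end_char": 7},
    ]
]

print(list(missing_offsets(doc)))  # [(0, 1, 'It"s')]
```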

Environment (please complete the following information):

  • Kernel version: 6.11.2-amd64
  • Python version: 3.11.10
  • Installed packages:
    • numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1728239949208/work/dist/numpy-2.1.2-cp311-cp311-linux_x86_64.whl#sha256=768addcb66d11bf95f7d2036c7b2595c638ef6539dba5f6a98faa0cdd9170ce8
    • stanza==1.9.2
@al3xkras al3xkras added the bug label Nov 26, 2024
AngledLuffa added a commit that referenced this issue Nov 27, 2024
… object has start_char and end_char.

Will accommodate MWT Tokens which were detected by the tokenizer but not expanded by the MWT model, which can happen with typos such as it"s

#1436
@AngledLuffa
Collaborator

Thanks, that's interesting. There are two models for splitting words like it's or it"s into pieces, the tokenizer and the MWT seq2seq model. Apparently the tokenizer detects it"s as needing to be split, but then the seq2seq model doesn't actually expand it into anything, and the code later on doesn't properly handle that case.

That much is fixed on dev, but I wonder if it would be worth training the seq2seq model with ' sometimes replaced with " in the training data, just to teach it how to properly expand those words as well. Please don't close the issue so I can take a little time to ponder it.
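For illustration, here is a minimal sketch of the kind of fallback the dev fix presumably applies (the function and dict layout are hypothetical, not Stanza's actual internals): when the tokenizer marks a token for MWT expansion but the seq2seq model returns no expansion, the single resulting word can inherit the token's own offsets instead of having none.

```python
# Hypothetical post-MWT bookkeeping, simplified for illustration.
def words_with_offsets(token, expanded):
    """`token` is what the tokenizer produced; `expanded` is the
    (possibly empty) list of words the MWT model returned for it."""
    if not expanded:
        # MWT declined to expand (e.g. a typo like it"s): fall back to
        # a single word that inherits the token's character offsets.
        return [{
            "text": token["text"],
            "start_char": token["start_char"],
            "end_char": token["end_char"],
        }]
    return expanded

token = {"text": 'It"s', "start_char": 0, "end_char": 4}
print(words_with_offsets(token, []))
```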

@al3xkras
Author

al3xkras commented Nov 27, 2024

Thank you for your response and explanation. It's good to know that part of the problem is already addressed on the dev branch.

Regarding the possibility of retraining the seq2seq model with modified training data: I agree that adding such variations (e.g., replacing ' with " in contractions) might seem unwarranted given the relatively low likelihood of encountering such unusually written words in standard text. However, while these cases may be rare, they do occur in real-world data, especially when processing text or datasets from diverse sources. I can attest to this from my own experience: I encountered exactly such a case while processing approximately 500 MB of text from the internet.
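Until a retrained model lands, one preprocessing workaround (my own sketch, not part of Stanza) is to normalize suspicious apostrophe characters before tokenization. Since every replacement swaps one character for one character, start_char and end_char computed on the normalized text still line up with positions in the raw text:

```python
import re

def normalize_apostrophes(text):
    """Replace curly apostrophes and a double quote sandwiched between
    word characters (a likely typo for an apostrophe) with a straight
    apostrophe. All replacements are one-for-one, so character offsets
    are preserved."""
    text = text.replace("\u2019", "'").replace("\u02bc", "'")
    return re.sub(r'(?<=\w)"(?=\w)', "'", text)

print(normalize_apostrophes('It"s an example sentence.'))
# It's an example sentence.
```

Note that the lookbehind/lookahead keeps genuine quotation marks intact, since those are adjacent to whitespace or punctuation rather than surrounded by word characters on both sides.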

I'm looking forward to new updates and I'll leave the issue open for now

@AngledLuffa
Collaborator

It's actually pretty easy to compensate for, and I can add a couple of other unknown apostrophes to the model at the same time, so I think this should be pretty easy to fix in a generalized manner.
