
Missing attributes start_char and end_char for certain tokens in Stanza pipeline output. #1436

Open
al3xkras opened this issue Nov 26, 2024 · 3 comments

@al3xkras

Describe the bug:

The output of the Stanza pipeline is missing start_char and end_char values for certain tokens. This issue can be observed in the following example, where the token 'It"s' lacks start_char and end_char values, even though these fields are present for other tokens in the output.

Steps to reproduce:

  1. Import Stanza and initialize a pipeline:

    import stanza
    pipeline = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,ner,depparse', download_method=None, verbose=0)

  2. Pass a simple sentence through the pipeline:

    pipeline('It"s an example sentence.')

  3. Check the output. The token with id=1 does not have start_char and end_char values.

Expected behavior:

All tokens in the output should include start_char and end_char values.

Actual behavior:

The token with id=1 ('It"s') lacks start_char and end_char values, while all other tokens include these fields.

[
  [
    {
      "id": 1,
      "text": "It\"s",
      "lemma": "irc",
      "upos": "VERB",
      "xpos": "VBZ",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
      "head": 0,
      "deprel": "root",
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 2,
      "text": "an",
      "lemma": "a",
      "upos": "DET",
      "xpos": "DT",
      "feats": "Definite=Ind|PronType=Art",
      "head": 4,
      "deprel": "det",
      "start_char": 5,
      "end_char": 7,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
...
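One way to confirm which tokens are affected is to scan the dictionary output for the missing offset keys. The following is a standalone sketch over `to_dict()`-style data mirroring the abridged structure shown above; it runs without a Stanza install:

```python
# Sketch: find words in Stanza's to_dict()-style output that lack
# character offsets. `doc` mirrors the (abridged) output above.
def missing_offsets(doc):
    """Yield (sentence_index, word_id, text) for words missing offsets."""
    for s_idx, sentence in enumerate(doc):
        for word in sentence:
            if "start_char" not in word or "end_char" not in word:
                yield s_idx, word["id"], word["text"]

doc = [
    [
        {"id": 1, "text": 'It"s', "upos": "VERB"},
        {"id": 2, "text": "an", "upos": "DET", "start_char": 5, "end_char": 7},
    ]
]

print(list(missing_offsets(doc)))  # [(0, 1, 'It"s')]
```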

Environment (please complete the following information):

  • Kernel version: 6.11.2-amd64
  • Python version: 3.11.10
  • Installed packages:
    • numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1728239949208/work/dist/numpy-2.1.2-cp311-cp311-linux_x86_64.whl#sha256=768addcb66d11bf95f7d2036c7b2595c638ef6539dba5f6a98faa0cdd9170ce8
    • stanza==1.9.2
@al3xkras al3xkras added the bug label Nov 26, 2024
AngledLuffa added a commit that referenced this issue Nov 27, 2024
… object has start_char and end_char.

Will accommodate MWT Tokens which were detected by the tokenizer but not expanded by the MWT model, which can happen with typos such as it"s

#1436
@AngledLuffa
Collaborator

Thanks, that's interesting. There are two models for splitting words like it's or it"s into pieces, the tokenizer and the MWT seq2seq model. Apparently the tokenizer detects it"s as needing to be split, but then the seq2seq model doesn't actually expand it into anything, and the code later on doesn't properly handle that case.

That much is fixed on dev, but I wonder if it would be worth training the seq2seq model with ' sometimes replaced with " in the training data, just to teach it how to properly expand those words as well. Please don't close the issue so I can take a little time to ponder it.
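For illustration, here is a minimal sketch of the kind of fallback the dev fix presumably applies (the function and dict layout are hypothetical, not Stanza's actual internals): when the tokenizer marks a token for MWT expansion but the seq2seq model returns no expansion, the single resulting word can inherit the token's own offsets instead of having none.

```python
# Hypothetical post-MWT bookkeeping, simplified for illustration.
def words_with_offsets(token, expanded):
    """`token` is what the tokenizer produced; `expanded` is the
    (possibly empty) list of words the MWT model returned for it."""
    if not expanded:
        # MWT declined to expand (e.g. a typo like it"s): fall back to
        # a single word that inherits the token's character offsets.
        return [{
            "text": token["text"],
            "start_char": token["start_char"],
            "end_char": token["end_char"],
        }]
    return expanded

token = {"text": 'It"s', "start_char": 0, "end_char": 4}
print(words_with_offsets(token, []))
```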

@al3xkras
Author

al3xkras commented Nov 27, 2024

Thank you for your response and explanation. It's good to know that part of the problem is already addressed on the dev branch.

Regarding the possibility of retraining the seq2seq model with modified training data: I agree that adding such variations (e.g., replacing ' with " in contractions) might seem unwarranted given the relatively low likelihood of encountering such unusually written words in standard text. However, while these cases may be rare, they do occur in real-world data, especially when processing text or datasets from diverse sources. I can attest to this from my own experience: I encountered exactly such a case while processing approximately 500 MB of text from the internet.
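Until a retrained model lands, one preprocessing workaround (my own sketch, not part of Stanza) is to normalize suspicious apostrophe characters before tokenization. Since every replacement swaps one character for one character, start_char and end_char computed on the normalized text still line up with positions in the raw text:

```python
import re

def normalize_apostrophes(text):
    """Replace curly apostrophes and a double quote sandwiched between
    word characters (a likely typo for an apostrophe) with a straight
    apostrophe. All replacements are one-for-one, so character offsets
    are preserved."""
    text = text.replace("\u2019", "'").replace("\u02bc", "'")
    return re.sub(r'(?<=\w)"(?=\w)', "'", text)

print(normalize_apostrophes('It"s an example sentence.'))
# It's an example sentence.
```

Note that the lookbehind/lookahead keeps genuine quotation marks intact, since those are adjacent to whitespace or punctuation rather than surrounded by word characters on both sides.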

I'm looking forward to new updates and I'll leave the issue open for now

@AngledLuffa
Collaborator

It's actually pretty easy to compensate for, and I can add a couple of other unknown apostrophes to the model at the same time, so I think this should be pretty easy to fix in a generalized manner.
