version 1.42 versus 1.40 #1181

robertobartolini · 2023-01-18T14:44:33Z

robertobartolini
Jan 18, 2023

Hi,
I'm using Stanza NLP in a Python project. When I upgraded from version 1.40 to 1.42, I realized that some tokens now consist of multiple words, which causes problems when I transform the output into XML. Here is an example ("Donno Esposito"):

225 , , PUNCT FF _ 221 punct _ start_char=1209|end_char=1210 -
226 Donno Esposito Donno Esposito PROPN SP _ 1 _ start_char=1221|end_char=1235 B-PER
227 Giuseppe Giuseppe PROPN SP _ 226 flat:name _ start_char=1236|end_char=1244 E-PER
228 , , PUNCT FF _ 229 punct _ start_char=1244|end_char=1245 -

Is this behavior correct?
If so, can it be disabled from Python?
Best,
Roberto.

AngledLuffa · 2023-01-18T18:08:25Z

AngledLuffa
Jan 18, 2023
Maintainer

Would you give us a language and sample text to run?


robertobartolini · 2023-01-19T09:25:09Z

robertobartolini
Jan 19, 2023
Author

Hi,
The language is Italian, and here is a passage that triggers it (it is mostly a list of names; I'll try to find a better sentence that produces this behavior):
"Dichiaro aperta la votazione. Invito il senatore Segretario a procedere alla chiama. Il senatore Segretario GENTILE e, successivamente, la senatrice Segretario SAGGESE fanno l'appello. Prendono parte alla votazione i senatori: Aiello, Airola, Albano, Alberti Casellati, Albertini, Alicata, Amati, Amoruso, Angioni, Aracri, Astorre, Augello, Azzollini Barani, Barozzino, Battista, Bellot, Bencini, Berger, Bernini, Bertorotta, Bertuzzi, Bianconi, Bignami, Bilardi, Bisinella, Bitonci, Blundo, Bocca, Bocchino, Bonfrisco, Borioli, Bottici, Broglia, Bruni, Bruno, Buccarella, Bulgarelli Calderoli, Caleo, Caliendo, Campanella, Candiani, Cantini, Capacchione, Cappelletti, Cardiello, Cardinali, Caridi, Carraro, Casaletto, Casini, Cassano, Casson, Castaldi, Catalfo, Centinaio, Ceroni, Cervellini, Chiavaroli, Chiti, Ciampolillo, Cioffi, Cirinnà, Cociancich, Collina, Colucci, Comaroli, Compagna, Compagnone, Consiglio, Conte, Conti, Corsini, Cotti, Crimi, Crosio, Cucca, Cuomo D'Adda, D'Alì, Dalla Tor, Dalla Zuanna, D'Ambrosio Lettieri, D'Anna, Davico, De Biasi, De Cristofaro, De Monte, De Petris, De Pietro, De Pin, De Poli, De Siano, Del Barba, Della Vedova, Di Biagio, Di Giorgi, Di Maggio, Dirindin, Divina, D'Onghia, Donno Esposito Giuseppe, Esposito Stefano Fabbri, Falanga, Fasano, Fattori, Fattorini, Favero, Fazzone, Fedeli, Ferrara Elena, Ferrara Mario, Filippi, Filippin, Finocchiaro, Fissore, Floris, Formigoni, Fornaro, Fravezzi, Fucksia Gaetti, Galimberti, Gambaro, Gasparri, Gatti, Gentile, Ghedini Rita, Giacobbe, Giannini, Giarrusso, Gibiino, Ginetti, Giovanardi, Giro, Girotto, Gotor Facello, Granaiola, Gualdani, Guerra, Guerrieri Paleotti Ichino, Iurlaro Lai Bachisio, Langella, Laniece, Lanzillotta, Latorre, Lepri, Lezzi, Liuzzi, Lo Giudice, Lo Moro, Lucherini, Lucidi, Lumia Malan, Manassero, Manconi, Mancuso, Mandelli, Mangili, Maran, Marcucci, Margiotta, Marin, Marinello, Marino Luigi, Marino Mauro Maria, Martelli, Martini, Marton, Mastrangeli, Matteoli, Mattesini, 
Maturani, Mauro Giovanni, Mazzoni, Merloni, Messina, Micheloni, Migliavacca, Milo, Mineo, Minniti, Minzolini, Mirabelli, Molinari, Montevecchi, Monti, Morgoni, Moronese, Morra, Moscardelli, Mucchetti, Munerato, Mussini, Mussolini Naccarato, Nencini, Nugnes Olivero, Orellana, Orrù Padua, Pagano, Pagliari, Paglini, Palermo, Palma, Panizza, Parente, Pegorer, Pelino, Pepe, Perrone, Petraglia, Petrocelli, Pezzopane, Piccoli, Pignedoli, Pizzetti, Puglia, Puglisi, Puppato Ranucci, Razzi, Ricchiuti, Rizzotti, Romani Maurizio, Romani Paolo, Romano, Rossi Gianluca, Rossi Luciano, Rossi Maurizio Giuseppe, Russo, Ruta, Ruvolo Sacconi, Saggese, Santangelo, Santini, Scalia, Scavone, Schifani, Sciascia, Scibona, Scilipoti, Scoma, Serafini, Serra, Sibilia, Silvestro, Simeoni, Sollo, Sonego, Spilabotte, Sposetti, Stefani, Stefano, Susta Tarquinio, Taverna, Tocci, Tomaselli, Tonini, Torrisi, Tremonti, Tronti, Turano Uras Vaccari, Vacciano, Valentini, Vattuone, Verducci, Verro, Vicari, Viceconte, Villari, Volpi Zanda, Zanettin, Zanoni, Zavoli, Zeller, Zin, Zizza, Zuffada. PRESIDENTE. Dichiaro chiusa la votazione e invito i senatori Segretari a procedere allo spoglio delle schede e al computo dei voti, che avverrà nell'adiacente Sala Pannini. In attesa dei risultati della votazione, sospendo la seduta. (La seduta, sospesa alle ore 11,50, è ripresa alle ore 13,03)."


AngledLuffa · 2023-01-20T06:50:38Z

AngledLuffa
Jan 20, 2023
Maintainer

I was not able to trigger this with the current dev branch of Stanza. I did the following:

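import stanza

# Tokenize-only Italian pipeline
pipe = stanza.Pipeline("it", processors="tokenize")
doc = pipe(text)  # text holds the sample passage from the previous comment
for sentence in doc.sentences:
    for token in sentence.tokens:
        # Report any token whose surface text contains a space
        if " " in token.text:
            print(token)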

and there were no results. If I search for Donno instead, I get

[
  {
    "id": 225,
    "text": "Donno",
    "start_char": 1221,
    "end_char": 1226
  }
]

So, hopefully the models on the dev branch are a bit better, unless I was not replicating the issue correctly.


robertobartolini · 2023-01-20T12:38:47Z

robertobartolini
Jan 20, 2023
Author

Hi,
I wrote this small piece of code; the input file contains the sentence in question on a single line:

import stanza

def main_prova():
    # nlp is the stanza pipeline object, created elsewhere
    with open("input.txt") as ti:
        testo = ti.readline()
    doc = nlp(testo)
    # Write the input text followed by every token to the log file
    with open("fileLog.txt", "w") as tf:
        print(testo, file=tf)
        for sentence in doc.sentences:
            for token in sentence.tokens:
                print(token, file=tf)

this is what I get:
.....
[
  {
    "id": 225,
    "text": ",",
    "lemma": ",",
    "upos": "PUNCT",
    "xpos": "FF",
    "head": 226,
    "deprel": "punct",
    "start_char": 1219,
    "end_char": 1220,
    "ner": "O",
    "multi_ner": [
      "O"
    ]
  }
]
[
  {
    "id": 226,
    "text": "Donno Esposito",
    "lemma": "Donno Esposito",
    "upos": "PROPN",
    "xpos": "SP",
    "head": 1,
    "deprel": "",
    "start_char": 1221,
    "end_char": 1235,
    "ner": "B-PER",
    "multi_ner": [
      "B-PER"
    ]
  }
]
[
  {
    "id": 227,
    "text": "Giuseppe",
    "lemma": "Giuseppe",
    "upos": "PROPN",
    "xpos": "SP",
    "head": 226,
    "deprel": "flat:name",
    "start_char": 1236,
    "end_char": 1244,
    "ner": "E-PER",
    "multi_ner": [
      "E-PER"
    ]
  }
]

As you can see, I always get the token "Donno Esposito" with the space in between. In other cases an expression is recognized as a multiword, "a regime" for example, and in that case too I receive a token with a space. I use Python 3.10, and I attach my settings in PyCharm:


robertobartolini · 2023-01-20T12:41:07Z

robertobartolini
Jan 20, 2023
Author

When I used Stanza 1.40 the behavior was different, and "Donno Esposito" was split into 2 tokens.
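
For anyone who needs a stopgap before converting to XML, here is a minimal sketch of a post-processing step that splits such tokens again. It is not part of Stanza: `split_spaced_tokens` is a hypothetical helper, and it assumes the tokens are available as plain dicts with the "text", "start_char" and "end_char" keys shown in the output above. The split pieces carry only text and offsets, since lemma, head and NER labels cannot be reassigned reliably, and ids are not renumbered.

```python
def split_spaced_tokens(tokens):
    """Split token dicts whose "text" contains spaces into one dict per piece.

    Hypothetical helper, not a Stanza API. Assumes start_char/end_char are
    offsets into the original string, as in the printed token output.
    """
    result = []
    for tok in tokens:
        text = tok["text"]
        if " " not in text:
            # Ordinary token: keep it untouched
            result.append(tok)
            continue
        offset = tok["start_char"]
        pos = 0  # position of the current piece inside the token text
        for piece in text.split(" "):
            if piece:
                start = offset + pos
                result.append({
                    "text": piece,
                    "start_char": start,
                    "end_char": start + len(piece),
                })
            # Advance past the piece and the single space that followed it
            pos += len(piece) + 1
    return result


example = [
    {"id": 226, "text": "Donno Esposito", "start_char": 1221, "end_char": 1235},
    {"id": 227, "text": "Giuseppe", "start_char": 1236, "end_char": 1244},
]
for tok in split_spaced_tokens(example):
    print(tok["text"], tok["start_char"], tok["end_char"])
```

With the offsets from the output above, "Donno" comes out as 1221 to 1226 and "Esposito" as 1227 to 1235.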


AngledLuffa · 2023-01-20T15:57:00Z

AngledLuffa
Jan 20, 2023
Maintainer

How are you creating the "nlp" object?


robertobartolini · 2023-01-20T16:01:39Z

robertobartolini
Jan 20, 2023
Author

nlp = stanza.Pipeline('it', download_method=DownloadMethod.REUSE_RESOURCES)

I also tested without the "download_method" flag.


AngledLuffa · 2023-01-20T17:15:54Z

AngledLuffa
Jan 20, 2023
Maintainer

What if you install the dev branch instead?


robertobartolini · 2023-01-23T10:45:32Z

robertobartolini
Jan 23, 2023
Author

I'll try as soon as I can and let you know....


robertobartolini · 2023-01-23T13:55:12Z

robertobartolini
Jan 23, 2023
Author

OK,
with the development version (stanza 1.50) the error does not occur.
In summary: version 1.40 is fine, and so is 1.50 (the Stanza dev branch), while the reported error occurs with both 1.41 and 1.42.
Roberto.
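
A small guard along these lines could warn when a project runs on one of the versions reported here. This is only a sketch: `is_affected` and `AFFECTED` are hypothetical names, and it assumes that "1.41" and "1.42" in this thread correspond to the release strings "1.4.1" and "1.4.2", and that the version string starts with three dot-separated integers.

```python
# Versions reported as affected in this thread.
# Assumption: "1.41"/"1.42" mean the releases 1.4.1 and 1.4.2.
AFFECTED = {(1, 4, 1), (1, 4, 2)}


def parse_version(version):
    """Turn a string such as "1.4.2" into a tuple of ints, e.g. (1, 4, 2)."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_affected(version):
    """Return True if the version is one of those reported as affected."""
    return parse_version(version) in AFFECTED


# Usage with the installed package (requires stanza):
#   import stanza
#   if is_affected(stanza.__version__):
#       print("Tokens may contain spaces; consider upgrading or downgrading.")
```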
