Hi @fabiocapsouza, I think you have some encoding issues with the HAREM dataset. Take a look at the first sample of FirstHAREM-total-train.json. Words like "ASSOCIAÇÃO" are presented as "ASSOCIA\u00c7\u00c3O" no matter what encoding you try to use to open the file.
Looking at the pre-processing scripts you used, it seems the encoding was not set explicitly when opening the HAREM XML files (which, I think, are originally encoded in Windows-1252). That's probably the root of this encoding issue.
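To make the suggestion concrete, here is a minimal sketch of what I mean. It is not taken from the repository's scripts: the helper names `read_source_text` and `write_json_utf8` are hypothetical, and `cp1252` is only my guess at the source encoding.

```python
import json

# Assumption: the raw HAREM XML files are Windows-1252 (cp1252) encoded.
SOURCE_ENCODING = "cp1252"


def read_source_text(path, encoding=SOURCE_ENCODING):
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(path, encoding=encoding) as f:
        return f.read()


def write_json_utf8(obj, path):
    """Write JSON as UTF-8, keeping accented characters literal instead of escaped."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False stops json.dump from escaping non-ASCII characters.
        json.dump(obj, f, ensure_ascii=False)
```

One caveat: the `\u00c7`-style sequences are standard JSON string escapes, so loading the file with `json.load` should still give back "ASSOCIAÇÃO". `ensure_ascii=False` only changes how the text looks on disk, so it may be worth checking the parsed values as well as the raw file.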