Encoding issue with HAREM dataset #26

Open
jonatasgrosman opened this issue Dec 21, 2020 · 0 comments

Hi @fabiocapsouza, I think there is an encoding issue in the HAREM dataset. Take a look at the first sample of FirstHAREM-total-train.json: words like "ASSOCIAÇÃO" appear as "ASSOCIA\u00c7\u00c3O", no matter which encoding is used to open the file.

Looking at the pre-processing scripts you used, it seems that the encoding was not forced when opening the HAREM XML files (which are originally encoded in Windows-1252, I think). That is probably the root of this encoding issue.
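
Here is a minimal sketch of what I mean by forcing the encoding. It is not taken from the repository's pre-processing scripts. The helper names `read_text` and `write_json`, the file names, and the `doc_text` key are all hypothetical, and it assumes the XML really is Windows-1252 (`cp1252` in Python), which I have not verified.

```python
import json
import os
import tempfile


def read_text(path, encoding="cp1252"):
    """Read a text file, decoding it with an explicitly chosen encoding.

    The cp1252 default is an assumption about the HAREM XML files.
    """
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def write_json(obj, path):
    """Write JSON as UTF-8, keeping non-ASCII characters unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)


# Demo on a throwaway file that stands in for a HAREM XML document.
tmp_dir = tempfile.mkdtemp()
xml_path = os.path.join(tmp_dir, "sample.xml")    # hypothetical file name
json_path = os.path.join(tmp_dir, "sample.json")  # hypothetical file name

with open(xml_path, "wb") as f:
    f.write("<DOC>ASSOCIAÇÃO</DOC>".encode("cp1252"))

text = read_text(xml_path)
write_json({"doc_text": text}, json_path)

# Read the raw bytes back to inspect what was actually written to disk.
with open(json_path, "rb") as f:
    raw_json = f.read().decode("utf-8")
```

Passing `encoding=` explicitly avoids depending on the platform default, and `ensure_ascii=False` is what keeps "Ç" and "Ã" as literal characters in the output file rather than `\uXXXX` escapes.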
