Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit UTF-8 encoding for VASP input files with zopen, and open for other text files #4218

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

DanielYang59
Copy link
Contributor

@DanielYang59 DanielYang59 commented Dec 6, 2024

Summary

Rationale

PEP 597 – Add optional EncodingWarning:

Using the default encoding is a common mistake

Developers using macOS or Linux may forget that the default encoding is not always UTF-8.

For example, using long_description = open("README.md").read() in setup.py is a common mistake. Many Windows users cannot install such packages if there is at least one non-ASCII character (e.g. emoji, author names, copyright symbols, and the like) in their UTF-8-encoded README.md file.

Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII characters in their README, and 82 fail to install from source on non-UTF-8 locales due to not specifying an encoding for a non-ASCII file. [1]

Another example is logging.basicConfig(filename="log.txt"). Some users might expect it to use UTF-8 by default, but the locale encoding is actually what is used. [2]

Meanwhile PEP 686 – Make UTF-8 mode default should resolve this altogether but it wouldn't be in place until Python 3.15

Ruff rule: unspecified-encoding (PLW1514)

@DanielYang59 DanielYang59 changed the title Explicit UTF-8 encoding when reading KPOINTS from file Explicit UTF-8 encoding when reading/writing VASP input files Dec 6, 2024
@QuantumChemist
Copy link
Contributor

As you have started this already, let me know when I shall step in (if needed).

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Dec 6, 2024

Thanks for saying that I could handle this as there aren't too many to fix across the entire code base (~300) and unsafe fixes from ruff is available, should be able to fix them pretty fast (though need to double check the ruff fix as it's in preview)

>>> ruff check . --select PLW1514 --preview

Found 282 errors.
No fixes available (282 hidden fixes can be enabled with the `--unsafe-fixes` option).

If you could help me review changes (if you have time), that would be wonderful :)

@QuantumChemist
Copy link
Contributor

Yes, I have a bit of time to spare for a review 😎
Ping me when you are done ☺️

@DanielYang59 DanielYang59 changed the title Explicit UTF-8 encoding when reading/writing VASP input files Explicit UTF-8 encoding when reading/writing VASP input files, and other open for text files Dec 6, 2024
@DanielYang59 DanielYang59 changed the title Explicit UTF-8 encoding when reading/writing VASP input files, and other open for text files Explicit UTF-8 encoding for VASP input files with zopen, and open for other text files Dec 6, 2024
@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Dec 7, 2024

@QuantumChemist I went through the changes myself and didn't see much issue (not trying to bias you), please help me review those changes if you have time, really appreciate that.

AFAIK, if we don't give a encoding in binary mode for open, there shouldn't be any issue.

I would separate the remaining change to zopen to a separate PR as this one is already huge, and I would need some investigation on the behaviour of zopen as it wraps around multiple gzip/bz2/lzma to make sure encoding arg works for them as well.

Thanks again!

@QuantumChemist
Copy link
Contributor

@DanielYang59 , sorry this weekend was big exhaustion time and today I also had a day off but now I have the time and energy for reviewing x)

Copy link
Contributor

@QuantumChemist QuantumChemist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is not a lot to review here xD
I'm just wondering how we could make sure that no "open ... " line was missed out 🤔

I can also test this implementation on Wednesday (will not be able to test it tomorrow) with a Γ in a KPOINTS file 👀

dev_scripts/update_pt_data.py Show resolved Hide resolved
src/pymatgen/core/__init__.py Show resolved Hide resolved
@DanielYang59 DanielYang59 marked this pull request as draft December 10, 2024 01:57
@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Dec 10, 2024

sorry this weekend was big exhaustion time and today I also had a day off but now I have the time and energy for reviewing x)

That's totally alright, no one would expect you to work on weekends/holidays :) Thanks a lot!


I can also test this implementation on Wednesday (will not be able to test it tomorrow) with a Γ in a KPOINTS file 👀

In fact I tested on my local machine, and of course additional test is always welcome (I should add a test in CI, in TODO list now). However a pitfall may be: your system could be using a default encoding where Γ is decoded correctly (apparently the default Chinese "GB" encoding is not one of them), so please make sure you could recreate the incorrect decoding error before testing the fix (otherwise you would get the correct result with and without this fix).


I'm just wondering how we could make sure that no "open ... " line was missed out

Thanks for noticing this, I originally used the ruff PLW1514 rule to identify them, and indeed it seems to miss some (understandably, as that rule is still in preview mode, I would report the missing ones to them).

Update: looks like I was wrong here, upon a closer look I believe all open within code (some might be in comment 6a90d2d) now has encoding in text mode.

I believe this code shown with #4218 (comment) is at the time of comment and would not be updated automatically, i.e. the open in update_pt_data.py already has encoding.

Not really useful anymore

We might need regular expression for this, something like: \s+open\((?!.*encoding=).*?\)

However this is not "perfect" so I would still need to manually inspect its outcome (it's the best I could do within a rational time frame):

  • It would also report missing encoding for binary mode, this could be avoided by checking if "b" is in mode but I guess it's not worth the effort (mode could be positional so a complete rule is a bit hard to write I guess)
  • It makes the assumption that encoding is given as keyword arguments, something like open(file, "rt", -1, "utf-8") would be flagged (though I haven't seen much usage like this, it's totally unreadable)
  • It's greedy, meaning it may not handle multiple pairs of () correctly like open("encoding=.txt", ...) (crazy and very unlikely edge case though)
  • It cannot search across lines, recent VSCode support multi-line search

An re alternative might be running tests with the PYTHONWARNDEFAULTENCODING env var in PEP 597 and apply a warning filter to flag EncodingWarning as error (add error::EncodingWarning to pytest ini_options):

filterwarnings = [

@DanielYang59 DanielYang59 marked this pull request as ready for review December 10, 2024 06:11
@QuantumChemist
Copy link
Contributor

sorry this weekend was big exhaustion time and today I also had a day off but now I have the time and energy for reviewing x)

That's totally alright, no one would expect you to work on weekends/holidays :) Thanks a lot!

I know that nobody expects to work on weekends, I just thought I would have the time and energy but was not the case against my expectation.

I can also test this implementation on Wednesday (will not be able to test it tomorrow) with a Γ in a KPOINTS file 👀

In fact I tested on my local machine, and of course additional test is always welcome (I should add a test in CI, in TODO list now). However a pitfall may be: your system could be using a default encoding where Γ is decoded correctly (apparently the default Chinese "GB" encoding is not one of them), so please make sure you could recreate the incorrect decoding error before testing the fix (otherwise you would get the correct result with and without this fix).

I'm just wondering how we could make sure that no "open ... " line was missed out

Thanks for noticing this, I originally used the ruff PLW1514 rule to identify them, and indeed it seems to miss some (understandably, as that rule is still in preview mode, I would report the missing ones to them).

Update: looks like I was wrong here, upon a closer look I believe all open within code (some might be in comment 6a90d2d) now has encoding in text mode.

I believe this code shown with #4218 (comment) is at the time of comment and would not be updated automatically, i.e. the open in update_pt_data.py already has encoding.

Not really useful anymore

An EncodingWarning in the pyproject.toml sounds like the most solid option to me.

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Dec 12, 2024

I know that nobody expects to work on weekends, I just thought I would have the time and energy but was not the case against my expectation.

I guess there is always a gap between the ideal image of self and and real us :) You have already done so much BTW

An EncodingWarning in the pyproject.toml sounds like the most solid option to me.

Thanks for the comment, with the unspecified-encoding (PLW1514) rule enable open should have already been covered now, only thing left is zopen AFAIK so I might migrate this guard to #4222 otherwise test would fail from here (update: added in bbd53bf but need more tests to confirm it's really working).

@QuantumChemist
Copy link
Contributor

I wasn't aware that EncodingWarning is only available from Python 3.13 on, though, so maybe your current setup is more than good enough already.

@DanielYang59
Copy link
Contributor Author

I wasn't aware that EncodingWarning is only available from Python 3.13 on, though, so maybe your current setup is more than good enough already.

I would see if I could get the warning filter to work in #4222 so let's discuss there maybe, apparently the filter I added in bbd53bf is not doing what I expected as no error is thrown after I intentionally removed some encoding in 2ccf30c

The EncodingWarning zopen issued is the following custom one but there must have been something wrong with my warning filter syntax:
https://github.com/materialsvirtuallab/monty/blob/26acf0b2900b5074143ed64ccd1bdea6ba9f6705/src/monty/io.py#L25

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Dec 12, 2024

@QuantumChemist Sorry there was a major typo, the EncodingWarning was added in Python 3.10 not Python 3.13

The reason for the custom EncodingWarning in monty.io.zopen: monty only bumped min Python to Python 3.10 two days ago in materialsvirtuallab/monty#709 and at the time zopen change was made Python 3.9 was still supported, so the custom warning could be dropped now :)

@QuantumChemist
Copy link
Contributor

@QuantumChemist Sorry there was a major typo, the EncodingWarning was added in Python 3.10 not Python 3.13

The reason for the custom EncodingWarning in monty.io.zopen: monty only bumped min Python to Python 3.10 two days ago in materialsvirtuallab/monty#709 and at the time zopen change was made Python 3.9 was still supported, so the custom warning could be dropped now :)

I see it was a typo xD

I would see if I could get the warning filter to work in #4222 so let's discuss there maybe

yeah, I will have a closer look in the other PR too :)

@shyuep
Copy link
Member

shyuep commented Dec 12, 2024

Is this ready to be merged?

@DanielYang59
Copy link
Contributor Author

Is this ready to be merged?

Yes as far as I'm aware, thank you!

@QuantumChemist
Copy link
Contributor

Is this ready to be merged?

yes, ready to be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

For band structure, how to correctly display the symbol of the \Gamma point
3 participants