Explicit UTF-8 encoding for VASP input files with `zopen`, and `open` for other text files #4218

DanielYang59 · 2024-12-06T08:36:47Z

Summary

- Use explicit UTF-8 encoding when reading/writing VASP input files with zopen, to fix #4214 (For band structure, how to correctly display the symbol of the \Gamma point)
- Use explicit UTF-8 encoding for the standard open
- Make pytest ignore only UserWarning by default, instead of all warnings
- Add a test for a non-ASCII character in the Kpoints comment

Rationale

Using the default encoding is a common mistake

Developers using macOS or Linux may forget that the default encoding is not always UTF-8.

For example, using long_description = open("README.md").read() in setup.py is a common mistake. Many Windows users cannot install such packages if there is at least one non-ASCII character (e.g. emoji, author names, copyright symbols, and the like) in their UTF-8-encoded README.md file.

Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII characters in their README, and 82 fail to install from source on non-UTF-8 locales due to not specifying an encoding for a non-ASCII file. [1]

Another example is logging.basicConfig(filename="log.txt"). Some users might expect it to use UTF-8 by default, but the locale encoding is actually what is used. [2]

Meanwhile, PEP 686 ("Make UTF-8 mode default") should resolve this altogether, but it will not be in place until Python 3.15.

Ruff rule: unspecified-encoding (PLW1514)
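
To make the pitfall quoted above concrete, here is a minimal sketch (not code from this PR). It writes a UTF-8 file containing non-ASCII characters and reads it back twice: once with an explicit `encoding="utf-8"`, and once with `cp1252`, which is only a stand-in for whatever non-UTF-8 locale default a user's machine might have. The file name and the `read_text_as` helper are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_text_as(path: Path, encoding: str) -> Optional[str]:
    """Read ``path`` as text with ``encoding``; return None if decoding fails."""
    try:
        with open(path, encoding=encoding) as file:
            return file.read()
    except UnicodeDecodeError:
        return None


original = "Band path: Γ-X-M-Γ © 2024\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    readme = Path(tmp_dir) / "README.md"  # hypothetical file name
    # State the encoding when writing text as well as when reading it.
    with open(readme, "w", encoding="utf-8") as file:
        file.write(original)

    explicit = read_text_as(readme, "utf-8")
    # cp1252 simulates what a bare open(readme) may pick on a non-UTF-8 locale.
    simulated_default = read_text_as(readme, "cp1252")

print(explicit == original)           # explicit UTF-8 round-trips the text
print(simulated_default == original)  # the simulated locale default does not
```

Depending on the bytes involved, the wrong encoding either raises `UnicodeDecodeError` or silently produces mojibake; the helper treats both as a failed round trip.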

QuantumChemist · 2024-12-06T09:05:06Z

As you have started this already, let me know when I shall step in (if needed).

DanielYang59 · 2024-12-06T12:56:15Z

Thanks for saying that. I could handle this, as there aren't too many to fix across the entire code base (~300) and unsafe fixes from ruff are available, so I should be able to fix them pretty fast (though I need to double-check the ruff fixes, as the rule is still in preview).

>>> ruff check . --select PLW1514 --preview

Found 282 errors.
No fixes available (282 hidden fixes can be enabled with the `--unsafe-fixes` option).

If you could help me review changes (if you have time), that would be wonderful :)

QuantumChemist · 2024-12-06T13:09:38Z

Yes, I have a bit of time to spare for a review 😎
Ping me when you are done ☺️

dev_scripts/update_pt_data.py

tests/phonon/test_thermal_displacements.py

DanielYang59 · 2024-12-07T01:48:34Z

@QuantumChemist I went through the changes myself and didn't see any major issues (not trying to bias you). Please help me review those changes if you have time, I would really appreciate that.

AFAIK, omitting the encoding when calling open in binary mode should not be an issue.

I would move the remaining zopen changes to a separate PR, as this one is already huge, and I need to investigate the behaviour of zopen (it wraps gzip/bz2/lzma) to make sure the encoding arg works for all of them as well.
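
As a rough illustration of the two points above, the following sketch uses only the standard library (it does not touch `zopen` itself, whose behaviour is exactly what still needs checking). It shows that `gzip.open`, `bz2.open` and `lzma.open` accept `encoding` in text mode, and that the built-in `open` rejects `encoding` in binary mode. File names are made up for the example.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

text = "Γ-point\n"
openers = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}
round_trips = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    for suffix, opener in openers.items():
        path = Path(tmp_dir) / f"KPOINTS{suffix}"  # hypothetical file name
        # Text modes ("wt"/"rt") accept an explicit encoding.
        with opener(path, "wt", encoding="utf-8") as file:
            file.write(text)
        with opener(path, "rt", encoding="utf-8") as file:
            round_trips[suffix] = file.read()

    # Binary mode deals in bytes, so the built-in open refuses an encoding.
    plain = Path(tmp_dir) / "data.bin"
    plain.write_bytes(text.encode("utf-8"))
    try:
        open(plain, "rb", encoding="utf-8")
    except ValueError as exc:
        binary_error = str(exc)
    else:
        binary_error = None

print(round_trips)
print(binary_error)
```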

Thanks again!

QuantumChemist · 2024-12-09T20:34:21Z

@DanielYang59 , sorry this weekend was big exhaustion time and today I also had a day off but now I have the time and energy for reviewing x)

QuantumChemist

I think there is not a lot to review here xD
I'm just wondering how we could make sure that no "open ... " line was missed out 🤔

I can also test this implementation on Wednesday (will not be able to test it tomorrow) with a Γ in a KPOINTS file 👀

dev_scripts/update_pt_data.py

src/pymatgen/core/__init__.py

DanielYang59 · 2024-12-10T02:26:15Z

> sorry this weekend was big exhaustion time and today I also had a day off but now I have the time and energy for reviewing x)

That's totally alright, no one would expect you to work on weekends/holidays :) Thanks a lot!

> I can also test this implementation on Wednesday (will not be able to test it tomorrow) with a Γ in a KPOINTS file 👀

In fact, I tested this on my local machine, but additional testing is always welcome (I should add a test to CI; it is on my TODO list now). One pitfall, however: your system could be using a default encoding under which Γ is decoded correctly (apparently the default Chinese "GB" encoding is not one of them), so please make sure you can reproduce the incorrect decoding before testing the fix (otherwise you would get the correct result both with and without this fix).
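
One way to check for this pitfall up front is sketched below. The `would_misdecode` helper is hypothetical (it is not part of pymatgen); it simply asks whether the UTF-8 bytes of a string survive being decoded with a given encoding, and applies that to the locale default reported by `locale.getpreferredencoding(False)`.

```python
import locale


def would_misdecode(text: str, encoding: str) -> bool:
    """Return True if the UTF-8 bytes of ``text`` do not decode back to ``text``."""
    raw = text.encode("utf-8")
    try:
        return raw.decode(encoding) != text
    except UnicodeDecodeError:
        # A hard failure also counts as "not decoded correctly".
        return True


default_encoding = locale.getpreferredencoding(False)
print(f"Default text encoding on this machine: {default_encoding}")
print("Bug reproducible with the default:", would_misdecode("Γ", default_encoding))
print("GBK misdecodes Γ:", would_misdecode("Γ", "gbk"))
print("UTF-8 misdecodes Γ:", would_misdecode("Γ", "utf-8"))
```

If the second line prints `False`, the machine cannot reproduce the original problem, so a passing result there says nothing about the fix.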

> I'm just wondering how we could make sure that no "open ... " line was missed out

Thanks for noticing this, I originally used the ruff PLW1514 rule to identify them, ~~and indeed it seems to miss some (understandably, as that rule is still in preview mode, I would report the missing ones to them).~~

Update: it looks like I was wrong here. On closer inspection, I believe every open in the code (a few were in comments, fixed in 6a90d2d) now specifies an encoding in text mode.

I believe the code shown with #4218 (comment) is a snapshot from the time of the comment and is not updated automatically, i.e. the open in update_pt_data.py already has an encoding.

Not really useful anymore

We might need a regular expression for this, something like: `\s+open\((?!.*encoding=).*?\)`

However, this is not "perfect", so I would still need to manually inspect its results (it's the best I could do within a reasonable time frame):

- It also reports a missing encoding for binary mode. This could be avoided by checking whether "b" is in the mode, but I guess it's not worth the effort (mode can be positional, so a complete rule is a bit hard to write).
- It assumes that encoding is given as a keyword argument, so something like open(file, "rt", -1, "utf-8") would be flagged (though I haven't seen much usage like this, as it's totally unreadable).
- The lookahead scans the rest of the line, so any `encoding=` after `open(` suppresses the match, even inside a string as in open("encoding=.txt", ...) (a crazy and very unlikely edge case, though).
- It cannot match calls that span multiple lines (recent VS Code versions do support multi-line search).
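
The limitations listed above can be checked directly. The sketch below applies the same pattern line by line with Python's `re` module; the sample lines are invented for illustration and are not taken from the pymatgen code base.

```python
import re

# The pattern from the comment above, searched within a single line at a time.
MISSING_ENCODING = re.compile(r"\s+open\((?!.*encoding=).*?\)")

samples = {
    "missing": '    with open(path) as f:',
    "explicit": '    with open(path, encoding="utf-8") as f:',
    "binary": '    with open(path, "rb") as f:',
    "positional": '    f = open(path, "rt", -1, "utf-8")',
    "in_filename": '    f = open("encoding=.txt")',
    "multi_line_start": '    with open(',
}

flagged = {name: bool(MISSING_ENCODING.search(line)) for name, line in samples.items()}

for name, is_flagged in flagged.items():
    print(f"{name:>16}: {'flagged' if is_flagged else 'not flagged'}")
```

The `binary` and `positional` lines are false positives, while `in_filename` and `multi_line_start` are false negatives, which is why the output still needs manual inspection.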

An alternative to the regex might be to run the tests with the PYTHONWARNDEFAULTENCODING env var from PEP 597 and apply a warning filter that turns EncodingWarning into an error (add `error::EncodingWarning` to the pytest ini_options):

pymatgen/pyproject.toml

Line 250 in 31f1e1f

filterwarnings = [

tests/io/vasp/test_inputs.py

QuantumChemist · 2024-12-10T14:23:14Z

> > sorry this weekend was big exhaustion time and today I also had a day off but now I have the time and energy for reviewing x)
>
> That's totally alright, no one would expect you to work on weekends/holidays :) Thanks a lot!

I know that nobody is expected to work on weekends; I just thought I would have the time and energy, but contrary to my expectation that was not the case.

An EncodingWarning in the pyproject.toml sounds like the most solid option to me.

DanielYang59 · 2024-12-12T02:27:38Z

> I know that nobody is expected to work on weekends; I just thought I would have the time and energy, but contrary to my expectation that was not the case.

I guess there is always a gap between our ideal image of ourselves and the real us :) You have already done so much, BTW.

> An EncodingWarning in the pyproject.toml sounds like the most solid option to me.

Thanks for the comment. With the unspecified-encoding (PLW1514) rule enabled, open should already be covered; the only thing left is zopen, AFAIK, so I might move this guard to #4222, since otherwise tests would fail here (update: added in bbd53bf, but more tests are needed to confirm it's really working).

QuantumChemist · 2024-12-12T09:53:45Z

I wasn't aware that EncodingWarning is only available from Python 3.13 on, though, so maybe your current setup is more than good enough already.

DanielYang59 · 2024-12-12T10:48:32Z

> I wasn't aware that EncodingWarning is only available from Python 3.13 on, though, so maybe your current setup is more than good enough already.

I will see if I can get the warning filter to work in #4222, so maybe let's discuss it there. Apparently the filter I added in bbd53bf is not doing what I expected, as no error is raised after I intentionally removed some encodings in 2ccf30c.

The EncodingWarning that zopen issues is the following custom one, but there must be something wrong with my warning filter syntax:
https://github.com/materialsvirtuallab/monty/blob/26acf0b2900b5074143ed64ccd1bdea6ba9f6705/src/monty/io.py#L25

DanielYang59 · 2024-12-12T10:52:12Z

@QuantumChemist Sorry there was a major typo, the EncodingWarning was added in Python 3.10 not Python 3.13

The reason for the custom EncodingWarning in monty.io.zopen: monty only bumped its minimum Python version to 3.10 two days ago in materialsvirtuallab/monty#709, and Python 3.9 was still supported at the time the zopen change was made, so the custom warning could be dropped now :)
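
For readers unfamiliar with this kind of compatibility shim, the general pattern looks roughly like the sketch below. This is not monty's actual code: `EncodingWarningCompat` and `open_text` are hypothetical names, and the wrapper only illustrates warning on a missing encoding and then defaulting to UTF-8.

```python
import builtins
import tempfile
import warnings
from pathlib import Path

# Reuse the built-in class where it exists (Python 3.10+); otherwise define a stand-in.
if hasattr(builtins, "EncodingWarning"):
    EncodingWarningCompat = builtins.EncodingWarning
else:

    class EncodingWarningCompat(Warning):
        """Fallback for interpreters older than Python 3.10."""


def open_text(path, mode="rt", encoding=None):
    """Hypothetical wrapper: warn on a missing encoding, then default to UTF-8."""
    if "b" in mode:
        raise ValueError("open_text is for text mode only")
    if encoding is None:
        warnings.warn(
            "No encoding given, falling back to UTF-8",
            EncodingWarningCompat,
            stacklevel=2,
        )
        encoding = "utf-8"
    return open(path, mode, encoding=encoding)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "POSCAR"  # hypothetical file name
    path.write_text("Γ\n", encoding="utf-8")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        with open_text(path) as file:
            content = file.read()

warned = [entry.category for entry in caught]
print(content, warned)
```

Once the minimum supported Python is 3.10, the `else` branch is dead code, which matches the point that the custom class can be dropped.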

QuantumChemist · 2024-12-12T13:15:17Z

I see it was a typo xD

> I will see if I can get the warning filter to work in #4222, so maybe let's discuss it there.

yeah, I will have a closer look in the other PR too :)

shyuep · 2024-12-12T22:35:18Z

Is this ready to be merged?

DanielYang59 · 2024-12-13T01:22:45Z

> Is this ready to be merged?

Yes as far as I'm aware, thank you!

QuantumChemist · 2024-12-13T15:53:44Z

> Is this ready to be merged?

yes, ready to be merged.

DanielYang59 added 2 commits December 6, 2024 16:35

explicit utf-8 encoding for kpoints from file

597ab65

explicit utf-8 elsewhere

5e41f1a

DanielYang59 changed the title ~~Explicit UTF-8 encoding when reading KPOINTS from file~~ Explicit UTF-8 encoding when reading/writing VASP input files Dec 6, 2024

fix root level and dev_scripts

1767195

simplify PMG PKG path

f575e74

DanielYang59 changed the title ~~Explicit UTF-8 encoding when reading/writing VASP input files~~ Explicit UTF-8 encoding when reading/writing VASP input files, and other open for text files Dec 6, 2024

fix analysis, cli, command_line

052e949

DanielYang59 commented Dec 6, 2024 on dev_scripts/update_pt_data.py (resolved)

DanielYang59 added 2 commits December 6, 2024 21:35

fix electronic_structure, entries and ext

9d09765

fix io, phonon and symmetry

3f7b180

DanielYang59 changed the title ~~Explicit UTF-8 encoding when reading/writing VASP input files, and other open for text files~~ Explicit UTF-8 encoding for VASP input files with zopen, and open for other text files Dec 6, 2024

fix alchemy and anlysis tests

bd90e90

DanielYang59 force-pushed the kpoints-encoding branch from 7ce8014 to bd90e90 Compare December 6, 2024 13:47

DanielYang59 added 3 commits December 6, 2024 21:51

fix apps, command_line, core, elec_struct, entries, ext and vis tests

5b8ced4

finish io and phonon tests

b8d3b75

remove unnecessary seek

c54d772

DanielYang59 force-pushed the kpoints-encoding branch from 448ec77 to c54d772 Compare December 6, 2024 14:05

DanielYang59 commented Dec 6, 2024 on tests/phonon/test_thermal_displacements.py (resolved)

DanielYang59 added 2 commits December 6, 2024 22:30

revert encoding for json dump

bea91bd

type custom paths

e58a4ed

DanielYang59 mentioned this pull request Dec 6, 2024

For band structure, how to correctly display the symbol of the \Gamma point #4214

Closed

DanielYang59 added 5 commits December 6, 2024 22:45

revert another json dump

8a0490c

ignore userwarning by default

0d9de77

relocate test-only env var

5af79f7

remove unneeded default tag for non-userwarning

308597a

also explicit utf-8 for json dump though forced ASCII

1cd1aac

DanielYang59 marked this pull request as ready for review December 7, 2024 01:48

DanielYang59 requested review from shyuep and mkhorton as code owners December 7, 2024 01:48

DanielYang59 mentioned this pull request Dec 7, 2024

zopen changes: forbid implicit binary/text mode, signature change, default UTF-8 encoding in text mode, drop .z support after one-year materialsvirtuallab/monty#730

Merged

2 tasks

utf8 is alias to utf-8 in codecs, but maybe prefer the standard name

4206b7d

QuantumChemist reviewed Dec 9, 2024: dev_scripts/update_pt_data.py (resolved), src/pymatgen/core/__init__.py (resolved)

DanielYang59 marked this pull request as draft December 10, 2024 01:57

DanielYang59 added 2 commits December 10, 2024 12:45

fix missing encoding in comment

6a90d2d

add test for Γ decoding

436356f

DanielYang59 commented Dec 10, 2024 on tests/io/vasp/test_inputs.py (outdated, resolved)

better error message

2608e8a

DanielYang59 marked this pull request as ready for review December 10, 2024 06:11

shyuep and others added 3 commits December 10, 2024 18:00

Merge branch 'master' into kpoints-encoding

9259f13

Merge branch 'master' into kpoints-encoding

ff46384

Merge branch 'master' into kpoints-encoding

59148a0

Merge branch 'master' into kpoints-encoding

140b8b1

Merge remote-tracking branch 'upstream/master' into kpoints-encoding

25e5a38
