Automatically detect character encoding of YAML files and ignore files #630

Open: 6 commits to merge into base: master

Commits on Feb 25, 2024

  1. tests: Use correct encoding for path

    Before this change, build_temp_workspace() would always encode a path
    using UTF-8 and the strict error handler [1]. Most of the time, this is
    fine, but systems do not necessarily use UTF-8 and the strict error
    handler for paths [2].
    
    [1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode>
    [2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>
    Jayman2000 committed Feb 25, 2024
    Commit a6203cd
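
As a rough illustration of the point in commit 1, the sketch below encodes a path with the filesystem encoding and error handler instead of hard-coding UTF-8 and "strict". The helper name `encode_path` is hypothetical and is not taken from the test suite; `os.fsencode()`, `os.fsdecode()`, `sys.getfilesystemencoding()` and `sys.getfilesystemencodeerrors()` are standard library functions.

```python
import os
import sys


def encode_path(path: str) -> bytes:
    """Hypothetical helper: encode a path the way the interpreter does for the OS."""
    # os.fsencode() applies the filesystem encoding and error handler,
    # which are not necessarily UTF-8 and "strict".
    return os.fsencode(path)


if __name__ == "__main__":
    # Show which encoding and error handler this interpreter uses for paths.
    print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
    # os.fsdecode() reverses os.fsencode().
    print(os.fsdecode(encode_path("workspace/a.yaml")))
```

On a system whose filesystem encoding is UTF-8, this gives the same bytes as `path.encode("utf-8")` for ordinary names; the two can differ on systems configured otherwise.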
  2. tests: Restore stdout and stderr

    Before this commit, test_run_default_format_output_in_tty() changed the
    values of sys.stdout and sys.stderr, but it would never change them
    back. This commit makes sure that they get changed back.
    
    At the moment, this commit doesn’t make a user-visible difference. A
    future commit will add a new test named test_ignored_from_file_with_multiple_encodings().
    That new test requires stdout and stderr to be restored, or else it will
    fail.
    Jayman2000 committed Feb 25, 2024
    Commit 95b4544
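
The restore pattern described in commit 2 can be sketched with try/finally. This is a minimal illustration, not the code from the test suite; `run_with_captured_output` is a hypothetical name.

```python
import io
import sys


def run_with_captured_output(func):
    """Hypothetical helper: call func() with stdout/stderr replaced, then restore them."""
    old_stdout, old_stderr = sys.stdout, sys.stderr
    sys.stdout, sys.stderr = io.StringIO(), io.StringIO()
    try:
        func()
        return sys.stdout.getvalue(), sys.stderr.getvalue()
    finally:
        # Runs even if func() raises, so later tests see the real streams again.
        sys.stdout, sys.stderr = old_stdout, old_stderr
```

In a `unittest.TestCase`, registering the restore step with `self.addCleanup()` would be another way to get the same guarantee.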
  3. decoder: Autodetect encoding of YAML files

    Before this change, yamllint would open YAML files using open()’s
    default encoding. As long as UTF-8 mode isn’t enabled, open() defaults
    to using the system’s locale encoding [1][2].
    
    Most of the time, the locale encoding on Linux systems is UTF-8 [3][4],
    but it doesn’t have to be [5]. Additionally, the locale encoding on
    Windows systems is the system’s ANSI code page [6]. As a result, you
    would have to either enable UTF-8 mode, give Python a custom manifest or
    enable a beta feature in Windows settings in order to lint UTF-8 YAML
    files on Windows [2][7].
    
    Finally, using open()’s default encoding is a violation of the YAML
    spec. Chapter 5.2 says:
    
    	“On input, a YAML processor must support the UTF-8 and UTF-16
    	character encodings. For JSON compatibility, the UTF-32
    	encodings must also be supported.
    
    	If a character stream begins with a byte order mark, the
    	character encoding will be taken to be as indicated by the byte
    	order mark. Otherwise, the stream must begin with an ASCII
    	character. This allows the encoding to be deduced by the pattern
    	of null (x00) characters.” [8]
    
    This change fixes all of those problems by implementing the YAML spec’s
    character encoding detection algorithm. Now, as long as a YAML file
    begins with either a byte order mark or an ASCII character, yamllint
    will automatically detect it as being UTF-8, UTF-16 or UTF-32. Other
    character encodings are not supported at the moment.
    
    Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347.
    
    [1]: <https://docs.python.org/3.12/library/functions.html#open>
    [2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
    [3]: <https://sourceware.org/glibc/manual/html_node/Extended-Char-Intro.html>
    [4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
    [5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
    [6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
    [7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
    [8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
    Jayman2000 committed Feb 25, 2024
    Commit 3279290
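
The detection rule quoted in commit 3 (a byte order mark, or else the pattern of null bytes around a leading ASCII character) can be sketched as follows. This is an illustration of the rule in chapter 5.2 of the YAML spec under stated assumptions, not the code added by the commit; the function name `guess_yaml_encoding` is hypothetical, and the returned strings are Python codec names.

```python
import codecs


def guess_yaml_encoding(data: bytes) -> str:
    """Hypothetical sketch: pick a Python codec name from the first bytes of a stream."""
    # UTF-32 must be checked before UTF-16: the UTF-32-LE BOM starts
    # with the same two bytes as the UTF-16-LE BOM.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf_32"  # this codec consumes the BOM and uses its byte order
    if len(data) >= 4 and data.startswith(b"\x00\x00\x00"):
        return "utf_32_be"  # 00 00 00 xx: ASCII character, big-endian
    if data[1:4] == b"\x00\x00\x00":
        return "utf_32_le"  # xx 00 00 00: ASCII character, little-endian
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf_16"
    if len(data) >= 2 and data.startswith(b"\x00"):
        return "utf_16_be"  # 00 xx
    if data[1:2] == b"\x00":
        return "utf_16_le"  # xx 00
    if data.startswith(codecs.BOM_UTF8):
        return "utf_8_sig"  # strips the UTF-8 BOM while decoding
    return "utf_8"  # default when nothing else matches


if __name__ == "__main__":
    raw = "key: value\n".encode("utf_16")
    print(guess_yaml_encoding(raw), repr(raw.decode(guess_yaml_encoding(raw))))
```

A file that starts with a non-ASCII character and has no byte order mark falls outside the rule, so the sketch can misidentify it.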
  4. decoder: Autodetect encoding for ignore-from-file

    Before this change, yamllint would decode files on the ignore-from-file
    list using open()’s default encoding [1][2]. This can cause decoding to
    fail on some systems and succeed on other systems (see the previous
    commit message for details).
    
    This change makes yamllint automatically detect the encoding for files
    on the ignore-from-file list. It uses the same algorithm that it uses
    for detecting the encoding of YAML files, so the same limitations apply:
    files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a
    byte order mark or an ASCII character.
    
    [1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input>
    [2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>
    Jayman2000 committed Feb 25, 2024
    Commit b822abe
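
To illustrate why commit 4 decodes each ignore-from-file file separately, the sketch below reads every file in binary mode and decodes it with whatever encoding a detector reports for that file. All names here are hypothetical. `stand_in_detector` is a deliberately incomplete placeholder (it only recognizes a UTF-16 byte order mark and otherwise assumes UTF-8); a real detector would implement the full algorithm described in the previous commit message.

```python
import codecs
import os
import tempfile


def stand_in_detector(data: bytes) -> str:
    """Placeholder only: recognizes a UTF-16 BOM, otherwise assumes UTF-8."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf_16"
    return "utf_8_sig"  # decodes UTF-8 with or without a BOM


def read_ignore_patterns(paths, detect=stand_in_detector):
    """Hypothetical helper: collect non-blank lines from files in mixed encodings."""
    patterns = []
    for path in paths:
        # Read bytes first so that each file can get its own encoding.
        with open(path, "rb") as f:
            data = f.read()
        text = data.decode(detect(data))
        patterns.extend(line for line in text.splitlines() if line.strip())
    return patterns


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "first.ignore")
        second = os.path.join(tmp, "second.ignore")
        with open(first, "wb") as f:
            f.write("*.dont-lint-me.yaml\n".encode("utf_8"))
        with open(second, "wb") as f:
            f.write("/bin/\n".encode("utf_16"))
        print(read_ignore_patterns([first, second]))
```

Passing a single `encoding` argument to one call that reads all the files would not handle this mix, because the two files need different codecs.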
  5. tests: Stop using open()’s default encoding

    In general, using open()’s default encoding is a mistake [1]. This
    change makes sure that every time open() is called, the encoding
    parameter is specified. Specifically, it makes it so that all tests
    succeed when run like this:
    
    	python -X warn_default_encoding -W error::EncodingWarning -m unittest discover
    
    [1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>
    Jayman2000 committed Feb 25, 2024
    Commit dd8318a
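
A minimal sketch of the practice described in commit 5: every text-mode open() call names its encoding, so the result does not depend on the locale and no EncodingWarning is raised under the command shown above. The helper name `roundtrip` is hypothetical.

```python
import os
import tempfile


def roundtrip(path, text):
    """Hypothetical helper: write text and read it back with an explicit encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # Omitting encoding= here would fall back to the locale encoding
    # (unless UTF-8 mode is enabled) and could garble non-ASCII text.
    with open(path, encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(roundtrip(os.path.join(tmp, "demo.yaml"), "clé: valeur\n"))
```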
  6. CI: Fail when open()’s default encoding is used

    The previous few commits have removed all calls to open() that use its
    default encoding. That being said, it’s still possible that code added
    in the future will contain that same mistake. This commit makes it so
    that the CI test job will fail if that mistake is made again.
    
    Unfortunately, it doesn’t look like coverage.py allows you to specify -X
    options [1] or warning filters [2] when running your tests [3]. As a
    result, the CI test job will also fail if coverage.py uses open()’s
    default encoding. Hopefully, coverage.py won’t do that. If it does, then
    we can always temporarily revert this commit.
    
    [1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X>
    [2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W>
    [3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>
    Jayman2000 committed Feb 25, 2024
    Commit d569de6
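
When a tool does not accept -X or -W on its command line, the same settings can be supplied through the environment. The sketch below assumes that approach; it is not necessarily how this PR's CI job is configured. `PYTHONWARNDEFAULTENCODING` (Python 3.10+) and `PYTHONWARNINGS` are standard CPython environment variables, and the helper name `run_checked` is hypothetical. Because the variables apply to the whole process, any code that runs in it, including the test runner itself, is subject to the same check.

```python
import os
import subprocess
import sys


def run_checked(code: str) -> subprocess.CompletedProcess:
    """Hypothetical helper: run code in a child interpreter that treats
    EncodingWarning as an error."""
    env = dict(
        os.environ,
        PYTHONWARNDEFAULTENCODING="1",  # same effect as -X warn_default_encoding
        PYTHONWARNINGS="error::EncodingWarning",  # same effect as -W error::EncodingWarning
    )
    return subprocess.run([sys.executable, "-c", code], env=env, capture_output=True)


if __name__ == "__main__":
    good = run_checked("import os; open(os.devnull, encoding='utf-8').close()")
    bad = run_checked("import os; open(os.devnull).close()")
    # On Python 3.10 or later, the second call is expected to exit with an error.
    print(good.returncode, bad.returncode)
```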