Regression in tokenizer handling of \r #128233

Open
tusharsadhwani opened this issue Dec 25, 2024 · 2 comments
Labels
topic-parser type-bug An unexpected behavior, bug, or error

Comments

@tusharsadhwani
Contributor

tusharsadhwani commented Dec 25, 2024

Bug report

Bug description:

From Python 3.12 onwards, tokenizing a file that contains only '{\r}' produces an odd '\r}' OP token:

$ printf '{\r}' | python3.11 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,2:            ERRORTOKEN     '\r'           
1,2-1,3:            OP             '}'            
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''             

$ printf '{\r}' | python3.12 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,3:            OP             '\r}'          
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''   
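
The same comparison can be run from inside Python, which makes it easier to check several interpreters in a loop. This is only a minimal sketch: `token_summary` is a hypothetical helper name, and it feeds the bytes through `tokenize.tokenize()` rather than through stdin as `python -m tokenize` does, on the assumption that both paths reach the same tokenizer.

```python
import io
import sys
import tokenize


def token_summary(source: bytes) -> list[tuple[str, str]]:
    """Return (token type name, token string) pairs for the given source bytes."""
    readline = io.BytesIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.tokenize(readline)
    ]


if __name__ == "__main__":
    # The output depends on the interpreter version, so print it alongside.
    print("Python", ".".join(map(str, sys.version_info[:2])))
    for name, string in token_summary(b"{\r}"):
        print(f"{name:<12}{string!r}")
```

Unlike the command-line output above, the list starts with an ENCODING token, because `tokenize.tokenize()` detects the source encoding first.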

Weirdly, AST generation passes just fine in both cases:

$ printf '{\r}' | python3.11 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])

$ printf '{\r}' | python3.12 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])
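
A rough in-process equivalent of the `-m ast` check, assuming that `ast.parse()` on bytes goes through the same newline handling as the command-line tool. `parses_as_empty_dict` is a hypothetical helper name used only for illustration:

```python
import ast


def parses_as_empty_dict(source: bytes) -> bool:
    """Return True if the source parses to a single empty dict expression."""
    tree = ast.parse(source)
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.Expr):
        return False
    value = tree.body[0].value
    return isinstance(value, ast.Dict) and not value.keys and not value.values


if __name__ == "__main__":
    print(parses_as_empty_dict(b"{\r}"))
```

It inspects the tree structure instead of comparing `ast.dump()` strings, since the dump format is not identical across Python versions.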

Expected behaviour

I'd expect the \r to yield an NL token instead, followed by a separate } OP token.

CPython versions tested on:

3.11, 3.12, 3.13, 3.14

Operating systems tested on:

macOS

@tusharsadhwani
Contributor Author

There's one more interesting case, where the tokenizer seems to treat '\r ' as a non-whitespace token:

$ printf 'foo\n\r \nbar' | python3.11 -m tokenize
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,3:            NL             '\r \n'        
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''             

$ printf 'foo\n\r \nbar' | python3.12 -m tokenize                  
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,2:            OP             '\r '          
2,2-2,3:            NEWLINE        '\n'           
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''    

I would have expected the 2,2-2,3: NEWLINE '\n' case in Python 3.12 to be NL instead, as there is no semantic meaning to that newline. Python 3.11 categorizes that correctly as NL.

@serhiy-storchaka
Member

The behavior changed because the tokenize module now uses the C tokenizer, and it turns out that a lone CR is not always recognized as a newline separator. I tried to fix this today, but it is not so easy. To begin with, the code is read line by line, but C's fgets() and Python's readline() read up to the LF byte (this works for CRLF newlines, but not for a lone CR).
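
The line-reading point can be illustrated in pure Python. This is only a sketch of the symptom, not of the C tokenizer's code path: it uses an in-memory `io.BytesIO` stream, and `read_lines` is a hypothetical helper name.

```python
import io


def read_lines(raw: bytes) -> list[bytes]:
    """Collect lines by calling readline() on a binary stream until EOF."""
    stream = io.BytesIO(raw)
    return list(iter(stream.readline, b""))


if __name__ == "__main__":
    data = b"first\rsecond\nthird"
    # readline() only stops at LF, so the lone CR stays inside the first line.
    print(read_lines(data))
    # bytes.splitlines() treats CR, LF and CRLF all as line boundaries.
    print(data.splitlines(keepends=True))
```

With CRLF input the two approaches agree, because each line still ends in an LF byte; they only diverge on a lone CR.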

@serhiy-storchaka serhiy-storchaka self-assigned this Dec 30, 2024