Regression in tokenizer handling of `\r` #128233

tusharsadhwani · 2024-12-25T00:48:05Z

Bug report

Bug description:

From Python 3.12 onwards, we get a weird \r} token when tokenizing a file containing just '{\r}':

$ printf '{\r}' | python3.11 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,2:            ERRORTOKEN     '\r'           
1,2-1,3:            OP             '}'            
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''             

$ printf '{\r}' | python3.12 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,3:            OP             '\r}'          
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''
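
The same comparison can be run from inside Python rather than through the shell. This is a minimal sketch, not part of the original report: `token_summary` is a hypothetical helper name, and it assumes that `io.StringIO` with its default newline handling is a fair stand-in for the piped stdin above (it splits lines on LF only and leaves `\r` untranslated). The token list it returns depends on the interpreter version, as the two transcripts show.

```python
import io
import tokenize


def token_summary(source):
    """Return (type name, token string) pairs for a source string.

    Hypothetical helper for illustration only.
    """
    # StringIO's default newline handling terminates lines on '\n' only
    # and does not translate '\r', so the tokenizer sees the raw CR.
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


if __name__ == "__main__":
    # Output differs between 3.11 and 3.12+, mirroring the transcripts.
    for name, string in token_summary("{\r}"):
        print(f"{name:<12}{string!r}")
```

Running this under two interpreters and diffing the output should show whether a given build has the `'\r}'` token.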

Weirdly, AST generation passes just fine in both cases:

$ printf '{\r}' | python3.11 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])

$ printf '{\r}' | python3.12 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])
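
The AST check can also be sketched directly with `ast.parse`. This assumes that parsing the string in-process behaves like `python -m ast` reading the piped input; `first_expr_type` is a hypothetical helper name used only for this illustration.

```python
import ast


def first_expr_type(source):
    """Parse source and return the class name of the first statement's value.

    Hypothetical helper for illustration only.
    """
    tree = ast.parse(source)
    return type(tree.body[0].value).__name__


if __name__ == "__main__":
    # The lone CR sits between the braces of an empty dict display.
    print(first_expr_type("{\r}"))
```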

Expected behaviour

I'd expect the \r to yield an NL token instead, followed by a separate } OP token.

CPython versions tested on:

3.11, 3.12, 3.13, 3.14

Operating systems tested on:

macOS

tusharsadhwani · 2024-12-25T15:36:57Z

There's one more interesting case, where the tokenizer seems to think that '\r ' is a non-whitespace token:

$ printf 'foo\n\r \nbar' | python3.11 -m tokenize
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,3:            NL             '\r \n'        
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''             

$ printf 'foo\n\r \nbar' | python3.12 -m tokenize                  
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,2:            OP             '\r '          
2,2-2,3:            NEWLINE        '\n'           
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''

I would have expected the 2,2-2,3: NEWLINE '\n' token in Python 3.12 to be NL instead, as that newline has no semantic meaning. Python 3.11 correctly categorizes it as NL.
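
To look only at the NEWLINE versus NL distinction, the token stream can be filtered down to those two types. This is a rough sketch under the same assumption as before (a default `io.StringIO` standing in for piped stdin); `newline_tokens` is a hypothetical name, and the result varies by interpreter version.

```python
import io
import tokenize


def newline_tokens(source):
    """Return (type name, token string, start row) for NEWLINE and NL tokens.

    Hypothetical helper for illustration only.
    """
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string, tok.start[0])
        for tok in tokenize.generate_tokens(readline)
        if tok.type in (tokenize.NEWLINE, tokenize.NL)
    ]


if __name__ == "__main__":
    # On row 2, compare whether the token is reported as NL or NEWLINE.
    for entry in newline_tokens("foo\n\r \nbar"):
        print(entry)
```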

serhiy-storchaka · 2024-12-30T16:47:56Z

The behavior changed because the tokenize module now uses the C tokenizer, and it turned out that a lone CR is not always recognized as a newline separator. I tried to fix this today, but it is not so easy. To start with, the code is read line by line, but C's fgets() and Python's readline() read up to the LF byte (this works for CRLF newlines, but not for a lone CR).
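
The line-reading mismatch described here can be illustrated with a small sketch. It only shows how `readline()` on a byte stream treats a lone CR compared with `bytes.splitlines()`; it does not reproduce the C tokenizer's internals, and `read_lines` and the sample bytes are made up for the illustration.

```python
import io


def read_lines(raw):
    """Read a byte string line by line with readline(), which stops at LF.

    Hypothetical helper for illustration only.
    """
    stream = io.BytesIO(raw)
    return list(iter(stream.readline, b""))


data = b"a = 1\rb = 2\nc = 3\n"

# readline() does not treat the lone CR as a line boundary.
lf_lines = read_lines(data)
# -> [b"a = 1\rb = 2\n", b"c = 3\n"]

# splitlines() treats CR, LF, and CRLF all as line boundaries.
universal_lines = data.splitlines(keepends=True)
# -> [b"a = 1\r", b"b = 2\n", b"c = 3\n"]

if __name__ == "__main__":
    print(lf_lines)
    print(universal_lines)
```

A reader that hands the first form to the tokenizer passes along a "line" with a CR embedded in the middle, which is the situation the comment describes.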

tusharsadhwani added the type-bug label Dec 25, 2024

tusharsadhwani mentioned this issue Dec 25, 2024 in psf/black#3700 (open): Black crashes on files containing \r, from e.g. old MacOS

picnixz added the topic-parser label Dec 25, 2024

serhiy-storchaka self-assigned this Dec 30, 2024
