gh-90533: Implement BytesIO.peek() #30808

Open
wants to merge 32 commits into
base: main
from

Conversation

marcelm

@marcelm marcelm commented Jan 22, 2022

Closes gh-90533

@the-knights-who-say-ni

This comment was marked as outdated.

@cpython-cla-bot

cpython-cla-bot bot commented Apr 10, 2022

All commit authors signed the Contributor License Agreement.
CLA signed

@marcelm marcelm changed the title bpo-46375: Implement BytesIO.peek() gh-90533: Implement BytesIO.peek() Apr 11, 2022
@marcelm

This comment was marked as outdated.

@bedevere-bot

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@marcelm marcelm force-pushed the fix-issue-46375 branch 2 times, most recently from 6998be1 to be39ff2 Compare November 6, 2022 15:11
@marcelm
Author

marcelm commented Nov 9, 2022

@AlexWaygood You’ve been the only human to interact with this PR so far, do you possibly have any advice on how to move this forward?

@AlexWaygood AlexWaygood added the type-feature A feature request or enhancement label Nov 9, 2022
@AlexWaygood
Member

AlexWaygood commented Nov 9, 2022

Hi @marcelm — sorry for the delay in anybody looking at this. I haven't studied your PR in detail (or thought about whether the proposal is a good idea), but it looks well put together at first glance.

I'll try to take a look soon. I won't be able to review the C code, but I can comment on whether the proposal seems like a good idea, and I can review the Python implementation and the tests.

Note that if this proposal is accepted, it will also need:

  • Updates to the docs for BytesIO (Doc/library/io.rst)
  • An entry in "What's new in Python 3.12" (Doc/whatsnew/3.12.rst)

You can also add yourself to Misc/ACKS as part of this PR, and give yourself credit in the NEWS entry, if you like. (Neither is compulsory.)

@AlexWaygood AlexWaygood self-requested a review November 9, 2022 13:37
@marcelm
Author

marcelm commented Nov 9, 2022

Thank you! I pushed a documentation update and will add an entry to the What’s new document in case the PR is reviewed favorably.

@AlexWaygood
Member

I'll defer to @benjaminp's judgement on this one -- I'm really not the right reviewer for this, unfortunately :(

Please ping me again if you still haven't had a review in a few weeks.

@AlexWaygood AlexWaygood removed their request for review November 28, 2022 12:26
@marcelm
Author

marcelm commented Nov 28, 2022

Thank you! I appreciate that you took the time.

@awalgarg

awalgarg commented Jul 8, 2023

> Please ping me again if you still haven't had a review in a few weeks.

@AlexWaygood Hi! Been 8 months, no review so far :) How can we take this forward? I'm not the original author of the PR but I also have a use-case for this API and would like to see this change land, happy to help progress this.

@marcelm
Author

marcelm commented Jul 8, 2023

Thanks for your interest! I have updated the PR to fix the merge conflict and to reflect that it now needs to target 3.13.

@awalgarg

awalgarg commented Jul 8, 2023

@marcelm That was quick, thank you so much!

@pitrou
Member

pitrou commented Sep 29, 2023

> For a new function like BytesIO.peek(), would it make sense to do it?

Why not, but it would also deviate from the BufferedReader behaviour, which I think defeats the point of this proposal?

> The complicated part of a memoryview is that BytesIO can be read and written. How do you guarantee that the view is not going to change after peek() if a write() is done later?

You wouldn't need to guarantee anything IMHO.

@vstinner
Member

> You wouldn't need to guarantee anything IMHO.

I prefer that when I read data from the filesystem, the data doesn't change, even if I use it 10 seconds later, or 1 day later. So for my needs, apparently, bytes is the right type. It's less efficient, but it works as expected.

Maybe a new API is needed for memoryview, one with different semantics.

@pitrou
Member

pitrou commented Sep 29, 2023

That was just a random suggestion, no need to act on it here.

@marcelm
Author

marcelm commented Sep 29, 2023

I’ve now taken the liberty of changing the BufferedReader.peek() documentation as I suggested above. I think I’ve addressed all comments.

Doc/library/io.rst Outdated
@marcelm
Author

marcelm commented Oct 20, 2023

Hi, just a gentle reminder that this is ready from my side.

Member

@vstinner vstinner left a comment

I would prefer that peek(0) returns an empty bytes string.

At least one byte of data is returned if not at EOF.
Return an empty :class:`bytes` object at EOF.
If the size argument is less than one or larger than the number of available bytes,
a copy of the buffer from the current position until the end is returned.
Member

I can understand that size=-1 or size=None returns the whole content. But I'm surprised that size=0 neither returns an empty string nor raises an exception.

I suggest returning an empty string when peek(0) is called; that would be consistent with read(0).
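
For reference, the read(0) behaviour that this suggestion appeals to can be checked on any existing BytesIO. The snippet below is only an illustration using the current, released API; the variable names are arbitrary and it does not exercise the proposed peek().

```python
import io

stream = io.BytesIO(b"abcdef")
stream.seek(2)

# Requesting zero bytes yields an empty bytes object...
empty = stream.read(0)
# ...and leaves the stream position untouched.
position = stream.tell()
# The remaining data is still available afterwards.
rest = stream.read()

print(empty, position, rest)
```

A peek(0) that returns b"" would mirror the first of these three observations.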

Author

@marcelm marcelm Oct 25, 2023

Agreed, changed now.

This was originally for (perceived) consistency with BufferedReader.peek(), which does not return empty bytes objects for size=0. But then BufferedReader.peek() ignores the size anyway.

Lib/_pyio.py Outdated
# even if the size is greater than the buffer length or
# the position is beyond the end of the buffer
if size < 1:
    size = len(self._buffer) - self._pos
Member

This code looks kind of complicated, whereas you can just do:

if size < 1:
    return self._buffer[self._pos:]
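
For readers who want to experiment with the semantics under discussion without building the branch, a rough stand-in can be written as a subclass of the released io.BytesIO. This is a sketch only: the class name PeekableBytesIO is hypothetical, the default size of 1 is an assumption, and the choice that peek(0) returns b"" while a negative size returns the remainder is just one of the variants debated in this thread, not the PR's actual code.

```python
import io


class PeekableBytesIO(io.BytesIO):
    """Hypothetical helper; NOT the implementation proposed in this PR."""

    def peek(self, size=1):
        # Assumed semantics: 0 -> b"", negative -> everything up to EOF,
        # positive -> at most `size` bytes. The position never moves.
        if size == 0:
            return b""
        pos = self.tell()
        try:
            return self.read(size if size > 0 else -1)
        finally:
            # Restore the position so peek() has no visible side effect.
            self.seek(pos)


demo = PeekableBytesIO(b"hello world")
head = demo.peek(3)
pos_after_peek = demo.tell()
demo.read(6)
tail = demo.peek(100)      # asks for more than is left
nothing = demo.peek(0)
remainder = demo.peek(-1)
```

Because it goes through read() and seek(), this stand-in copies data exactly like read() does; it says nothing about how the C or _pyio implementations in the PR access the buffer.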

Author

Changed

with self.ioclass(buf) as memio:
    self.assertEqual(memio.tell(), 0)
    self.assertEqual(memio.peek(1), buf[:1])
    self.assertEqual(memio.peek(1), buf[:1])
Member

Can you add tests reading 3 and 5 bytes? It seems like most tests read 1 byte or everything.

Author

Absolutely, added now.

Member

@vstinner vstinner left a comment

LGTM. Thanks for multiple updates!

@erlend-aasland @pitrou: Are you ok that peek(0) returns an empty string? IMO it makes the io module more consistent between peek(0) and read(0). In a previous version, peek(1) returned 1 byte, whereas peek(0) returned all remaining bytes (full content).

@pitrou
Member

pitrou commented Oct 31, 2023

What does peek(0) do on other buffered input objects?

@marcelm
Author

marcelm commented Oct 31, 2023

io.BufferedReader.peek(0) returns the remaining buffer contents, but there’s nothing special about 0: It ignores the size argument and always returns the remaining buffer.

The following table lists all peek() methods I could find in the standard library. All delegate to io.BufferedReader.peek (except for io.BufferedReader.peek itself) and therefore have the same behavior:

| Method | Signature |
| --- | --- |
| io.BufferedReader.peek | peek(size=0, /) |
| bz2.BZ2File.peek | peek([n]) |
| lzma.LZMAFile.peek | peek(size=-1) |
| gzip.GzipFile.peek | peek(n) |
| http.client.HTTPResponse.peek | (undocumented) |

I’d still say it is ok for BytesIO.peek(0) to do the logical thing and return an empty bytes object since the behavior is fuzzily documented anyway at the moment ("The number of bytes returned may be less or more than requested."). I don’t see where that could lead to (additional) confusion. Also, returning a copy of the full remainder of the buffer could be expensive as BytesIO objects can potentially be large.
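
The BufferedReader behaviour described above can be observed with a small script. This is an illustrative check, assuming a BufferedReader wrapped around an in-memory BytesIO with the default buffer size; the exact number of bytes returned is an implementation detail, so the script prints it rather than promising a value.

```python
import io

raw = io.BytesIO(b"x" * 100)
reader = io.BufferedReader(raw)

# Ask for a single byte; the size argument is only a hint here.
peeked = reader.peek(1)

print(len(peeked))    # how many bytes peek() actually handed back
print(reader.tell())  # the logical position of the reader

# A subsequent read still starts at the beginning of the data.
first_five = reader.read(5)
```

On CPython the printed length is typically well above 1, which is the "more than requested" case mentioned in the documentation quote.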

@pitrou
Member

pitrou commented Nov 1, 2023

> I don’t see where that could lead to (additional) confusion. Also, returning a copy of the full remainder of the buffer could be expensive as BytesIO objects can potentially be large.

Well, people calling peek(0) currently may rely on the fact that it returns non-zero bytes. I agree with not copying the full buffer remainder, but you could return a smallish hardcoded number of bytes, such as, I don't know, 128.

@vstinner
Member

vstinner commented Nov 1, 2023

If you want (at least) 128 bytes, why not call peek(128)?

@marcelm
Author

marcelm commented Nov 1, 2023

> [...] you could return a smallish hardcoded number of bytes, such as, I don't know, 128.

Perhaps returning just a single byte would be best.

This would avoid giving the wrong impression that always 128 bytes are available when someone starts out with BytesIO.peek(0) and later switches to BufferedReader.peek(0). The latter can also return a single byte when the current position happens to be just before the end of the buffer.

I am starting to think that it would be good to let BytesIO.peek() ignore the size argument and just have it return the next byte from the current position onwards if not at EOF and b"" otherwise.

@pitrou
Member

pitrou commented Nov 1, 2023

> Perhaps returning just a single byte would be best.

peek() nominally returns whatever is available in the internal buffer. The problem, as you mentioned, is that for BytesIO this would entail copying the entire remainder of the buffer. Which is why I suggest a useful but small size.

@pitrou
Member

pitrou commented Nov 1, 2023

Really, it helps to think about actual uses of peek(), rather than theoretical arguments. The only example in the stdlib is here:

cpython/Lib/_pyio.py

Lines 523 to 531 in 937872e

if hasattr(self, "peek"):
    def nreadahead():
        readahead = self.peek(1)
        if not readahead:
            return 1
        n = (readahead.find(b"\n") + 1) or len(readahead)
        if size >= 0:
            n = min(n, size)
        return n

Reading ahead by 1 byte is completely pointless, which is why I suggest a small but non-trivial size.
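
To make the role of peek() in that helper concrete, here is a standalone adaptation that can be run outside of _pyio. It is a sketch: the function takes the stream and size as explicit parameters (in _pyio they come from the enclosing readline() call), and the sample data is made up.

```python
import io


def nreadahead(stream, size=-1):
    """Standalone adaptation of the quoted _pyio helper (illustrative only)."""
    readahead = stream.peek(1)
    if not readahead:
        # Nothing to look at: fall back to reading one byte at a time.
        return 1
    # Read up to and including the first visible newline; if none is
    # visible, consume everything that peek() returned.
    n = (readahead.find(b"\n") + 1) or len(readahead)
    if size >= 0:
        n = min(n, size)
    return n


reader = io.BufferedReader(io.BytesIO(b"first line\nsecond line\n"))
unlimited = nreadahead(reader)         # bounded by the first newline
limited = nreadahead(reader, size=4)   # additionally capped by size
print(unlimited, limited)
```

The more bytes peek() exposes, the more likely the helper is to see a newline and read a whole line in one step; with a one-byte peek it degenerates to byte-at-a-time reads, which is the concern raised above.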

@marcelm
Author

marcelm commented Nov 1, 2023

It’s not completely pointless: My use case is actually reading exactly one byte ahead. But yes, I got lucky because that single byte is sufficient to distinguish the two file formats.
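
A one-byte look-ahead of this kind might look like the sketch below. Everything in it is hypothetical: the function name sniff_format, the two marker bytes and the format labels are invented for illustration and are not taken from the use case mentioned above. It uses BufferedReader because that class already provides peek() today.

```python
import io


def sniff_format(stream):
    """Guess a (made-up) file format from the first byte without consuming it."""
    # peek() may return more than one byte, so keep only the first.
    first = stream.peek(1)[:1]
    if first == b">":
        return "format-a"
    if first == b"@":
        return "format-b"
    return "unknown"


stream = io.BufferedReader(io.BytesIO(b">header\npayload\n"))
detected = sniff_format(stream)
# The stream is untouched, so a parser can now read from the very start.
header = stream.readline()
print(detected, header)
```

With BytesIO.peek() available, the BufferedReader wrapper would no longer be needed when the data is already in memory.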

There is also basic_istream::peek in C++, which peeks one character ahead. And the C function ungetc guarantees only one character of pushback, which is equivalent. I am guessing these functions inspired BufferedReader.peek, which is possibly why the guarantees are the way they are.

That said, I’m totally fine with returning 128 bytes for peek(0) if both of you agree.

@vstinner
Member

vstinner commented Nov 3, 2023

> Well, people calling peek(0) currently may rely on the fact that it returns non-zero bytes. I agree with not copying the full buffer remainder, but you could return a smallish hardcoded number of bytes, such as, I don't know, 128.

128 sounds arbitrary. The io module has one constant: io.DEFAULT_BUFFER_SIZE. It's used by BufferedReader.peek().

@pitrou
Member

pitrou commented Nov 3, 2023

> 128 sounds arbitrary. The io module has one constant: io.DEFAULT_BUFFER_SIZE.

128 is arbitrary, and so is DEFAULT_BUFFER_SIZE. But the two constants have different goals.

@marcelm marcelm mannequin mentioned this pull request Feb 7, 2024
Labels
awaiting merge topic-IO type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

io.BytesIO does not have peek()
10 participants