-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(core-clp): Add BoundedReader
to prevent out-of-bound reads in segmented input streams.
#624
Conversation
…an underlying stream
WalkthroughThe pull request introduces changes to the CLP project by adding a new class, Changes
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
components/core/src/clp/CheckpointReader.cpp (1)
10-12
: Clarify condition by using equality checkSince
m_cur_pos
is set to be at mostm_checkpoint
in line 5, the conditionm_cur_pos >= m_checkpoint
will only be true whenm_cur_pos == m_checkpoint
. For clarity, consider changing the condition to:if (m_cur_pos == m_checkpoint) { return ErrorCode_EndOfFile; }components/core/tests/test-CheckpointReader.cpp (1)
9-94
: Add tests for nullm_reader
scenariosCurrently, there are no tests covering the case where
CheckpointReader
is constructed with anullptr
form_reader
. Adding such tests would enhance coverage and ensure the class handles this scenario gracefully.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
components/core/CMakeLists.txt
(2 hunks)components/core/src/clp/CheckpointReader.cpp
(1 hunks)components/core/src/clp/CheckpointReader.hpp
(1 hunks)components/core/src/clp/StringReader.cpp
(1 hunks)components/core/tests/test-CheckpointReader.cpp
(1 hunks)
🔇 Additional comments (4)
components/core/src/clp/CheckpointReader.hpp (1)
16-74
: Overall class implementation is correct
The CheckpointReader
class correctly implements the necessary methods from ReaderInterface
and enforces the checkpoint limit as intended.
components/core/src/clp/StringReader.cpp (1)
44-47
: Enhanced error handling for seeking beyond the end
The added condition in try_seek_from_begin
properly handles attempts to seek beyond the end of the input string. It sets the position to the end of the string and returns ErrorCode_EndOfFile
, preventing undefined behaviour.
components/core/tests/test-CheckpointReader.cpp (1)
9-94
: Unit tests comprehensively cover CheckpointReader
functionality
The test cases thoroughly validate the behaviour of CheckpointReader
, including reading and seeking operations relative to the checkpoint and the end of the underlying stream.
components/core/CMakeLists.txt (1)
356-357
: New files correctly added to the build configuration
The source files CheckpointReader.cpp
, CheckpointReader.hpp
, and the test file test-CheckpointReader.cpp
have been appropriately included in the CMakeLists.txt
, ensuring they are part of the build and test processes.
Also applies to: 554-554
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked with Devin offline and there is a concrete example that explains the usecase of this class. Devin, please add it to the PR when you get a chance.
In general the PR makes sense to me. While there might be a more elegant design to achieve the same target, I feel we can go with the design in this PR given
1.the class design is simple and straight forward
2. We have a rather tight deadline , including other upcoming changes.
Left a few comments, and also we can rename the class to be BoundedReader. Unless @kirkrodrigues has other naming suggestions
Note this class has different purpose from BufferedFileReader. The BufferedFileReader is designed for user who knows a specific pos they will need to seek back, but this class is designed for user who knows a specific pos that they don't want to seek beyond.
@@ -41,6 +41,10 @@ ErrorCode StringReader::try_read(char* buf, size_t num_bytes_to_read, size_t& nu | |||
} | |||
|
|||
ErrorCode StringReader::try_seek_from_begin(size_t pos) { | |||
if (pos > input_string.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only for supporting a new test case?
What you intend to do makes sense to me, but It's bit annoying that the behavior of standard seek interface allows seeking beyond the file ending position, so I feel we need some justification when we decide to change the behavior.
@kirkrodrigues any comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would classify this as a bug I'm fixing in the StringReader class so that I can use it for tests. Every other reader we have will EOF if you seek past the end from what I've seen.
Actually, the way the rest of this class is written if you first seek past the end of the input buffer it will happily let you read data beyond the end of the buffer. I.e. it is very explicitly a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, The FileReader internally calls fseeko, which I believe would allow seeking beyond the end of file.
BufferedFileReader and BufferReader would return ErrorCode_Truncated, and won't update the pos at all if the pos is greater than the maximum length,
From the consistency point of view, maybe we should let it return ErrorCode_Truncated. But also wonder if the rest of your code is dependent on the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The implementations of ReaderInterface have inconsistent behaviour.
- Some (e.g.,
FileReader
) will rely on the lower implementation to return an error if the seek fails. - Some (e.g.
BufferReader
) will return an error if the seek is past the end, but they won't modifym_pos
. - Some (e.g.,
NetworkReader
) will return an error if the seek is past the end, and they will modifym_pos
. - Each of the above returns different error codes.
- Some (e.g.,
I think the practical implementation is for try_seek_from_begin
to:
- try to seek until
pos
- if that's past the end of the medium (file/buffer/etc.),
m_pos
should be updated to just past the last byte. - the method should return
ErrorCode_EndOfFile
.
The reason to update m_pos
even though we get to pos
is because for some implementations like FileReader, we can't easily check what the last byte is until we seek, and if we seek up to the end of the file, we may not be able to seek backwards to the original m_pos
.
If y'all agree, we should open a GH issue to refactor the existing implementations. And for this implementation, we implement the proposal above (basically what Devin's implemented---although on error, we're still updating m_cur_pos
, even tough the error may not be EOF?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, the standard behavior sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me.
Also went and updated BoundedReader to only update m_curr_pos
on error if that error is EOF.
I'm fine with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
components/core/src/clp/BoundedReader.hpp (2)
22-28
: Document Exceptions Thrown in ConstructorIn the constructor, exceptions are thrown if
m_reader
isnullptr
or ifm_curr_pos
exceedsm_bound
. It's advisable to document these exceptions clearly in the class interface to inform users of potential exceptions during object instantiation.
38-38
: Modifytry_get_pos
to Reflect Bounded PositionCurrently,
try_get_pos
delegates directly tom_reader->try_get_pos(pos)
. Consider modifying it to returnm_curr_pos
instead, ensuring that users receive the position within the bounded context, adhering to the boundary constraints enforced byBoundedReader
.Apply this diff to adjust the method:
-auto try_get_pos(size_t& pos) -> ErrorCode override { return m_reader->try_get_pos(pos); } +auto try_get_pos(size_t& pos) -> ErrorCode override { + pos = m_curr_pos; + return ErrorCode_Success; +}components/core/tests/test-BoundedReader.cpp (2)
22-33
: Clarify Test Section Name for AccuracyThe test section named "BoundedReader does not allow reads beyond end of underlying stream." may be misleading. The test actually verifies that reads are limited to the available data without causing errors when attempting to read beyond the stream's end. Consider renaming the section for clarity.
Apply this change:
-SECTION("BoundedReader does not allow reads beyond end of underlying stream.") { +SECTION("BoundedReader limits reads to available data when reading beyond stream end.") {
76-86
: Remove Unused Variables in Test SectionIn the test section "BoundedReader does not allow seeks beyond checkpoint.", the variables
buf
andnum_bytes_read
are declared but not used. Removing these unused variables will clean up the test code.Apply this diff:
- char buf[cTestStringLen]; - size_t num_bytes_read{};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
components/core/CMakeLists.txt
(2 hunks)components/core/src/clp/BoundedReader.cpp
(1 hunks)components/core/src/clp/BoundedReader.hpp
(1 hunks)components/core/tests/test-BoundedReader.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (3)
components/core/tests/test-BoundedReader.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/BoundedReader.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/BoundedReader.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (2)
components/core/src/clp/BoundedReader.cpp (2)
4-15
: Verify Error Handling in try_seek_from_begin
Method
The method try_seek_from_begin
adjusts next_pos
based on m_bound
and handles errors from the underlying reader. However, when ErrorCode_EndOfFile
is returned from m_reader->try_seek_from_begin
, m_curr_pos
is updated to next_pos
. Please verify that this behaviour correctly reflects the end-of-file condition and does not lead to inconsistencies in m_curr_pos
.
17-38
: Ensure Consistent Handling of Partial Reads in try_read
The try_read
method correctly limits num_bytes_to_read
to prevent reading beyond m_bound
. After the read operation, it handles end-of-file scenarios, especially when ErrorCode_EndOfFile
is returned with num_bytes_read
equal to zero. Please confirm that this logic aligns with the expected behaviour of the underlying reader, particularly in cases of partial reads.
Left another two comments for nit. otherwise the code looks good to me |
Co-authored-by: haiqi96 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. @kirkrodrigues do you want to do another round of review?
@gibber9809 Can you please also open an issue to track the proposed change for commonizing seek interface? |
Added issue #628. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the header + implementation files, didn't go through unit tests yet
auto | ||
try_read(char* buf, size_t num_bytes_to_read, size_t& num_bytes_read) -> ErrorCode override; | ||
|
||
auto try_read_to_delimiter(char delim, bool keep_delimiter, bool append, std::string& str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a doc string to explain why we override the default implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason is really just that BoundedReader can't delegate to a potentially more efficient implementation in the underlying reader (since it won't respect the bounds), and most code really shouldn't use this interface anyway since its a performance trap.
Can add a docstring saying as much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let's add a doc string
auto | ||
try_read(char* buf, size_t num_bytes_to_read, size_t& num_bytes_read) -> ErrorCode override; | ||
|
||
auto try_read_to_delimiter(char delim, bool keep_delimiter, bool append, std::string& str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since parameters are not used, shall we add [[maybe_unused]]
to silence clang-tidy warnings?
* @return ErrorCode_Success on success | ||
* @return ErrorCode_errno on failure | ||
*/ | ||
auto try_get_pos(size_t& pos) -> ErrorCode override { return m_reader->try_get_pos(pos); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add [[nodiscard]]
for any non-void returns
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
components/core/src/clp/BoundedReader.hpp (3)
12-13
: Correct grammatical error in the class description.In the comment explaining the purpose of
BoundedReader
, there's a missing preposition.Please modify the comment as follows to improve readability:
- * want to prevent a reader for an earlier segment consuming any bytes from a later segment. + * want to prevent a reader for an earlier segment from consuming any bytes from a later segment.
58-59
: Consistent formatting for function declaration.For better readability, place the return type and function name on the same line.
Modify the declaration as follows:
-auto -try_read(char* buf, size_t num_bytes_to_read, size_t& num_bytes_read) -> ErrorCode override; +auto try_read(char* buf, size_t num_bytes_to_read, size_t& num_bytes_read) -> ErrorCode override;
67-69
: Consider using smart pointers for member variables to manage ownership and lifetime.Using a raw pointer for
m_reader
does not enforce lifetime management, potentially leading to dangling references if the underlying reader is destroyed beforeBoundedReader
.Consider using
std::shared_ptr
orstd::unique_ptr
to manage the ownership:- ReaderInterface* m_reader{nullptr}; + std::shared_ptr<ReaderInterface> m_reader;This change would ensure that the underlying reader remains valid for the lifetime of
BoundedReader
.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
components/core/src/clp/BoundedReader.hpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/BoundedReader.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (5)
components/core/src/clp/BoundedReader.hpp (5)
4-4
: Include missing headers for size_t
and ErrorCode
.
The file is missing includes for cstddef
(for size_t
) and the header where ErrorCode
is defined. This could lead to compilation issues.
26-27
: Verify the boundary condition in the constructor.
Currently, the check uses m_curr_pos > m_bound
. Should the condition be m_curr_pos >= m_bound
to prevent the position from being equal to the bound?
Please confirm whether the equality case should be considered an error, ensuring that the reader does not start at the exact boundary position.
38-38
: Add [[nodiscard]]
attribute to non-void return functions.
To prevent unintended ignoring of return codes, consider adding [[nodiscard]]
to the try_get_pos
method.
47-47
: Add [[nodiscard]]
attribute to try_seek_from_begin
.
Since the method returns an ErrorCode
, adding [[nodiscard]]
encourages checking the return value.
61-64
: Add [[maybe_unused]]
to unused parameters to avoid warnings.
The parameters in try_read_to_delimiter
are unused, which may trigger compiler warnings.
Consider updating the method signature:
-auto try_read_to_delimiter(char delim, bool keep_delimiter, bool append, std::string& str)
+auto try_read_to_delimiter([[maybe_unused]] char delim, [[maybe_unused]] bool keep_delimiter, [[maybe_unused]] bool append, [[maybe_unused]] std::string& str)
Alternatively, you can omit parameter names if they are unused:
-auto try_read_to_delimiter(char delim, bool keep_delimiter, bool append, std::string& str)
+auto try_read_to_delimiter(char, bool, bool, std::string&) -> ErrorCode override {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good to me. Sorry that there are still a few suggestions for fixing clang-tidy warnings in the test file. Your IDE/clang-tidy run might also complain about REQUIRE
macro, but we can ignore that for now since we're planning to fix this by upgrading catch2 to a higher version.
#include "../src/clp/StringReader.hpp" | ||
|
||
TEST_CASE("Test Bounded Reader", "[BoundedReader]") { | ||
constexpr char cTestString[]{"0123456789"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use constexpr std::string_view
for const strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using constexpr char[] because StringReader takes std::string const& and I don't want to manually initialize an std::string every time I open a StringReader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This was to follow our guideline here: https://www.notion.so/yscope/WIP-Coding-Guidelines-9a308b847a5343958ba3cb97a850be66?pvs=4#13604e4d9e6b80b09a18d4f71a89f1c8
- From compiler's perspective using
constexpr char[]
andconstexpr std::string_view
won't make a difference: check here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know, I've read the coding guidelines, and I understand that both ways of doing it are functionally the same. What I'm saying is that I don't like the explicit std::string{} initialization that you have to do when passing string_view in this specific case since it wastes horizontal space and reads worse.
I'll make the change to avoid wasting time, but I think it makes the code less readable.
clp::StringReader string_reader; | ||
string_reader.open(cTestString); | ||
clp::BoundedReader bounded_reader{&string_reader, cTestStringLen + 1}; | ||
char buf[cTestStringLen + 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using std::array
to:
- Silence clang-tidy warnings
- Enforce initialization on the allocated array memory
Interesting. Yeah I didn't get any clang-tidy warnings (besides some incorrect ones) running from the command line likely because it was getting confused about the catch2 macro expansions. I'm guessing clion quietly does a lot of extra configuration to deal with this sort of thing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PR title, how about:
feat(core-clp): Add `BoundedReader` to prevent out-of-bound reads in segmented input streams.
BoundedReader
to prevent out-of-bound reads in segmented input streams.
Description
This PR adds a BoundedReader class that can help avoid backwards seeks when reading input streams segmented into several logical chunks. The BoundedReader is a ReaderInterface that prevents reading or seeking beyond a certain "bound" byte offset.
This is used in a follow-up PR to ensure that readers for different parts of a single-file-archive being streamed over the network do not accidentally read past the end of their section (this happens frequently with readers that buffer input beyond what the user has requested such as ZstdDecompressor).
For example consider an input stream divided into the following logical chunks:
| zstd stream 1 | header bytes | zstd stream 2|
If a ZstdDecompressor reading zstd stream 1 directly wraps that input stream it will almost certainly end up consuming the header bytes and some parts of zstd stream 2 while populating its internal buffer. In fact this sort of speculative buffering is required if we want to ZstdDecompressor to be performant. As a result, reading the header bytes and decompressing zstd stream 2 requires first seeking backwards in the original input stream.
BoundedReader addresses this problem by wrapping the input stream making it so that the ZstdDecompressor is unable to consume any bytes beyond the end of the logical section it belongs to, meaning that the following header and stream sections can always be read without backwards seeks.
This BoundedReader approach has some advantages over approaches that allow backwards seeking by buffering the input stream:
Validation performed
Summary by CodeRabbit
New Features
BoundedReader
class to manage reading limits within input streams.BoundedReader
functionalities, ensuring robust error handling and boundary checks.Bug Fixes
StringReader
class to prevent out-of-bounds seeking.