Standalone 9/N: Implement and test Sink and SinkCreator types #301
Conversation
Given that much of this code has already undergone review, I haven't delved into every file's details. However, one aspect that caught my attention—which I believe we should address—is the S3 sink's destructor being responsible for finalizing the upload. This approach could potentially cause problems down the line, so it might be wise to modify it before we merge this PR.
src/streaming/file.sink.cpp
Outdated
bool
zarr::FileSink::write(size_t offset, const uint8_t* data, size_t bytes_of_buf)
{
    EXPECT(data, "Null pointer: data");
I suggest guarding this check with an if statement on the byte count: when the number of bytes is zero, we'd expect the data pointer to be null, and that shouldn't be an error.
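A minimal sketch of that guard, assuming the zero-byte case should be a no-op:

// Only validate the pointer when there is data to write; a
// zero-length write with a null pointer becomes a no-op.
if (bytes_of_buf > 0) {
    EXPECT(data, "Null pointer: data");
    // ... seek and write as below ...
}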
    file_.seekp(offset);
    file_.write(reinterpret_cast<const char*>(data), bytes_of_buf);
    file_.flush();
Calling flush() after every write could introduce significant performance overhead, especially if the file is being written to frequently, as it forces a disk write operation every time. If this function is called multiple times, I suggest flushing when writing is complete.
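One way to do that, as a rough sketch (the close_() hook is hypothetical):

bool
zarr::FileSink::write(size_t offset, const uint8_t* data, size_t bytes_of_buf)
{
    file_.seekp(offset);
    file_.write(reinterpret_cast<const char*>(data), bytes_of_buf);
    return static_cast<bool>(file_); // no flush per write
}

void
zarr::FileSink::close_() // hypothetical finalization hook
{
    file_.flush(); // flush once, when writing is complete
}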
src/streaming/file.sink.hh
Outdated
#include "sink.hh"

#include <memory>
Is this header inclusion necessary?
src/streaming/file.sink.hh
Outdated
  public:
    explicit FileSink(std::string_view filename);

    bool write(size_t offset, const uint8_t* data, size_t bytes_of_buf) override;
I suggest using std::span instead of a pointer to the data and a separate size parameter, if possible. If this is applicable, you can modify the other instances in the file that follow the same pattern.
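A sketch of what the span-based signature could look like (C++20; the base class shown here is just for illustration):

#include <cstddef>
#include <cstdint>
#include <span>

class Sink
{
  public:
    virtual ~Sink() = default;

    // The span carries the pointer and the size together, so callers
    // can't pass a mismatched (data, size) pair.
    virtual bool write(size_t offset, std::span<const uint8_t> data) = 0;
};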
  public:
    S3Sink(std::string_view bucket_name,
           std::string_view object_key,
           std::shared_ptr<S3ConnectionPool> connection_pool);
If you only need access to the pool and can ensure the pool object remains alive as long as necessary, consider passing it using a raw pointer to avoid reference counting overhead. While I'm unsure of the intended use here—a shared pointer might be necessary for managing the object's lifetime—I want to emphasize that using raw pointers as function parameters isn't inherently wrong. In fact, in some cases, it should be encouraged.
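For illustration, a non-owning variant of the constructor might look like this (a sketch, assuming the caller guarantees the pool outlives every sink):

// Non-owning: no reference count is bumped on construction or copy.
S3Sink(std::string_view bucket_name,
       std::string_view object_key,
       S3ConnectionPool* connection_pool);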
src/streaming/s3.sink.cpp
Outdated
}

bool
zarr::S3Sink::write(size_t offset, const uint8_t* data, size_t bytes_of_data)
I believe hiding the distinction between single and multipart uploads is not a good design decision. These are two different operations that users should be aware of.
- Why not always use multipart upload and expect users to call finalize when they're done?
- If object upload is needed, I'd create two "writing" functions: one for object upload that uploads data immediately (throwing an error if it exceeds the part size), and another for writing multiple parts with the expectation that users will finalize the upload; see the sketch below.
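A sketch of that split, with hypothetical names:

class S3Sink
{
  public:
    // Uploads the data as a single object immediately; errors out if
    // the payload exceeds the maximum single-upload size.
    bool put_object(const uint8_t* data, size_t bytes_of_data);

    // Buffers data as one part of a multipart upload; the caller is
    // expected to call finalize() once all parts have been written.
    bool write_part(const uint8_t* data, size_t bytes_of_data);
    bool finalize();
};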
We can't always use multipart upload because there's a minimum size (5 MiB) for all but the last part in a multipart series, and our metadata is much smaller than that. The design is intended to treat S3 objects the same way we treat files on the filesystem, to the extent possible. There are two very important distinctions:
- We can't seek arbitrarily in an S3 object. This is usually fine, as data is always written contiguously.
- Once an S3 object is closed, we can't reopen it and append to it. So if we were to expose finalize() to users, we'd need to invalidate a Sink after it's been finalized, i.e., disallow writing to it (sketched below).
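That invalidation might look roughly like this (is_finalized_ is a hypothetical flag):

bool
zarr::S3Sink::write(size_t offset, const uint8_t* data, size_t bytes_of_data)
{
    // Hypothetical guard: once finalized, the sink refuses further writes.
    EXPECT(!is_finalized_, "Sink has been finalized; cannot write");
    // ... existing write path ...
}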
src/streaming/s3.sink.cpp
Outdated
zarr::S3Sink::~S3Sink()
{
    if (!is_multipart_upload_() && nbytes_buffered_ > 0) {
I think that having the destructor “finalize” the upload is actively bad:
- Using a destructor to finalize an operation can obscure the developer's intent. It's not obvious that the object destruction is performing an important operation like completing an upload. Typically, destructors are used for resource cleanup, not business logic.
- Destructors are called automatically when an object goes out of scope, but the exact timing of when that happens might not be obvious to users of the class. This can lead to unintended behavior if the user assumes that the finalization happens immediately when they're done uploading data.
- It's hard to handle errors effectively inside a destructor. If the finalize process fails (e.g., network issues), you can't easily throw exceptions or return an error code from a destructor in C++.
- If the object is destroyed due to an exception, you might be in a partially constructed state, and the destructor could be invoked without a proper finalization.
Finalizing an upload is an important operation, and the user of the class should be aware of when it happens. Providing an explicit finalize function ensures that the user consciously decides when to complete the upload, which is a more predictable and transparent approach.
You should avoid logging in the destructor too. Destructors are automatically called when an object goes out of scope or is explicitly deleted. This might happen during normal operation, but it can also occur during stack unwinding. If your destructor logs messages, it might run in contexts where logging infrastructure is no longer available or partially destroyed.
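A sketch of the explicit alternative (names hypothetical):

class S3Sink
{
  public:
    ~S3Sink() = default; // trivial: no business logic, no logging

    bool write(size_t offset, const uint8_t* data, size_t bytes_of_data);

    // Called exactly once when the caller is done writing; returns
    // false on failure so the caller can handle the error explicitly.
    bool finalize();
};

// Usage:
//   S3Sink sink(bucket, key, pool);
//   sink.write(0, data, n);
//   if (!sink.finalize()) { /* handle the failed upload */ }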
src/streaming/s3.sink.cpp
Outdated
        retval = true;
    } catch (const std::exception& exc) {
        LOG_ERROR("Error: %s", exc.what());
Is the log error "silent"? In other words, does LOG_ERROR allow execution to continue to the "cleanup" code without interruption? Given that this function isn't called explicitly, is this a good idea? Also, should we increment the bytes flushed and reset the bytes buffered even in the event of a failure?
I see your point. We need to at least return the connection, so I'll move

    nbytes_flushed_ = nbytes_buffered_;
    nbytes_buffered_ = 0;

into the try block.
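Roughly, so the counters only advance when the upload succeeds (a sketch following the snippets above):

bool retval = false;
try {
    // ... upload the buffered bytes ...
    nbytes_flushed_ = nbytes_buffered_; // only reached on success
    nbytes_buffered_ = 0;
    retval = true;
} catch (const std::exception& exc) {
    LOG_ERROR("Error: %s", exc.what());
}
// The connection is returned to the pool in either case.
connection_pool_->return_connection(std::move(connection));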
src/streaming/s3.sink.cpp
Outdated
bool
zarr::S3Sink::finalize_multipart_upload_()
{
    if (!is_multipart_upload_()) {
Logging a warning here seems appropriate. It's unclear why this function would be called without an active multipart upload.
    bool retval = connection->complete_multipart_object(
      bucket_name_, object_key_, upload_id_, parts_);

    connection_pool_->return_connection(std::move(connection));
I believe a safer design would implement reference counting for pool connections. This way, when a connection goes out of scope, we could automatically return it to the pool without requiring the user of the API to manually return it each time.
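A sketch of that idea as an RAII handle (the type names and the get_connection() accessor are assumptions; only return_connection appears in the code above):

// Returns the connection to the pool automatically when the handle
// goes out of scope, even on early return or exception.
class PooledConnection
{
  public:
    explicit PooledConnection(std::shared_ptr<S3ConnectionPool> pool)
      : pool_(std::move(pool))
      , connection_(pool_->get_connection()) // hypothetical accessor
    {
    }

    ~PooledConnection()
    {
        if (connection_) {
            pool_->return_connection(std::move(connection_));
        }
    }

    S3Connection* operator->() { return connection_.get(); }

  private:
    std::shared_ptr<S3ConnectionPool> pool_;
    std::unique_ptr<S3Connection> connection_;
};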
src/streaming/sink.creator.hh
Outdated
namespace zarr {
using Dimension = ZarrDimension_s;

class SinkCreator final
I previously raised this concern (and I think it was in the context of the SinkCreator too), but I believe we should avoid using final. If you disagree, let me know why. Generally, this is strongly discouraged unless there's a very good reason to use it.
👍
Depends on #300.