
Standalone 9/N: Implement and test Sink and SinkCreator types #301

Merged: 64 commits, Oct 2, 2024

Conversation

aliddell (Member):

Depends on #300.

@aliddell force-pushed the standalone-sequence-9 branch from 689f1fe to 87b318a on September 20, 2024, 21:05
@shlomnissan (Collaborator) left a comment:

Given that much of this code has already undergone review, I haven't delved into every file's details. However, one aspect that caught my attention—which I believe we should address—is the S3 sink's destructor being responsible for finalizing the upload. This approach could potentially cause problems down the line, so it might be wise to modify it before we merge this PR.

bool
zarr::FileSink::write(size_t offset, const uint8_t* data, size_t bytes_of_buf)
{
    EXPECT(data, "Null pointer: data");
Collaborator:

I suggest moving this check under the if statement: when the number of bytes is zero, a null data pointer is expected and shouldn't trip the assertion.
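The suggested reordering might look like the following sketch. Here `write_checked` is a hypothetical stand-in for `FileSink::write`, with the `EXPECT` macro replaced by a plain throw:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: only require a non-null pointer when there are
// actually bytes to write.
bool
write_checked(std::size_t offset, const std::uint8_t* data, std::size_t bytes_of_buf)
{
    if (bytes_of_buf == 0) {
        return true; // nothing to do; a null `data` is acceptable here
    }

    if (data == nullptr) {
        throw std::runtime_error("Null pointer: data");
    }

    // ... seek to `offset` and write `bytes_of_buf` bytes ...
    (void)offset;
    return true;
}
```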


    file_.seekp(offset);
    file_.write(reinterpret_cast<const char*>(data), bytes_of_buf);
    file_.flush();
Collaborator:

Calling flush() after every write could introduce significant performance overhead, especially if the file is written to frequently, as it forces a disk write operation every time. If this function is called multiple times, I suggest flushing once, when writing is complete.
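A sketch of the buffered alternative. `BufferedSink` is illustrative only (it accumulates into a string stream rather than a file): writes accumulate and a single flush happens when the caller closes the sink.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Illustrative sketch (not the real FileSink): buffer writes and flush once
// when the caller is done, instead of after every write.
class BufferedSink
{
  public:
    void write(const std::string& data)
    {
        stream_ << data; // no flush here
        ++writes_;
    }

    // A single flush when writing is complete.
    std::string close()
    {
        stream_.flush();
        return stream_.str();
    }

    std::size_t write_count() const { return writes_; }

  private:
    std::ostringstream stream_;
    std::size_t writes_ = 0;
};

std::string
demo_buffered_sink()
{
    BufferedSink sink;
    sink.write("abc");
    sink.write("def");
    return sink.close(); // flushed exactly once, here
}
```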


#include "sink.hh"

#include <memory>
Collaborator:

Is this header inclusion necessary?

  public:
    explicit FileSink(std::string_view filename);

    bool write(size_t offset, const uint8_t* data, size_t bytes_of_buf) override;
Collaborator:

I suggest using std::span instead of a pointer to the data and a separate size parameter if it’s possible. If this is applicable, you can modify other instances in the file that follow the same pattern.

  public:
    S3Sink(std::string_view bucket_name,
           std::string_view object_key,
           std::shared_ptr<S3ConnectionPool> connection_pool);
Collaborator:

If you only need access to the pool and can guarantee that the pool object remains alive as long as necessary, consider passing it as a raw pointer to avoid reference-counting overhead. I'm unsure of the intended use here (a shared pointer might be necessary for managing the object's lifetime), but raw pointers as function parameters aren't inherently wrong; in some cases they should be encouraged.
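As an illustration of the observing-pointer pattern being suggested. `ConnectionPool` and `Sink` here are simplified stand-ins, not the real S3ConnectionPool API:

```cpp
#include <string>

// Hypothetical types for illustration only.
struct ConnectionPool
{
    std::string name;
};

// The sink only observes the pool, so a raw (non-owning) pointer avoids
// shared_ptr reference-count traffic. The caller must keep the pool alive
// for the sink's whole lifetime.
class Sink
{
  public:
    explicit Sink(ConnectionPool* pool)
      : pool_(pool)
    {
    }

    const std::string& pool_name() const { return pool_->name; }

  private:
    ConnectionPool* pool_; // non-owning
};

std::string
demo_raw_pointer()
{
    ConnectionPool pool{ "s3-pool" }; // owned by the caller
    Sink sink(&pool);                 // the sink merely observes it
    return sink.pool_name();
}
```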

}

bool
zarr::S3Sink::write(size_t offset, const uint8_t* data, size_t bytes_of_data)
Collaborator:

I believe hiding the distinction between single and multipart uploads is not a good design decision. These are two different operations that users should be aware of.

  • Why not always use multipart upload and expect users to call finalize when they're done?
  • If object upload is needed, I'd create two "writing" functions: one for object upload that uploads data immediately (throwing an error if it exceeds the part size), and another for writing multi-parts with the expectation that users will finalize the upload.

aliddell (Member, Author) replied:

We can't always use multipart upload because there's a minimum size (5 MiB) for all but the last part in a multipart series. Our metadata is much smaller than that. The design is intended to treat S3 objects the same way we treat files on the filesystem, to the extent that that's possible. There are two very important distinctions:

  1. We can't seek arbitrarily in an S3 object. This is usually fine, as data is always getting written contiguously.
  2. Once an S3 object is closed, we can't reopen it and append to it. So if we were to expose finalize() to users, then we need to invalidate a Sink after it's been finalized, i.e., disallow users from writing to it.
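The invalidation described in point 2 could be sketched as follows. `FinalizableSink` is hypothetical and buffers to a string purely for illustration; the real S3Sink uploads to S3:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch: an explicit finalize() must invalidate the sink so
// that later writes fail loudly instead of silently going nowhere.
class FinalizableSink
{
  public:
    void write(const std::string& data)
    {
        if (finalized_) {
            throw std::runtime_error("Sink has been finalized");
        }
        buffer_ += data;
    }

    void finalize() { finalized_ = true; } // e.g., complete the upload here

  private:
    std::string buffer_;
    bool finalized_ = false;
};

bool
demo_write_after_finalize_throws()
{
    FinalizableSink sink;
    sink.write("chunk");
    sink.finalize();
    try {
        sink.write("late chunk"); // must be rejected
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```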


zarr::S3Sink::~S3Sink()
{
    if (!is_multipart_upload_() && nbytes_buffered_ > 0) {
Collaborator:

I think that having the destructor “finalize” the upload is actively bad:

  • Using a destructor to finalize an operation can obscure the developer's intent. It's not obvious that the object destruction is performing an important operation like completing an upload. Typically, destructors are used for resource cleanup, not business logic.
  • Destructors are called automatically when an object goes out of scope, but the exact timing of when that happens might not be obvious to users of the class. This can lead to unintended behavior if the user assumes that the finalization happens immediately when they're done uploading data.
  • It's hard to handle errors effectively inside a destructor. If the finalize process fails (e.g., network issues), you can't easily throw exceptions or return an error code from a destructor in C++.
  • If the object is destroyed due to an exception, you might be in a partially constructed state, and the destructor could be invoked without a proper finalization.

Finalizing an upload is an important operation, and the user of the class should be aware of when it happens. Providing an explicit finalize function ensures that the user consciously decides when to complete the upload, which is a more predictable and transparent approach.

You should avoid logging in the destructor too. Destructors are automatically called when an object goes out of scope or is explicitly deleted. This might happen during normal operation, but it can also occur during stack unwinding. If your destructor logs messages, it might run in contexts where logging infrastructure is no longer available or partially destroyed.


        retval = true;
    } catch (const std::exception& exc) {
        LOG_ERROR("Error: %s", exc.what());
Collaborator:

Is the log error "silent"? In other words, does LOG_ERROR allow execution to continue to the "cleanup" code without interruption? Given that this function isn't called explicitly, is this a good idea? Also, should we increment the bytes flushed and reset the bytes buffered even in the event of a failure?

aliddell (Member, Author) replied:

I see your point. We need to at least return the connection, so I'll move

    nbytes_flushed_ = nbytes_buffered_;
    nbytes_buffered_ = 0;

into the try block.
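The agreed restructuring can be sketched as follows. `FlushState` and `flush_part` are illustrative stand-ins for the S3Sink flush path: the counters update only inside the try block, while the connection is returned unconditionally afterward.

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative stand-ins: counters update only on success (inside the try
// block), while cleanup runs either way.
struct FlushState
{
    std::size_t nbytes_flushed = 0;
    std::size_t nbytes_buffered = 0;
    bool connection_returned = false;
};

bool
flush_part(FlushState& state, bool upload_succeeds)
{
    bool retval = false;
    try {
        if (!upload_succeeds) {
            throw std::runtime_error("simulated upload failure");
        }
        // Account for the bytes only once the upload has succeeded.
        state.nbytes_flushed = state.nbytes_buffered;
        state.nbytes_buffered = 0;
        retval = true;
    } catch (const std::exception&) {
        // log and fall through; the counters stay as they were
    }
    // Cleanup runs unconditionally: the connection goes back to the pool.
    state.connection_returned = true;
    return retval;
}

bool
demo_flush_success()
{
    FlushState s{ 0, 100, false };
    return flush_part(s, true) && s.nbytes_flushed == 100 &&
           s.nbytes_buffered == 0 && s.connection_returned;
}

bool
demo_flush_failure()
{
    FlushState s{ 0, 100, false };
    return !flush_part(s, false) && s.nbytes_flushed == 0 &&
           s.nbytes_buffered == 100 && s.connection_returned;
}
```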

bool
zarr::S3Sink::finalize_multipart_upload_()
{
    if (!is_multipart_upload_()) {
Collaborator:

Logging a warning here seems appropriate. It's unclear why this function would be called without an active multipart upload.

    bool retval = connection->complete_multipart_object(
      bucket_name_, object_key_, upload_id_, parts_);

    connection_pool_->return_connection(std::move(connection));
Collaborator:

I believe a safer design would implement reference counting for pool connections. This way, when a connection goes out of scope, we could automatically return it to the pool without requiring the user of the API to manually return it each time.
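One common way to get this behavior is a smart-pointer handle whose deleter returns the connection to the pool. The sketch below uses illustrative `Connection` and `Pool` types, not the real S3ConnectionPool API:

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Illustrative types only. Connections are handed out wrapped in a unique_ptr
// whose deleter returns them to the pool automatically, so callers can't
// forget to return them.
struct Connection
{
    int id;
};

class Pool
{
  public:
    using Handle = std::unique_ptr<Connection, std::function<void(Connection*)>>;

    explicit Pool(int id) { free_.push_back(Connection{ id }); }

    Handle acquire()
    {
        auto* conn = new Connection(free_.back());
        free_.pop_back();
        return Handle(conn, [this](Connection* c) {
            free_.push_back(*c); // returned on destruction, not by the caller
            delete c;
        });
    }

    std::size_t available() const { return free_.size(); }

  private:
    std::vector<Connection> free_;
};

std::size_t
demo_auto_return()
{
    Pool pool(1);
    {
        auto handle = pool.acquire(); // pool is empty while the handle lives
    }                                 // handle destroyed: connection returned
    return pool.available();
}
```

This assumes the pool outlives every handle it issues; a production version would need to account for the pool being destroyed first.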

namespace zarr {
using Dimension = ZarrDimension_s;

class SinkCreator final
Collaborator:

I previously raised this concern (and I think it was in the context of the SinkCreator too), but I believe we should avoid using final. If you disagree, let me know why. Generally, this is strongly discouraged unless there's a very good reason to use it.

@aliddell changed the base branch from standalone-sequence-8 to main on October 1, 2024, 13:30
@shlomnissan (Collaborator) left a comment:

👍

@aliddell merged commit c9746e1 into main on Oct 2, 2024. 3 checks passed.
@aliddell deleted the standalone-sequence-9 branch on October 2, 2024, 04:11.
@aliddell restored the standalone-sequence-9 branch on October 2, 2024, 17:05.
aliddell added a commit that referenced this pull request on Oct 2, 2024.