
Standalone 9/N: Implement and test Sink and SinkCreator types #301

Merged: 64 commits, Oct 2, 2024

Conversation

aliddell (Member):

Depends on #300.

@aliddell force-pushed the standalone-sequence-9 branch from 689f1fe to 87b318a on September 20, 2024, 21:05
@shlomnissan (Collaborator) left a comment:

Given that much of this code has already undergone review, I haven't delved into every file's details. However, one aspect that caught my attention—which I believe we should address—is the S3 sink's destructor being responsible for finalizing the upload. This approach could potentially cause problems down the line, so it might be wise to modify it before we merge this PR.

bool
zarr::FileSink::write(size_t offset, const uint8_t* data, size_t bytes_of_buf)
{
    EXPECT(data, "Null pointer: data");
Collaborator:

I suggest moving this check under the if statement: when the number of bytes is zero, a null data pointer is expected and shouldn't trip the assertion.
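The suggested reordering might look like the following sketch. Here `write_checked` is a hypothetical stand-in for `FileSink::write`, with the `EXPECT` macro replaced by a plain throw:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: only require a non-null pointer when there are
// actually bytes to write.
bool
write_checked(std::size_t offset, const std::uint8_t* data, std::size_t bytes_of_buf)
{
    if (bytes_of_buf == 0) {
        return true; // nothing to do; a null `data` is acceptable here
    }

    if (data == nullptr) {
        throw std::runtime_error("Null pointer: data");
    }

    // ... seek to `offset` and write `bytes_of_buf` bytes ...
    (void)offset;
    return true;
}
```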


    file_.seekp(offset);
    file_.write(reinterpret_cast<const char*>(data), bytes_of_buf);
    file_.flush();
Collaborator:

Calling flush() after every write could introduce significant performance overhead, especially if the file is written to frequently, as it forces a disk write operation every time. If this function is called multiple times, I suggest flushing once, when writing is complete.
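A sketch of the buffered alternative. `BufferedSink` is illustrative only (it accumulates into a string stream rather than a file): writes accumulate and a single flush happens when the caller closes the sink.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Illustrative sketch (not the real FileSink): buffer writes and flush once
// when the caller is done, instead of after every write.
class BufferedSink
{
  public:
    void write(const std::string& data)
    {
        stream_ << data; // no flush here
        ++writes_;
    }

    // A single flush when writing is complete.
    std::string close()
    {
        stream_.flush();
        return stream_.str();
    }

    std::size_t write_count() const { return writes_; }

  private:
    std::ostringstream stream_;
    std::size_t writes_ = 0;
};

std::string
demo_buffered_sink()
{
    BufferedSink sink;
    sink.write("abc");
    sink.write("def");
    return sink.close(); // flushed exactly once, here
}
```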


#include "sink.hh"

#include <memory>
Collaborator:

Is this header inclusion necessary?

  public:
    explicit FileSink(std::string_view filename);

    bool write(size_t offset, const uint8_t* data, size_t bytes_of_buf) override;
Collaborator:

I suggest using std::span instead of a pointer to the data and a separate size parameter if it’s possible. If this is applicable, you can modify other instances in the file that follow the same pattern.

  public:
    S3Sink(std::string_view bucket_name,
           std::string_view object_key,
           std::shared_ptr<S3ConnectionPool> connection_pool);
Collaborator:

If you only need access to the pool and can guarantee that the pool object remains alive as long as necessary, consider passing it as a raw pointer to avoid reference-counting overhead. I'm unsure of the intended use here (a shared pointer might be necessary for managing the object's lifetime), but raw pointers as function parameters aren't inherently wrong; in some cases they should be encouraged.
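As an illustration of the observing-pointer pattern being suggested. `ConnectionPool` and `Sink` here are simplified stand-ins, not the real S3ConnectionPool API:

```cpp
#include <string>

// Hypothetical types for illustration only.
struct ConnectionPool
{
    std::string name;
};

// The sink only observes the pool, so a raw (non-owning) pointer avoids
// shared_ptr reference-count traffic. The caller must keep the pool alive
// for the sink's whole lifetime.
class Sink
{
  public:
    explicit Sink(ConnectionPool* pool)
      : pool_(pool)
    {
    }

    const std::string& pool_name() const { return pool_->name; }

  private:
    ConnectionPool* pool_; // non-owning
};

std::string
demo_raw_pointer()
{
    ConnectionPool pool{ "s3-pool" }; // owned by the caller
    Sink sink(&pool);                 // the sink merely observes it
    return sink.pool_name();
}
```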

}

bool
zarr::S3Sink::write(size_t offset, const uint8_t* data, size_t bytes_of_data)
Collaborator:

I believe hiding the distinction between single and multipart uploads is not a good design decision. These are two different operations that users should be aware of.

  • Why not always use multipart upload and expect users to call finalize when they're done?
  • If object upload is needed, I'd create two "writing" functions: one for object upload that uploads data immediately (throwing an error if it exceeds the part size), and another for writing multi-parts with the expectation that users will finalize the upload.

aliddell (Member, Author) replied:

We can't always use multipart upload because there's a minimum size (5 MiB) for all but the last part in a multipart series. Our metadata is much smaller than that. The design is intended to treat S3 objects the same way we treat files on the filesystem, to the extent that that's possible. There are two very important distinctions:

  1. We can't seek arbitrarily in an S3 object. This is usually fine, as data is always getting written contiguously.
  2. Once an S3 object is closed, we can't reopen it and append to it. So if we were to expose finalize() to users, then we need to invalidate a Sink after it's been finalized, i.e., disallow users from writing to it.
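The invalidation described in point 2 could be sketched as follows. `FinalizableSink` is hypothetical and buffers to a string purely for illustration; the real S3Sink uploads to S3:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch: an explicit finalize() must invalidate the sink so
// that later writes fail loudly instead of silently going nowhere.
class FinalizableSink
{
  public:
    void write(const std::string& data)
    {
        if (finalized_) {
            throw std::runtime_error("Sink has been finalized");
        }
        buffer_ += data;
    }

    void finalize() { finalized_ = true; } // e.g., complete the upload here

  private:
    std::string buffer_;
    bool finalized_ = false;
};

bool
demo_write_after_finalize_throws()
{
    FinalizableSink sink;
    sink.write("chunk");
    sink.finalize();
    try {
        sink.write("late chunk"); // must be rejected
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```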


zarr::S3Sink::~S3Sink()
{
    if (!is_multipart_upload_() && nbytes_buffered_ > 0) {
Collaborator:

I think that having the destructor “finalize” the upload is actively bad:

  • Using a destructor to finalize an operation can obscure the developer's intent. It's not obvious that the object destruction is performing an important operation like completing an upload. Typically, destructors are used for resource cleanup, not business logic.
  • Destructors are called automatically when an object goes out of scope, but the exact timing of when that happens might not be obvious to users of the class. This can lead to unintended behavior if the user assumes that the finalization happens immediately when they're done uploading data.
  • It's hard to handle errors effectively inside a destructor. If the finalize process fails (e.g., network issues), you can't easily throw exceptions or return an error code from a destructor in C++.
  • If the object is destroyed due to an exception, you might be in a partially constructed state, and the destructor could be invoked without a proper finalization.

Finalizing an upload is an important operation, and the user of the class should be aware of when it happens. Providing an explicit finalize function ensures that the user consciously decides when to complete the upload, which is a more predictable and transparent approach.

You should avoid logging in the destructor too. Destructors are automatically called when an object goes out of scope or is explicitly deleted. This might happen during normal operation, but it can also occur during stack unwinding. If your destructor logs messages, it might run in contexts where logging infrastructure is no longer available or partially destroyed.


        retval = true;
    } catch (const std::exception& exc) {
        LOG_ERROR("Error: %s", exc.what());
Collaborator:

Is the log error "silent"? In other words, does LOG_ERROR allow execution to continue to the "cleanup" code without interruption? Given that this function isn't called explicitly, is this a good idea? Also, should we increment the bytes flushed and reset the bytes buffered even in the event of a failure?

aliddell (Member, Author) replied:

I see your point. We need to at least return the connection, so I'll move

    nbytes_flushed_ = nbytes_buffered_;
    nbytes_buffered_ = 0;

into the try block.
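The agreed restructuring can be sketched as follows. `FlushState` and `flush_part` are illustrative stand-ins for the S3Sink flush path: the counters update only inside the try block, while the connection is returned unconditionally afterward.

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative stand-ins: counters update only on success (inside the try
// block), while cleanup runs either way.
struct FlushState
{
    std::size_t nbytes_flushed = 0;
    std::size_t nbytes_buffered = 0;
    bool connection_returned = false;
};

bool
flush_part(FlushState& state, bool upload_succeeds)
{
    bool retval = false;
    try {
        if (!upload_succeeds) {
            throw std::runtime_error("simulated upload failure");
        }
        // Account for the bytes only once the upload has succeeded.
        state.nbytes_flushed = state.nbytes_buffered;
        state.nbytes_buffered = 0;
        retval = true;
    } catch (const std::exception&) {
        // log and fall through; the counters stay as they were
    }
    // Cleanup runs unconditionally: the connection goes back to the pool.
    state.connection_returned = true;
    return retval;
}

bool
demo_flush_success()
{
    FlushState s{ 0, 100, false };
    return flush_part(s, true) && s.nbytes_flushed == 100 &&
           s.nbytes_buffered == 0 && s.connection_returned;
}

bool
demo_flush_failure()
{
    FlushState s{ 0, 100, false };
    return !flush_part(s, false) && s.nbytes_flushed == 0 &&
           s.nbytes_buffered == 100 && s.connection_returned;
}
```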

bool
zarr::S3Sink::finalize_multipart_upload_()
{
    if (!is_multipart_upload_()) {
Collaborator:

Logging a warning here seems appropriate. It's unclear why this function would be called without an active multipart upload.

    bool retval = connection->complete_multipart_object(
      bucket_name_, object_key_, upload_id_, parts_);

    connection_pool_->return_connection(std::move(connection));
Collaborator:

I believe a safer design would implement reference counting for pool connections. This way, when a connection goes out of scope, we could automatically return it to the pool without requiring the user of the API to manually return it each time.
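One common way to get this behavior is a smart-pointer handle whose deleter returns the connection to the pool. The sketch below uses illustrative `Connection` and `Pool` types, not the real S3ConnectionPool API:

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Illustrative types only. Connections are handed out wrapped in a unique_ptr
// whose deleter returns them to the pool automatically, so callers can't
// forget to return them.
struct Connection
{
    int id;
};

class Pool
{
  public:
    using Handle = std::unique_ptr<Connection, std::function<void(Connection*)>>;

    explicit Pool(int id) { free_.push_back(Connection{ id }); }

    Handle acquire()
    {
        auto* conn = new Connection(free_.back());
        free_.pop_back();
        return Handle(conn, [this](Connection* c) {
            free_.push_back(*c); // returned on destruction, not by the caller
            delete c;
        });
    }

    std::size_t available() const { return free_.size(); }

  private:
    std::vector<Connection> free_;
};

std::size_t
demo_auto_return()
{
    Pool pool(1);
    {
        auto handle = pool.acquire(); // pool is empty while the handle lives
    }                                 // handle destroyed: connection returned
    return pool.available();
}
```

This assumes the pool outlives every handle it issues; a production version would need to account for the pool being destroyed first.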

namespace zarr {
using Dimension = ZarrDimension_s;

class SinkCreator final
Collaborator:

I previously raised this concern (and I think it was in the context of the SinkCreator too), but I believe we should avoid using final. If you disagree, let me know why. Generally, this is strongly discouraged unless there's a very good reason to use it.

@aliddell changed the base branch from standalone-sequence-8 to main on October 1, 2024, 13:30
@shlomnissan (Collaborator) left a comment:

👍

@aliddell merged commit c9746e1 into main on Oct 2, 2024. 3 checks passed.
@aliddell deleted the standalone-sequence-9 branch on October 2, 2024, 04:11.
@aliddell restored the standalone-sequence-9 branch on October 2, 2024, 17:05.
aliddell added a commit that referenced this pull request on Oct 2, 2024.