Reply offload #1457

alexander-shabanov · 2024-12-18T12:47:06Z

Overview

This PR introduces the ability to offload replies to I/O threads as described at #1353.

Key Changes

Added a capability to reply construction that allows regular replies to be interleaved with offloaded replies in client reply buffers
Extended write-to-client handlers to support offloaded replies
Added offloading of bulk replies when reply offload is enabled
Minor changes in cluster slot stats to support network-bytes-out for offloaded replies
Reply offload benefits performance regardless of object size only from a certain number of threads upward, so it is enabled only from that number of threads. The internal configuration min-io-threads-reply-offload-on was introduced to manage this threshold
Reply offload is even more efficient from a certain number of threads upward. The internal configuration min-io-threads-value-prefetch-off was introduced to manage this threshold

Note: When reply offload is disabled, the content and handling of client reply buffers remain as before this PR

Implementation Details

Reply construction:

The original _addReplyToBuffer and _addReplyProtoToList have been renamed to _addReplyPayloadToBuffer and _addReplyPayloadToList and extended to support different types of payloads: regular replies and offloaded replies.
The new _addReplyToBuffer and _addReplyProtoToList now call _addReplyPayloadToBuffer and _addReplyPayloadToList and are used for adding regular replies to client reply buffers.
Newly introduced _addBulkOffloadToBuffer and _addBulkOffloadToList are used for adding offloaded replies to client reply buffers.

Write-to-client infrastructure:

The writevToClient and _postWriteToClient functions have been significantly changed to support the reply offload capability.

Internal configuration:

min-io-threads-reply-offload-on - Minimum number of IO threads for enabling reply offload
min-io-threads-value-prefetch-off - Minimum number of IO threads for disabling value prefetch
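
A minimal sketch of how the two thresholds could gate behavior, assuming both are compared against the io-threads value (which counts the main thread). The struct offloadConfig and both function names are hypothetical and do not come from the PR; the actual comparison in the code may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the settings discussed above. */
typedef struct {
    int io_threads_num;                    /* io-threads (includes the main thread) */
    int min_io_threads_reply_offload_on;   /* min-io-threads-reply-offload-on */
    int min_io_threads_value_prefetch_off; /* min-io-threads-value-prefetch-off */
} offloadConfig;

/* Assumption: offload is on once the thread count reaches the minimum. */
static bool replyOffloadEnabled(const offloadConfig *cfg) {
    assert(cfg->io_threads_num >= 1);
    return cfg->io_threads_num >= cfg->min_io_threads_reply_offload_on;
}

/* Assumption: value prefetch stays on only below its own minimum. */
static bool valuePrefetchEnabled(const offloadConfig *cfg) {
    assert(cfg->io_threads_num >= 1);
    return cfg->io_threads_num < cfg->min_io_threads_value_prefetch_off;
}
```

The threshold values are left to the caller here because the description does not state the defaults.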

Testing

Existing unit and integration tests passed. Reply offload is enabled in tests run with the --io-threads flag
Added unit tests for reply offload functionality

Performance Tests

Note: io-threads 1 means only the main thread with no additional I/O threads, io-threads 2 means the main thread plus 1 I/O thread, and io-threads 9 means the main thread plus 8 I/O threads.

Performance Tests are conducted using:

3,000,000 keys
512 bytes object size
1000 clients

io-threads (including main thread)	No Offload	Reply Offload
7	1,160,000	1,160,000
8	1,150,000	1,280,000
9	1,150,000	1,330,000
10	N/A	1,380,000
11	N/A	1,420,000

madolson · 2024-12-18T16:39:13Z

valkey.conf

+# For use cases where command replies include Bulk strings (e.g. GET, MGET)
+# reply offload can be enabled to eliminate espensive memory access
+# and redundant data copy performed by main thread
+#
+# reply-offload yes


Do we expect there to be cases where tuning this variable makes sense? Generally we want to avoid configuration in Valkey to make it simple to operate. Can we make real-time decisions about offloading?

I'd prefer to avoid the config too. It's better to start with no config and, if it turns out we need it later, then we can add. The reverse is not possible because removing a config is a breaking change.

Please see the results of the performance tests in the PR description. Reply offload benefits performance if either the data size is large (e.g. 64 KB) or the number of I/O threads is big enough for small data sizes (e.g. starting from the io-threads 9 config for 512 bytes).

As eliminating the expensive memory access to obj->ptr by the main thread is the major component of the reply offload optimization, it is very challenging to provide a dynamic solution. Note: access to obj->ptr is required to know the object size. Besides this, assuming the object size is available somehow, it will be relatively challenging to calibrate IsReplyOffloadBeneficial(data_size, io_threads_num) to make it generic for any OS/CPU architecture/etc.

It looks like we should provide a config parameter in this case, and customers need to test their workloads with and without reply offload and decide whether to activate it.

Please share your opinions.

We have other places where we dynamically choose to take an action, like with lazy-free effort and freeing objects in background threads. For lazy-free, we try to guess how much work it will be to free, and if the effort is small we do it in the main thread anyways. It seems like we can have a similar heuristic here, if the object is large or we have enough I/O threads, we offload the items to be freed. That way end users don't have to tune it and it works well by default.

madolson · 2024-12-18T16:42:54Z

src/config.c

@@ -3206,6 +3206,7 @@ standardConfig static_configs[] = {
    createBoolConfig("cluster-slot-stats-enabled", NULL, MODIFIABLE_CONFIG, server.cluster_slot_stats_enabled, 0, NULL, NULL),
    createBoolConfig("hide-user-data-from-log", NULL, MODIFIABLE_CONFIG, server.hide_user_data_from_log, 1, NULL, NULL),
    createBoolConfig("import-mode", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, server.import_mode, 0, NULL, NULL),
+    createBoolConfig("reply-offload", NULL, MODIFIABLE_CONFIG, server.reply_offload_enabled, 0, NULL, NULL),


I guess also, why is the default off? IO threading is off by default, so it seems fine to allow this to be on by default.

Please see the comments above regarding avoiding the reply-offload config altogether. The answer to the question 'why is reply-offload off by default' is that it does not benefit performance in 100% of use cases.

src/networking.c

madolson · 2024-12-18T17:07:49Z

https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457

Means you are leaking some memory.

madolson

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

src/networking.c

madolson · 2024-12-18T17:03:12Z

src/networking.c

+        payloadHeader *header = (payloadHeader *)ptr;
+        ptr += sizeof(payloadHeader);
+
+        if (header->type == CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD) {
+            clusterSlotStatsAddNetworkBytesOutForSlot(header->slot, header->actual_len);
+
+            robj** obj_ptr = (robj**)ptr;


Code like this would benefit a lot from some helper methods, instead of constantly moving and recasting values. Something like,

robj **getValkeyObjectFromHeader(payloadHeader *header) { char *ptr = (char *)header; ptr += sizeof(payloadHeader); return (robj **)ptr; }

The suggested helper function does not address all the needs. As buffer can contain content like
header1ptr1ptr2ptr3header2plain_replyheader3ptr4ptr5 and it is more convenient to move ptr and objv accordingly.

I didn't understand your comment, it doesn't look like it rendered correctly?

it is more convenient to move

I agree, but it is much harder to read.
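
A minimal sketch of the trade-off discussed here, assuming a buffer laid out as header, pointers, header, plain bytes, header, pointers. The struct walkHeader is a simplified, hypothetical stand-in for payloadHeader (the real one also carries slot and actual_len), and appendPayload, readHeader and walkBuffer are illustrative names. It shows that small helpers can coexist with a moving position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PAYLOAD_PLAIN = 0, PAYLOAD_BULK_OFFLOAD = 1 }; /* hypothetical type tags */

/* Simplified, hypothetical header. */
typedef struct {
    size_t len;   /* payload bytes that follow this header */
    uint8_t type; /* PAYLOAD_PLAIN or PAYLOAD_BULK_OFFLOAD */
} walkHeader;

/* Write one header plus its payload at *pos and advance *pos. */
static void appendPayload(char *buf, size_t *pos, uint8_t type, const void *payload, size_t len) {
    walkHeader h = {.len = len, .type = type};
    memcpy(buf + *pos, &h, sizeof(h));
    memcpy(buf + *pos + sizeof(h), payload, len);
    *pos += sizeof(h) + len;
}

/* Helper: read the header at pos by value (memcpy avoids unaligned access). */
static walkHeader readHeader(const char *buf, size_t pos) {
    walkHeader h;
    memcpy(&h, buf + pos, sizeof(h));
    return h;
}

/* Walk buf[0..used): count offloaded object pointers and plain reply bytes. */
static void walkBuffer(const char *buf, size_t used, size_t *n_ptrs, size_t *plain_bytes) {
    size_t pos = 0;
    *n_ptrs = 0;
    *plain_bytes = 0;
    while (pos < used) {
        walkHeader h = readHeader(buf, pos);
        assert(pos + sizeof(h) + h.len <= used);
        if (h.type == PAYLOAD_BULK_OFFLOAD) {
            assert(h.len % sizeof(void *) == 0);
            *n_ptrs += h.len / sizeof(void *);
        } else {
            *plain_bytes += h.len;
        }
        pos += sizeof(h) + h.len; /* jump to the next header */
    }
}
```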

zuiderkwast · 2024-12-18T18:23:13Z

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes in practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

alexander-shabanov · 2024-12-19T12:30:29Z

https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457

Means you are leaking some memory.

Fixed

alexander-shabanov · 2024-12-20T15:35:21Z

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

From the tests and perf profiling it appears that the main performance improvement from this feature comes from eliminating the expensive memory access to obj->ptr by the main thread, and much less from eliminating the copy to reply buffers. Without I/O threads, the main thread still needs to access obj->ptr, and the writev flow is a bit slower (it requires additional preparation work) than the plain write flow. I will publish results of various tests with and without I/O threads and with different data sizes next week.

alexander-shabanov · 2024-12-20T15:41:24Z

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.
The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes in practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

Very good questions. I will publish results of various tests with and without I/O threads and with different data sizes next week. IMPORTANT NOTE: we can't switch on or off reply offload dynamically according to obj(string) size because the main optimization is to eliminate expensive memory access to obj->ptr by main thread (eliminating the copy is much less important).

zuiderkwast · 2024-12-20T17:17:12Z

we can't switch on or off reply offload dynamically according to obj(string) size because the main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

alexander-shabanov · 2024-12-23T14:18:48Z

we can't switch on or off reply offload dynamically according to obj(string) size because the main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

Published results of performance tests in the PR description

src/unit/test_networking.c

uriyage · 2024-12-24T12:14:03Z

src/cluster_slot_stats.c

+
+int clusterSlotStatsEnabled(void) {
+    return server.cluster_slot_stats_enabled && /* Config should be enabled. */
+           server.cluster_enabled;              /* Cluster mode should be enabled. */


These comments appear to be redundant.

These comments were retained from the original refactored code. They are indeed redundant. Will remove them.

src/io_threads.c

uriyage · 2024-12-24T13:12:13Z

src/networking.c

@@ -234,6 +266,9 @@ client *createClient(connection *conn) {
    c->commands_processed = 0;
    c->io_last_reply_block = NULL;
    c->io_last_bufpos = 0;
+    c->io_last_written_buf = NULL;


Can't we instead use a bit flag in the payload header to indicate if we are done? Since the main thread needs to iterate over the headers inside the buffer anyway to get the actual written bytes.

"the main thread needs to iterate over the headers inside the buffer anyway to get the actual written bytes" is not accurate. The main thread iterates over the headers inside the buffer ONLY ONCE when buffer should be released.

"use a bit flag in the payload header" approach will cause:

the main thread to iterate over the headers inside the buffer even when the buffer can't be fully released yet

the writer thread to iterate over the headers inside the buffer at the end of each write to detect/mark headers

The implemented io_last_written_buf/io_last_written_bufpos/io_last_written_data_len eliminates redundant iterations over headers inside buffers and is much simpler.

zuiderkwast · 2024-12-25T11:18:12Z

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

alexander-shabanov · 2024-12-26T07:38:45Z

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Note that 9 threads means main thread + 8 I/O threads. Why do we need to find a threshold? I still think it should be a config param, and customers should test their workloads and activate it or not accordingly.

ranshid · 2025-01-02T14:19:08Z

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.
This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.
We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Note that 9 threads means main thread + 8 I/O threads. Why do we need to find a threshold? I still think it should be a config param, and customers should test their workloads and activate it or not accordingly.

@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users almost never tune their caches at these levels, and it is also very problematic to tell users to learn their workload and formulate their own rules for when to enable this config. I also think there is some risk in introducing this degradation, so we should work to understand what other alternatives we have.
I can think of some:

Enable this feature only when the number of active IO-Threads is 8
Track the CPU consumption on the engine thread and dynamically enable the feature when the main engine is utilizing high CPU
Maybe we can find a way to tag that the reply object should be offloaded so that we will not get the memory access penalty
And I am sure there are more.

uriyage · 2025-01-02T15:55:54Z

src/unit/test_networking.c

+    /* Test 1:  Add bulk offloads to the reply list */
+
+    /* Fill c->buf almost completely */
+    size_t reply_len = c->buf_usable_size - 2 * sizeof(payloadHeader) - 4;


Could you add a comment explaining why -4 ?

uriyage · 2025-01-02T16:02:42Z

src/unit/test_networking.c

+    freeReplyOffloadClient(c);
+
+    return 0;
+}


Can we test releaseBufOffloads instead of directly calling decrRefCount?

uriyage · 2025-01-02T16:16:55Z

src/cluster_slot_stats.c

@@ -174,24 +179,14 @@ void clusterSlotStatsDecrNetworkBytesOutForReplication(long long len) {
 *    This type is not aggregated, to stay consistent with server.stat_net_output_bytes aggregation.
 * This function covers the internal propagation component. */
 void clusterSlotStatsAddNetworkBytesOutForShardedPubSubInternalPropagation(client *c, int slot) {
-    /* For a blocked client, c->slot could be pre-filled.


Not sure I understand why it was required before this PR.

uriyage · 2025-01-02T16:20:25Z

src/cluster_slot_stats.c

-static int canAddNetworkBytesOut(client *c) {
-    return server.cluster_slot_stats_enabled && server.cluster_enabled && c->slot != -1;
+static int canAddNetworkBytesOut(int slot) {
+    return clusterSlotStatsEnabled() && slot != -1;


Consider renaming it to canAddSlotStats and using it in both canAddCpuDuration and canAddNetworkBytesIn

applied suggestion

uriyage · 2025-01-02T16:24:54Z

src/io_threads.c

-        c->io_last_bufpos = ((clientReplyBlock *)listNodeValue(c->io_last_reply_block))->used;
+        clientReplyBlock *block = (clientReplyBlock *)listNodeValue(c->io_last_reply_block);
+        c->io_last_bufpos = block->used;
+        /* If reply offload enabled force new header */


Maybe we should indeed check if reply offload is enabled before writing NULL to the header? Checking a global is cheaper than write access to the block

uriyage · 2025-01-02T16:56:40Z

src/networking.c

@@ -374,6 +425,49 @@ void deleteCachedResponseClient(client *recording_client) {
 /* -----------------------------------------------------------------------------
 * Low level functions to add more data to output buffers.
 * -------------------------------------------------------------------------- */
+static inline void insertPayloadHeader(char *buf, size_t *bufpos, uint8_t type, size_t len, int slot, payloadHeader **last_header) {
+    /* Save the latest header */
+    *last_header = (payloadHeader *)(buf + *bufpos);


Maybe new_header is more clear?

applied suggestion

uriyage · 2025-01-02T17:22:20Z

src/networking.c

    c->bufpos += reply_len;
    /* We update the buffer peak after appending the reply to the buffer */
    if (c->buf_peak < (size_t)c->bufpos) c->buf_peak = (size_t)c->bufpos;
    return reply_len;
 }

+size_t _addReplyToBuffer(client *c, const char *s, size_t len) {


Can be static as well. same for _addBulkOffloadToBuffer

applied suggestion

uriyage · 2025-01-02T17:38:48Z

src/networking.c

-    struct iovec *iov = iov_arr;
-    ssize_t bufpos, iov_bytes_len = 0;
-    listNode *lastblock;
+    char prefixes[iovmax / 3 + 1][LONG_STR_SIZE + 3];


iovmax / 3 - use constant instead of magic number

Please add comments for this line to clarify what the prefixes are and why these dimensions were chosen

It would be better to find a way to avoid allocating this array on the stack in the common case where reply offload is disabled

added constants and explaining comments
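
A minimal sketch of why a divisor of 3 is plausible, assuming each offloaded bulk reply is sent as three iovec entries matching the RESP bulk string layout: the "$<len>\r\n" prefix, the value bytes, and the trailing "\r\n". This is an inference from the RESP format and the declarations in the hunk above, not a statement of what writevToClient does. PREFIX_BUF_SIZE and bulkToIov are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

#define PREFIX_BUF_SIZE 32 /* room for '$', a decimal length, "\r\n" and NUL */

/* Fill three iovec entries for one RESP bulk string.
 * prefix must stay alive until writev() is called. Returns entries used. */
static int bulkToIov(struct iovec *iov, char prefix[PREFIX_BUF_SIZE], const char *data, size_t len) {
    static char crlf[2] = {'\r', '\n'};
    int n = snprintf(prefix, PREFIX_BUF_SIZE, "$%zu\r\n", len);
    assert(n > 0 && n < PREFIX_BUF_SIZE);
    iov[0].iov_base = prefix;
    iov[0].iov_len = (size_t)n;
    iov[1].iov_base = (void *)data; /* points at the value itself, no copy */
    iov[1].iov_len = len;
    iov[2].iov_base = crlf;
    iov[2].iov_len = sizeof(crlf);
    return 3;
}
```

Under this assumption, an iovec array of iovmax entries can hold at most iovmax / 3 complete bulk replies, so that many prefix buffers suffice.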

uriyage · 2025-01-02T17:48:49Z

src/networking.c

+    char prefixes[iovmax / 3 + 1][LONG_STR_SIZE + 3];
+    char crlf[2] = {'\r', '\n'};
+    int bufcnt = 0;
+    bufWriteMetadata metadata[listLength(c->reply) + 1];


Add a comment explaining that the +1 is for the static buffer

added comment

uriyage · 2025-01-02T18:00:23Z

src/networking.c

+}
+
+static inline int updatePayloadHeader(payloadHeader *last_header, uint8_t type, size_t len, int slot) {
+    if (last_header->type == type && last_header->slot == slot) {


When using for example MGET with small values that are not offloaded and come from different slots, wouldn't this cause multiple plain headers, thus requiring writeV instead of a simple write? We should investigate if this causes any performance degradation

In CMD (cluster mode disabled), slot will always be equal to -1 (see if (!clusterSlotStatsEnabled(slot)) slot = -1; in upsertPayloadHeader).
In CME (cluster mode enabled), MGET keys must map to the same slot.
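
A minimal sketch of the merging rule in question, assuming consecutive payloads share a header only when both type and slot match, as the condition in the hunk above suggests. The struct mergeHeader, tryExtendHeader and countHeaders are hypothetical names, and the fixed 8-byte reply length is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical header. */
typedef struct {
    size_t len;
    int16_t slot;
    uint8_t type;
} mergeHeader;

/* Returns 1 if the payload was merged into last, 0 if a new header is needed. */
static int tryExtendHeader(mergeHeader *last, uint8_t type, size_t len, int slot) {
    if (last != NULL && last->type == type && last->slot == slot) {
        last->len += len;
        return 1;
    }
    return 0;
}

/* Count headers produced by n plain replies that carry the given slots. */
static int countHeaders(const int *slots, int n, int slot_stats_enabled) {
    assert(n >= 0);
    mergeHeader cur = {0};
    mergeHeader *last = NULL;
    int headers = 0;
    for (int i = 0; i < n; i++) {
        /* With slot stats off, every slot collapses to -1. */
        int slot = slot_stats_enabled ? slots[i] : -1;
        if (!tryExtendHeader(last, 0, 8, slot)) {
            cur = (mergeHeader){.len = 8, .slot = (int16_t)slot, .type = 0};
            last = &cur;
            headers++;
        }
    }
    return headers;
}
```

With slot stats disabled, or with all keys in one slot, the sketch yields a single plain header, which matches the answer given above.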

zuiderkwast · 2025-01-09T14:59:10Z

Why do we need to find a threshold? I still think it should be a config param, and customers should test their workloads and activate it or not accordingly.

Just like Ran and Madelyn, I don't want to introduce a config. There are many reasons for that. This is an optimization, not new functionality. Most users don't tweak configs like that. An optimization can change, but a config needs to be maintained and needs backward compatibility. The more configs we add, the harder it is for users to tune the combination of all configs, so we should do our best to find the best behavior automatically.

Second topic: There are three code paths worth considering.

1. IO threads >= N. To avoid a memory access, we don't want to check the string size. Offload to IO thread.
2. IO threads not active. Short string reply. The main thread copies the string to reply buffer.
3. IO threads not active. Long string reply. When the main thread is about to copy the string to the reply buffer, it can see that the length is >= M bytes long (size threshold) and switch to this feature to avoid copying the string.

Case 3 is a very powerful optimization for long strings. Some users store large data in strings.

IO threads disabled is also not a corner case in any way. It is common to run small instances without IO threads and to scale horizontally using more cluster nodes instead of vertically using threads.

Instead of configs, I think we should pick some safe constants for N and M to make sure we don't get a regression in any case. I suggest N = 9 and M = 16K.
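
A minimal sketch of the three paths with the suggested constants N = 9 and M = 16K. The names replyPath and chooseReplyPath are hypothetical, and treating any thread count below N like the "IO threads not active" case is an assumption made here for simplicity. The length is passed by pointer so that path 1 never reads it, which models skipping the memory access.

```c
#include <assert.h>
#include <stddef.h>

#define OFFLOAD_MIN_IO_THREADS 9    /* N, as suggested in the comment */
#define OFFLOAD_MIN_LEN (16 * 1024) /* M, as suggested in the comment */

typedef enum { REPLY_COPY, REPLY_OFFLOAD } replyPath;

/* value_len is dereferenced only on the paths that need the size. */
static replyPath chooseReplyPath(int io_threads, const size_t *value_len) {
    /* Path 1: enough threads, offload without looking at the value. */
    if (io_threads >= OFFLOAD_MIN_IO_THREADS) return REPLY_OFFLOAD;
    /* Paths 2 and 3: the main thread reads the length anyway. */
    assert(value_len != NULL);
    return *value_len >= OFFLOAD_MIN_LEN ? REPLY_OFFLOAD : REPLY_COPY;
}
```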

alexander-shabanov · 2025-01-13T10:33:17Z

Why do we need to find a threshold? I still think it should be a config param, and customers should test their workloads and activate it or not accordingly.

Just like Ran and Madelyn, I don't want to introduce a config. There are many reasons for that. This is an optimization, not new functionality. Most users don't tweak configs like that. An optimization can change, but a config needs to be maintained and needs backward compatibility. The more configs we add, the harder it is for users to tune the combination of all configs, so we should do our best to find the best behavior automatically.

Second topic: There are three code paths worth considering.
1. IO threads >= N. To avoid a memory access, we don't want to check the string size. Offload to IO thread.

2. IO threads not active. Short string reply. The main thread copies the string to reply buffer.

3. IO threads not active. Long string reply. When the main thread is about to copy the string to the reply buffer, it can see that the length is >= M bytes long (size threshold) and switch to this feature to avoid copying the string.
Case 3 is a very powerful optimization for long strings. Some users store large data in strings.

IO threads disabled is also not a corner case in any way. It is common to run small instances without IO threads and to scale horizontally using more cluster nodes instead of vertically using threads.

Instead of configs, I think we should pick some safe constants for N and M to make sure we don't get a regression in any case. I suggest N = 9 and M = 16K.

@zuiderkwast We discussed it and are going to propose in this PR to perform reply offload according to the static number of I/O threads (i.e. the io-threads config). The static number of I/O threads is preferable to the number of active I/O threads because it makes all the tests and troubleshooting of potential issues much more deterministic. I am working on the final code changes and tests.

The reply offload activation according to size of object is deferred to another PR.

The best solution would be to perform reply offload according to the actual main thread load, matching the condition where reply offload is beneficial. However, it requires much more complex research.

alexander-shabanov · 2025-01-13T10:35:05Z

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.
This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.
We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Note that 9 threads means main thread + 8 I/O threads. Why do we need to find a threshold? I still think it should be a config param, and customers should test their workloads and activate it or not accordingly.

@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users almost never tune their caches at these levels, and it is also very problematic to tell users to learn their workload and formulate their own rules for when to enable this config. I also think there is some risk in introducing this degradation, so we should work to understand what other alternatives we have. I can think of some:
1. Enable this feature only when the number of active IO-Threads is 8

2. Track the CPU consumption on the engine thread and dynamically enable the feature when the main engine is utilizing high CPU

3. Maybe we can find a way to tag that the reply object should be offloaded so that we will not get the memory access penalty
   And I am sure there are more.

@ranshid please see this reply

ranshid · 2025-01-13T11:25:13Z

Thank you @alexander-shabanov. I think we need to be careful with this PR as it is close to the release date. I think it is good that we focus on introducing a simple optimization now and later work on a better optimization for dynamic support or large strings when io-threads are disabled.

ranshid · 2025-01-13T18:12:40Z

@alexander-shabanov / @uriyage following the discussion with the maintainers let's try to answer some of the followup questions:

1. In the performance results we see degradation when using small values which is "mitigated" when the number of io-threads is increased and we get better results when we offload the reply when the number of io-threads is higher than N. I suppose this is explained since the io-threads are not prefetching the values, so when the number of io-threads is small they are the bottleneck and thus we get degraded performance while when the number of io-threads is high the bottleneck shifts back to the engine - can you please verify that?
2. It does seem the better solution is to make a dynamic decision based on the string size. IMO in order to better support it we would need to make use of the alternative solution (tagging the clientReplyBlock). I recall there were some concerns going with this option but would be happy if you can revive some benchmark results and bring more data.

zuiderkwast · 2025-01-13T18:20:27Z

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.
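
A minimal sketch of this idea, assuming one spare bit is available in the object header. The struct flaggedObj is a hypothetical stand-in for robj, and large_value, setStringValue and shouldOffload are illustrative names. The point is that the decision reads only the object header and never dereferences the value pointer.

```c
#include <assert.h>
#include <stddef.h>

#define LARGE_VALUE_THRESHOLD 1024 /* "say 1K" from the comment */

/* Hypothetical object header with a spare flag bit. */
typedef struct {
    unsigned type : 4;
    unsigned encoding : 4;
    unsigned large_value : 1; /* set when the string is longer than the threshold */
    int refcount;
    void *ptr;
} flaggedObj;

/* The flag is maintained wherever the value is set, when the length is known anyway. */
static void setStringValue(flaggedObj *o, char *s, size_t len) {
    assert(o != NULL);
    o->ptr = s;
    o->large_value = len > LARGE_VALUE_THRESHOLD;
}

/* Decision on the reply path: reads the header only, never o->ptr. */
static int shouldOffload(const flaggedObj *o) {
    return o->large_value;
}
```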

ranshid · 2025-01-14T05:50:19Z

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.

I agree it is a valid option, and I think we already raised it in some comments above, but I think it is more related to the way we could dynamically enable this feature per object and does not handle the degradation we observe for small objects. From the benchmark results you can see that even when there are no io-threads there is a slight degradation even though the engine does not access the string. IMO it is caused by the extra work to manage/access the headers.
I think we can either decide to allow this degradation (2.5% for the non-iothreads case which is not super bad...) or we can change the implementation to the alternative one (using reply block tagging). I think @alexander-shabanov has some insights about the issues with the alternative implementation which I would like to observe in order to take the decision.

alexander-shabanov · 2025-01-14T07:02:09Z

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.

This is the intention in "The reply offload activation according to size of object is deferred to another PR."

alexander-shabanov · 2025-01-14T07:23:33Z

@alexander-shabanov / @uriyage following the discussion with the maintainers let's try to answer some of the followup questions:

1. In the performance results we see degradation when using small values which is "mitigated" when the number of io-threads is increased and we get better results when we offload the reply when the number of io-threads is higher than N. I suppose this is explained since the io-threads are not prefetching the values, so when the number of io-threads is small they are the bottleneck and thus we get degraded performance while when the number of io-threads is high the bottleneck shifts back to the engine - can you please verify that?

2. It does seem the better solution is to make a dynamic decision based on the string size. IMO in order to better support it we would need to make use of the alternative solution (tagging the clientReplyBlock). I recall there were some concerns going with this option but would be happy if you can revive some benchmark results and bring more data.

@ranshid

We confirmed with a profiling tool that for small objects (e.g. 512 bytes) and a small number of I/O threads, the I/O threads are the bottleneck. The main reason is the memory access done by the I/O thread. In the code currently published in this PR, the I/O thread accesses:

c->buf
obj
obj->ptr

In an upcoming code optimization we eliminated the access to obj by the I/O thread. We will publish this change soon. It improved performance a bit: reply offload becomes beneficial starting from a slightly lower number of threads.

Note: We tried to apply prefetching on the I/O thread side but it did not help. We deferred deeper research into prefetching on the I/O thread side.

The solution based on clientReplyBlock was evaluated in an internal POC and found to be worse than the solution proposed in this PR. It does not reduce memory access, but it requires more memory allocations and adds an additional access to clientReplyBlock.
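
The memory accesses listed above can be illustrated with a small sketch of a reply buffer that interleaves regular and offloaded payloads. This is an assumption about one possible layout, not the PR's actual code: `skHeader`, `skBulkRef`, `skAppendPlain`, `skAppendBulkRef` and `skBuildIov` are hypothetical names. Here the main thread records the value pointer and length directly in the buffer, so the I/O thread builds its iovec array from the buffer alone and never dereferences obj.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

enum { SK_PLAIN = 0, SK_OFFLOAD = 1 }; /* hypothetical payload types */

typedef struct skHeader {
    uint8_t type; /* SK_PLAIN or SK_OFFLOAD */
    size_t len;   /* number of payload bytes that follow the header */
} skHeader;

typedef struct skBulkRef {
    const char *ptr; /* value bytes, captured by the main thread */
    size_t len;
} skBulkRef;

/* Append header + payload; returns the new used size, or 0 if it does not fit. */
static size_t skAppend(char *buf, size_t used, size_t cap, uint8_t type,
                       const void *data, size_t len) {
    skHeader h;
    memset(&h, 0, sizeof(h));
    h.type = type;
    h.len = len;
    if (used + sizeof(h) + len > cap) return 0;
    memcpy(buf + used, &h, sizeof(h));
    memcpy(buf + used + sizeof(h), data, len);
    return used + sizeof(h) + len;
}

/* Main thread: regular reply bytes are copied into the buffer. */
static size_t skAppendPlain(char *buf, size_t used, size_t cap,
                            const char *s, size_t len) {
    return skAppend(buf, used, cap, SK_PLAIN, s, len);
}

/* Main thread: an offloaded reply stores only a pointer and a length. */
static size_t skAppendBulkRef(char *buf, size_t used, size_t cap,
                              const char *ptr, size_t len) {
    skBulkRef r = {ptr, len};
    return skAppend(buf, used, cap, SK_OFFLOAD, &r, sizeof(r));
}

/* I/O thread: walk the buffer and fill iov[] for writev(); returns entries used. */
static int skBuildIov(const char *buf, size_t used, struct iovec *iov, int iovmax) {
    int n = 0;
    size_t pos = 0;
    while (pos < used && n < iovmax) {
        skHeader h;
        memcpy(&h, buf + pos, sizeof(h));
        pos += sizeof(h);
        if (h.type == SK_PLAIN) {
            iov[n].iov_base = (void *)(buf + pos);
            iov[n].iov_len = h.len;
        } else {
            skBulkRef r;
            memcpy(&r, buf + pos, sizeof(r));
            iov[n].iov_base = (void *)r.ptr;
            iov[n].iov_len = r.len;
        }
        n++;
        pos += h.len;
    }
    return n;
}
```

A real implementation would also have to keep each referenced value alive until the write completes and handle partial writes; neither is shown here.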

alexander-shabanov · 2025-01-14T07:28:19Z

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.

I agree it is a valid option and I think we already raised it in some comments above, but I think it relates more to how we could dynamically enable this feature per object and does not address the degradation we observe for small objects. From the benchmark results you can see that even when there are no io-threads there is a slight degradation, even though the engine does not access the string. IMO it is caused by the extra work to manage/access the headers. I think we can either decide to allow this degradation (2.5% for the non-iothreads case, which is not super bad...) or we can change the implementation to the alternative one (using reply block tagging). I think @alexander-shabanov has some insights about the issues with the alternative implementation, which I would like to see before making the decision.

Please see these replies:
#1457 (comment)
#1457 (comment)

Signed-off-by: Alexander Shabanov <[email protected]>

alexander-shabanov · 2025-01-16T11:21:02Z

@zuiderkwast @ranshid I uploaded a new revision addressing the main comment regarding reply offload being efficient only starting from a certain number of threads. I also updated the description with the code and config changes and the performance test numbers.
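
As a rough illustration of how the two internal configurations from the description could gate behavior, here is a minimal sketch. The struct, the function names and the use of `>=` and `<` are assumptions based only on the configuration descriptions ("minimum number of IO threads for enabling reply offload" and "for disabling value prefetch"); the PR's actual checks and its way of counting threads may differ.

```c
/* Hypothetical view of the relevant settings; field names mirror the config names. */
typedef struct skConfig {
    int io_threads_num;                    /* configured number of I/O threads */
    int min_io_threads_reply_offload_on;   /* min-io-threads-reply-offload-on */
    int min_io_threads_value_prefetch_off; /* min-io-threads-value-prefetch-off */
} skConfig;

/* Assumed rule: offload replies once the thread count reaches the minimum. */
static int skReplyOffloadEnabled(const skConfig *c) {
    return c->io_threads_num >= c->min_io_threads_reply_offload_on;
}

/* Assumed rule: keep value prefetch until the thread count reaches its minimum. */
static int skValuePrefetchEnabled(const skConfig *c) {
    return c->io_threads_num < c->min_io_threads_value_prefetch_off;
}
```

With these assumed rules there can be a range of thread counts where reply offload is on while value prefetch is still active.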

codecov · 2025-01-16T12:50:59Z

Codecov Report

Attention: Patch coverage is 60.06711% with 119 lines in your changes missing coverage. Please review.

Project coverage is 70.70%. Comparing base (921ba19) to head (2293a6a).
Report is 6 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/networking.c	59.62%	109 Missing ⚠️
src/io_threads.c	25.00%	6 Missing ⚠️
src/memory_prefetch.c	0.00%	2 Missing ⚠️
src/replication.c	60.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1457      +/-   ##
============================================
- Coverage     70.98%   70.70%   -0.29%     
============================================
  Files           120      121       +1     
  Lines         65095    65330     +235     
============================================
- Hits          46210    46190      -20     
- Misses        18885    19140     +255

Files with missing lines	Coverage Δ
src/cluster_slot_stats.c	`94.18% <100.00%> (-0.17%)`	⬇️
src/config.c	`78.39% <ø> (ø)`
src/server.h	`100.00% <ø> (ø)`
src/memory_prefetch.c	`3.05% <0.00%> (-0.08%)`	⬇️
src/replication.c	`87.39% <60.00%> (-0.11%)`	⬇️
src/io_threads.c	`7.45% <25.00%> (+0.51%)`	⬆️
src/networking.c	`85.52% <59.62%> (-3.27%)`	⬇️

... and 12 files with indirect coverage changes

Signed-off-by: Alexander Shabanov <[email protected]>

alexander-shabanov mentioned this pull request Dec 18, 2024

[NEW] Reply Offload #1353

Open

alexander-shabanov force-pushed the reply_offload branch from c2d1a60 to a0a156c Compare December 18, 2024 15:18

madolson reviewed Dec 18, 2024


madolson reviewed Dec 18, 2024


alexander-shabanov force-pushed the reply_offload branch from db824f4 to 04e41c1 Compare December 19, 2024 12:26

alexander-shabanov closed this Dec 19, 2024

alexander-shabanov reopened this Dec 19, 2024

alexander-shabanov force-pushed the reply_offload branch 2 times, most recently from ac7e1f5 to a40e72e Compare December 19, 2024 14:03

alexander-shabanov force-pushed the reply_offload branch from a40e72e to cff89de Compare December 24, 2024 07:25

uriyage reviewed Dec 24, 2024


alexander-shabanov force-pushed the reply_offload branch from cff89de to 72be14b Compare December 25, 2024 08:23

uriyage reviewed Jan 2, 2025


alexander-shabanov force-pushed the reply_offload branch from 72be14b to 6bf48aa Compare January 16, 2025 08:39

Reply offload

d840115

Signed-off-by: Alexander Shabanov <[email protected]>

alexander-shabanov force-pushed the reply_offload branch from 6bf48aa to cb5e97f Compare January 16, 2025 11:14

addressed PR comments

2293a6a

Signed-off-by: Alexander Shabanov <[email protected]>

alexander-shabanov force-pushed the reply_offload branch from cb5e97f to 2293a6a Compare January 20, 2025 07:30
