Test "Protocol desync regression test" in test_protocol.tcl not exactly checking the desync bug #1583

netanelrevah · 2025-01-18T22:40:16Z

this is the test:

if {$::tls} {
    set s [::tls::socket [srv 0 host] [srv 0 port]]
} else {
    set s [socket [srv 0 host] [srv 0 port]]
}
puts -nonewline $s $seq
set payload [string repeat A 1024]"\n"
set test_start [clock seconds]
set test_time_limit 30
while 1 {
    if {[catch {
        puts -nonewline $s payload
        flush $s
        incr payload_size [string length $payload]
    }]} {
        set retval [gets $s]
        close $s
        break
    } else {
        set elapsed [expr {[clock seconds]-$test_start}]
        if {$elapsed > $test_time_limit} {
            close $s
            error "assertion:Valkey did not closed connection after protocol desync"
        }
    }
}
set retval

The test writes a null byte ("\x00") to the server and then writes the "payload" string many times.

It then gets the error "-ERR Protocol error: too big inline request\r\n".

The same error occurs even if no \x00 character is sent at all.
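To make the observation above concrete, here is a minimal C model of an inline-request check. It is a sketch under stated assumptions, not the server's actual code: `inline_check`, `feed`, and `INLINE_MAX_SIZE` are hypothetical names, the limit is the 64K figure mentioned later in this thread, and the newline search is assumed to use `strchr`. It suggests why the literal word `payload` (what `puts -nonewline $s payload` sends) trips the size limit with or without a leading null byte, while a newline-terminated payload only trips it when the null byte hides the newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed limit: the 64K inline bound mentioned in this thread. */
#define INLINE_MAX_SIZE (64 * 1024)

typedef enum { NEED_MORE, LINE_FOUND, TOO_BIG } inline_status;

/* Hypothetical stand-in for the server-side check: search for '\n' with
 * strchr, and if none is visible while more than the limit is buffered,
 * report "too big". buf must hold a NUL terminator at buf[len]. */
static inline_status inline_check(const char *buf, size_t len) {
    assert(buf[len] == '\0');
    if (strchr(buf, '\n') != NULL) return LINE_FOUND;
    return len > INLINE_MAX_SIZE ? TOO_BIG : NEED_MORE;
}

/* Start with `prefix`, append `chunk` up to max_writes times, and re-check
 * after every write. Returns the first status other than NEED_MORE, or
 * NEED_MORE if the writes run out first. */
static inline_status feed(const char *prefix, size_t prefix_len,
                          const char *chunk, size_t max_writes) {
    static char buf[INLINE_MAX_SIZE + 2048];
    size_t chunk_len = strlen(chunk);
    size_t len = prefix_len;

    memcpy(buf, prefix, prefix_len);
    buf[len] = '\0';
    for (size_t i = 0; i < max_writes; i++) {
        if (len + chunk_len + 1 > sizeof(buf)) break; /* model buffer full */
        memcpy(buf + len, chunk, chunk_len);
        len += chunk_len;
        buf[len] = '\0';
        inline_status st = inline_check(buf, len);
        if (st != NEED_MORE) return st;
    }
    return NEED_MORE;
}
```

In this model, `feed("", 0, "payload", 20000)` and `feed("\0", 1, "payload", 20000)` both end in `TOO_BIG`, because the stream never contains a newline. With a chunk of 1024 `A`s followed by `\n`, the call without the null prefix returns `LINE_FOUND` on the first write, and only the null-prefixed call reaches `TOO_BIG`.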

We should choose one of:

fix the test
fix the \x00 behaviour so that it works as expected
remove the test

ranshid · 2025-01-19T07:46:29Z

I think the \x00 was only added for inline protocol testing and to simplify the test covering both bulk/array/inline cases.
Maybe this test would fit better in violations.tcl, but I am not sure we have another dedicated test for that.
What I do think is wrong with that test is that it is time-based and not size-bounded. I mean, for the inline case the server should have closed the connection after 64 iterations. Maybe we can focus on fixing that?
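The "64 iterations" figure can be checked with a little arithmetic. The sketch below assumes a 1-byte prefix (the "\x00"), 1024-byte writes, and a limit of 64K that is exceeded only when the buffered size is strictly greater than it. The helper name `writes_until_over_limit` is hypothetical, and the strict comparison is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Number of equal-sized writes after which the buffered byte count first
 * becomes strictly greater than `limit`, given `prefix_len` bytes that are
 * already buffered. Returns 0 if the prefix alone is over the limit. */
static size_t writes_until_over_limit(size_t prefix_len, size_t chunk_len,
                                      size_t limit) {
    assert(chunk_len > 0);
    if (prefix_len > limit) return 0;
    /* Smallest n such that prefix_len + n * chunk_len > limit. */
    return (limit - prefix_len) / chunk_len + 1;
}
```

Under these assumptions, a 1-byte prefix with 1024-byte writes crosses 64K on write 64, while with no prefix it takes 65 writes, since 64 * 1024 is exactly the limit and not above it. A size-bounded test could use such a count, plus some slack for writes that the server reads together, as its upper bound instead of a 30-second timer.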

netanelrevah · 2025-01-19T13:09:29Z

I think the \x00 was only added for inline protocol testing and to simplify the test covering both bulk/array/inline cases. Maybe this test would fit better in violations.tcl, but I am not sure we have another dedicated test for that. What I do think is wrong with that test is that it is time-based and not size-bounded. I mean, for the inline case the server should have closed the connection after 64 iterations. Maybe we can focus on fixing that?

I think that's the relevant issue: redis/redis#141 (comment)

The test is trying to check for a buffer overflow with long inline protocol data, but it has some coding problems: it sends "payload" instead of "$payload", it increments payload_size, which is not used anywhere afterwards, and it waits for any protocol error.

Anyway, I think it would be better to have one test that just checks the inline protocol error for long inline input, and to cover the buffer overflow prevention in the unit tests (the C ones in src/unit).

This test doesn't really guard against the desync bug; it just implicitly checks that there is a limit on inline input.

ranshid · 2025-01-19T18:23:28Z

AFAIU the test is mainly there to verify that the added protections for the multi-bulk and inline protocols work as expected. I do not think this is ONLY for inline protocol limit testing. Removing the test would basically mean we no longer have good code coverage. I do see good reasons for keeping the test, since we might also refactor some of the command-parsing code.

Anyway, I think it would be better to have one test that just checks the inline protocol error for long inline input, and to cover the buffer overflow prevention in the unit tests (the C ones in src/unit).

We could unit-test it, but then it would be harder to have good functional coverage since, for example, this parsing is also performed from io-threads. Currently the io-threads use the same code executed by the engine, but it is possible that some day they will have to diverge, and this test might help validate future changes.

The test is trying to check for a buffer overflow with long inline protocol data, but it has some coding problems: it sends "payload" instead of "$payload", it increments payload_size, which is not used anywhere afterwards, and it waits for any protocol error.

I totally agree the test needs some fixes:

puts -nonewline $s payload is a good catch. It does not mean the test does not work (right?); it only means there is a higher chance that the test will fail on timeout. Let's fix it.
verify payload_size - this is probably something we would like to check against, at least for the hardcoded 64K inline protocol size.

ranshid · 2025-01-20T08:32:10Z

@netanelrevah take a look at #1590 and tell me what you think.

netanelrevah · 2025-01-20T10:29:35Z

@netanelrevah take a look at #1590 and tell me what you think.

Yeah, this kind of fix is a better test of the inline max size. Thanks for fixing :)

The desync regression test was created as a regression test for the following bug: if a NULL terminator is embedded inside an inline/multi-bulk message, strchr cannot be used to identify the newline (\n) / carriage return (\r) in the client query buffer. This can, for example, cause a replica reading the primary stream to keep filling its query buffer endlessly, consuming more and more memory.

In order to handle the above risk, a check was added to verify that the inline bulk and multi-bulk sizes do not exceed 64K bytes in the query buffer. A test was placed in order to verify this.

This PR introduces the following fixes to the desync regression test:

1. Fix the sent payload to flush 1024-byte blocks of 'A's instead of 'payload', which was sent by mistake.
2. Make sure that the connection is correctly terminated on protocol error by the server after exceeding the 64K and not over 64K.
3. Add another test intrinsic which will also verify the nested bulk with embedded null termination (was not verified before).

fixes valkey-io#1583

NOTE: Although it is possible to change the use of strchr to a "safer" utility (e.g. memchr) which will not stop the scan at the first occurrence of '\0', we would still like to protect against excessive usage of the query buffer and also preserve the current behavior(?). We will look into improving this though in a followup issue.

---------

Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: ranshid <[email protected]>
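The strchr/memchr difference described in the note can be illustrated with a small, self-contained C sketch. The helper names `newline_offset_strchr` and `newline_offset_memchr` are hypothetical and exist only for this example; `strchr` and `memchr` themselves are the standard `<string.h>` functions. It is not taken from the server source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset of the first '\n' as strchr sees it, or -1 if none is found.
 * strchr treats the first NUL byte as the end of the string, so anything
 * after an embedded NUL is invisible to it. */
static long newline_offset_strchr(const char *buf) {
    assert(buf != NULL);
    const char *p = strchr(buf, '\n');
    return p ? (long)(p - buf) : -1;
}

/* Offset of the first '\n' within the first len bytes, or -1 if none.
 * memchr scans an explicit length, so NUL bytes are ordinary data. */
static long newline_offset_memchr(const char *buf, size_t len) {
    assert(buf != NULL);
    const char *p = memchr(buf, '\n', len);
    return p ? (long)(p - buf) : -1;
}
```

For a 7-byte buffer holding a NUL followed by `PING\r\n`, the `strchr` variant reports no newline, while the `memchr` variant finds it at offset 6. That matches the note's point that a length-aware scan would see past the embedded null, which is separate from the question of whether the 64K guard should stay.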

netanelrevah changed the title ~~Test "Protocol desync regression test" in test_protocol.tcl do nothing.~~ Test "Protocol desync regression test" in test_protocol.tcl not exactly checking the desync bug Jan 19, 2025

ranshid mentioned this issue Jan 20, 2025

Fix Protocol desync regression test #1590

Merged

ranshid self-assigned this Jan 20, 2025

ranshid closed this as completed in #1590 Jan 20, 2025

ranshid closed this as completed in dd92d07 Jan 20, 2025
