Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test "Protocol desync regression test" in test_protocol.tcl not exactly checking the desync bug #1583

Closed
netanelrevah opened this issue Jan 18, 2025 · 5 comments · Fixed by #1590
Assignees

Comments

@netanelrevah
Copy link

this is the test:

if {$::tls} {
                set s [::tls::socket [srv 0 host] [srv 0 port]]
            } else {
                set s [socket [srv 0 host] [srv 0 port]]
            }
            puts -nonewline $s $seq
            set payload [string repeat A 1024]"\n"
            set test_start [clock seconds]
            set test_time_limit 30
            while 1 {
                if {[catch {
                    puts -nonewline $s payload
                    flush $s
                    incr payload_size [string length $payload]
                }]} {
                    set retval [gets $s]
                    close $s
                    break
                } else {
                    set elapsed [expr {[clock seconds]-$test_start}]
                    if {$elapsed > $test_time_limit} {
                        close $s
                        error "assertion:Valkey did not closed connection after protocol desync"
                    }
                }
            }
            set retval

the test write null "\x00" to the server and then write a lot of "payload" string.

then get the error "-ERR Protocol error: too big inline request\r\n"

it will also fail the same without sending any of the \x00 character.

we should choose from:

  1. fix the test
  2. fix the \x00 behaviour as excpected
  3. remove the test
@ranshid
Copy link
Member

ranshid commented Jan 19, 2025

I think the \x00 was only added for inline protocol testing and to simplify the test covering both bulk/array/inline cases.
Maybe this test would be better fit into violations.tcl, but I am not sure we have another dedicated test for that.
What I do think is wrong with that test, is that is is time based and not size bounded. I mean for the inline case the server should have close the connection after 64 iterations. Maybe we can focus on fixing that?

@netanelrevah
Copy link
Author

I think the \x00 was only added for inline protocol testing and to simplify the test covering both bulk/array/inline cases. Maybe this test would be better fit into violations.tcl, but I am not sure we have another dedicated test for that. What I do think is wrong with that test, is that is is time based and not size bounded. I mean for the inline case the server should have close the connection after 64 iterations. Maybe we can focus on fixing that?

i think that's the relevant issue: redis/redis#141 (comment)

the test tryingcheck that the buffer overflow with long inline protocol data but it has some coding problems (send "payload" instead of "$payload", incr payload_size which no used in any place after, and wait for any protocol error)

anyway, i think having test that just checking the inline protocol error for long inline input, and having test for prevent buffer overflow in the unit test (src/unit ones in c) can do better.

this test here doesn't really guard the desync bug, it's just imlicitly checking that there is limit for inline input.

@netanelrevah netanelrevah changed the title Test "Protocol desync regression test" in test_protocol.tcl do nothing. Test "Protocol desync regression test" in test_protocol.tcl not exactly checking the desync bug Jan 19, 2025
@ranshid
Copy link
Member

ranshid commented Jan 19, 2025

AFAIU the test is mainly in order to verify the added protection for multi-bulk and inline protocol are working as expected. I do not think this is ONLY for inline protocol limit testing. Removing the test would basically mean we do not have a good code coverage. I do find some good reasoning behind keeping the test since we might also refactor some of the code for the command parsing.

anyway, i think having test that just checking the inline protocol error for long inline input, and having test for prevent buffer overflow in the unit test (src/unit ones in c) can do better.

We might have unitested it but it would become harder to have a good functional test since, for example, this is also performed from io-threads. Currently the io-threads are using the same code executed by the engine, but it is possible that some day in the future they will have to diverge and this test might help with future changes validations.

the test tryingcheck that the buffer overflow with long inline protocol data but it has some coding problems (send "payload" instead of "$payload", incr payload_size which no used in any place after, and wait for any protocol error)

I totaly agree the test have some fixes needed:

  1. puts -nonewline $s payload is a good catch. it does not mean the test does not work (right?) it only means that there is a higher chance that the test will fail on timeout. lets fix it.
  2. verify payload_size - this is probably something we would like to check against. at least for the 64K hardcoded inline protocol size.

@ranshid
Copy link
Member

ranshid commented Jan 20, 2025

@netanelrevah take a look at #1590 and tell me what you think.

@netanelrevah
Copy link
Author

@netanelrevah take a look at #1590 and tell me what you think.

yeah, this kind of fix is better to test the inline max size. thanks for fixing :)

@ranshid ranshid self-assigned this Jan 20, 2025
kronwerk pushed a commit to kronwerk/valkey that referenced this issue Jan 27, 2025
The desync regression test was created as a regression test for the
following bug:
in case we embed NULL termination inside inline/multi-bulk message we
will not be able to perform strchr in order to
identify the newline(\n)/carriage-return(\r) in the client query buffer.
this can influence (for example) replica reading primary stream and keep
filling it's query buffer endlessly consuming more and more memory.

In order to handle the above risk, a check was added to verify the
inline bulk and multi-bulk size are not exceeding the 64K bytes in the
query-buffer. A test was placed in order to verify this.

This PR introduce the following fixes to the desync regression test:
1. fix the sent payload to flush 1024 bytes block of 'A's instead of
'payload' which was sent by mistake.
2. Make sure that the connection is correctly terminated on protocol
error by the server after exceeding the 64K and not over 64K.
3. add another test intrinsic which will also verify the nested bulk
with embedded null termination (was not verified before)

fixes valkey-io#1583


NOTE: Although it is possible to change the use of strchr to a more
"safe" utility (eg memchr) which will not pause scan at first occurrence
of '\0', we still like to protect against over excessive usage of the
query buffer and also preserve the current behavior(?). We will look
into improving this though in a followup issue.

---------

Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: ranshid <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants