Classic queue mirror runs into a function_clause during a rolling upgrade to 3.12.0-beta.6 #7883

gomoripeti · 2023-04-12T21:25:09Z

Describe the bug

When upgrading a multi-node cluster from 3.11.10 to 3.12.0-beta.6 I see the below crash.

** Generic server <0.1683.0> terminating
** Last message in was {'$gen_cast',
                           {gm,{discard,<15210.8329.0>,flow,
                                   <<244,215,79,6,212,217,23,44,46,232,176,
                                     225,205,41,240,91>>}}}
...
** Reason for termination ==
** {function_clause,
    [{rabbit_mirror_queue_slave,process_instruction,
      [{discard,<15210.8329.0>,flow,
        <<244,215,79,6,212,217,23,44,46,232,176,225,205,41,240,91>>},
       {state,
...

I think it is related to the following PR #7802. In case of a rolling upgrade the queue master is on the old version while the slave is on the new version.

Unfortunately I don't fully understand when the discard message is sent and whether our load-test tool uses auto-ack.

Reproduction steps

1. Create a 3-node cluster with RabbitMQ 3.11.10
2. Create a mirrored classic queue with master on node-02 and slave on node-01
3. Generate continuous moderate traffic (I used ~0.25 msg/sec both publish and consume rate and reconnecting clients)
4. Upgrade node-01 to 3.12.0-beta.6
...

Expected behavior

No crash in case of rolling upgrade to 3.12.0 in presence of mirrored classic queues.

Additional context

If I understand correctly, #7802 changed the format of a message passed across nodes. That could be addressed by a feature flag, but it may be simpler to just handle the old format as well for the time being.
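To illustrate what "handle the old format as well" could look like, here is a minimal, self-contained sketch. It is not the actual `rabbit_mirror_queue_slave` code: the `#msg{}` record, `do_discard/3`, and the list used as state are hypothetical stand-ins, and the shape of the new instruction (the whole message in place of the message ID) is an assumption based on the title of the change.

```erlang
-module(discard_compat_sketch).
-export([process_instruction/2, demo/0]).

%% Hypothetical stand-in for the broker's message record;
%% the real record has more fields.
-record(msg, {id :: binary(), payload :: binary()}).

%% Assumed new format: the whole message is passed along.
process_instruction({discard, ChPid, _Flow, #msg{id = MsgId}}, State) ->
    do_discard(ChPid, MsgId, State);
%% Old format, as seen in the crash above: only the message ID (a binary).
%% Without a clause like this one the call fails with function_clause.
process_instruction({discard, ChPid, _Flow, MsgId}, State)
  when is_binary(MsgId) ->
    do_discard(ChPid, MsgId, State).

%% Hypothetical helper: records the discarded ID in a plain list
%% that stands in for the real mirror state.
do_discard(_ChPid, MsgId, State) ->
    {ok, [MsgId | State]}.

%% Feeds both instruction formats through the same function.
demo() ->
    Id = <<244,215,79,6>>,
    {ok, S1} = process_instruction({discard, self(), flow, Id}, []),
    Msg = #msg{id = Id, payload = <<"x">>},
    {ok, S2} = process_instruction({discard, self(), flow, Msg}, S1),
    S2.
```

This only covers the case where the mirror is on the new version and receives the old format. A mirror still running the old code has no way to understand the new format.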


michaelklishin · 2023-04-13T07:18:10Z

This is not something that's easy to hide behind a feature flag: the message (or previously, the message ID) is passed along multiple methods in several modules.

michaelklishin · 2023-04-13T14:16:06Z

Can you try v3.12.0-alpha.110 to compare?

gomoripeti · 2023-04-14T09:26:08Z

Thanks, I will try to test it in a few days. I'm sure the revert eliminates the crash.
I'm sorry to see the change reverted. I was thinking that the follower could be made backwards compatible (in my case the follower was on the new version and the leader on the old one). But I guess that in an absurd edge case (e.g. if a queue is declared while the cluster is already in a mixed-version state) the leader could have the new version and the follower the old one, in which case the follower cannot be forward compatible.

gomoripeti added the bug label Apr 12, 2023

michaelklishin changed the title ~~Mirrored CQ slave crash during rolling upgrade to 3.12.0-beta.6~~ Classic queue mirror runs into a function_clause during a rolling upgrade to 3.12.0-beta.6 Apr 13, 2023

michaelklishin mentioned this issue Apr 13, 2023

Revert "Pass the message to rabbit_backing_queue:discard callback " #7884

Merged

michaelklishin closed this as completed Apr 13, 2023
