Classic queue mirror runs into a function_clause during a rolling upgrade to 3.12.0-beta.6 #7883

Closed
gomoripeti opened this issue Apr 12, 2023 · 3 comments

@gomoripeti
Contributor

Describe the bug

When upgrading a multi-node cluster from 3.11.10 to 3.12.0-beta.6, I see the crash below.

** Generic server <0.1683.0> terminating
** Last message in was {'$gen_cast',
                           {gm,{discard,<15210.8329.0>,flow,
                                   <<244,215,79,6,212,217,23,44,46,232,176,
                                     225,205,41,240,91>>}}}
...
** Reason for termination ==
** {function_clause,
    [{rabbit_mirror_queue_slave,process_instruction,
      [{discard,<15210.8329.0>,flow,
        <<244,215,79,6,212,217,23,44,46,232,176,225,205,41,240,91>>},
       {state,
...

I think it is related to PR #7802. During the rolling upgrade, the queue master is still on the old version while the slave is already on the new version.

Unfortunately, I don't fully understand when the discard message is sent, or whether our load-test tool uses auto-ack.

Reproduction steps

  1. Create a 3-node cluster with RabbitMQ 3.11.10
  2. Create a mirrored classic queue with master on node-02 and slave on node-01
  3. Generate continuous moderate traffic (I used ~0.25 msg/s for both the publish and consume rates, with reconnecting clients)
  4. Upgrade node-01 to 3.12.0-beta.6
    ...

Expected behavior

No crash during a rolling upgrade to 3.12.0 when mirrored classic queues are present.

Additional context

If I understand correctly, #7802 changed the format of a message passed between nodes. That could be addressed with a feature flag, but it may be simpler to also handle the old format for the time being.

@gomoripeti gomoripeti added the bug label Apr 12, 2023
@michaelklishin michaelklishin changed the title Mirrored CQ slave crash during rolling upgrade to 3.12.0-beta.6 Classic queue mirror runs into a function_clause during a rolling upgrade to 3.12.0-beta.6 Apr 13, 2023
@michaelklishin
Member

This is not something that's easy to hide behind a feature flag: the message (or previously, the message ID) is passed through multiple functions in several modules.

@michaelklishin
Member

Can you try v3.12.0-alpha.110 to compare?

@gomoripeti
Contributor Author

Thanks, I will try to test it in a few days. I'm sure the revert eliminates the crash.
I'm sorry to see the change reverted. I was thinking that the follower could be made backwards compatible (in my case, I saw the follower on the new version and the leader on the old one). But I guess that in an absurd edge case (e.g. a queue declared while the cluster is already in a mixed-version state), the leader could be on the new version and the follower on the old one. In that case the follower cannot be forward compatible.
