Duplicate messages are not correctly deduplicated using dedup at exchange level #80

Open
fish68 opened this issue Feb 9, 2022 · 12 comments


@fish68

fish68 commented Feb 9, 2022

Hello,
I have the following setup:

RabbitMQ 3.8.27 (Erlang 23.2) running on Windows Server 2016
I've recently upgraded to Elixir 1.12.2 and deduplication plugin 0.5.3.

I run a 3-node cluster with the following policy to mirror queues; exchanges are not mirrored:

[screenshot: queue mirroring policy]

I have an exchange configured in the following way:

[screenshot: deduplication exchange configuration]

The deduplication works fine most of the time, but some messages are not correctly deduplicated and get consumed twice.

For example, yesterday I had 4 cases out of ~9000 unique messages; in one of the cases the publish timestamps were:

2022-02-08 14:28:37.550
2022-02-08 14:28:37.587

so the second publish falls well within the configured TTL and should have been deduplicated.

The deduplication key is a string (something like 101.15010) and the average payload is a JSON document of about 600 bytes, 1 KB at most.

Do you have any advice on how this could be solved, or on how to get some internal logs from the plugin?
Thank you,
Roberto

@noxdafox
Owner

noxdafox commented Feb 9, 2022

I think the culprit might lie in the cache being full.

Currently, the x-cache-size parameter controls the maximum number of entries in the cache. When the cache overflows, an item is removed to make space for the new one. Ideally, the new item should replace the oldest one (LRU caching), but currently this is not the case. This is a limitation of the caching layer I am planning to fix in the future.

So what might be happening is that you have a full cache, a new item comes in and replaces a very recent one, resulting in what you observe.

To reduce the occurrence of such issues I would recommend increasing the x-cache-size parameter and seeing if it helps. Right now you have a value of 10000, and if you publish ~9000 unique messages you are pretty close to overflowing. If there were old messages already in the cache, this might explain the issue.
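
A minimal sketch of what increasing the cache could look like, assuming rabbitmqadmin is available and using the exchange/vhost names that appear later in this thread (ex-Gateway-1 in vhost SAC); exchange arguments are immutable, so the exchange has to be deleted and re-declared, and its bindings recreated (the size below is purely illustrative):

# delete the deduplication exchange and re-declare it with a larger cache
$ rabbitmqadmin --vhost=SAC delete exchange name=ex-Gateway-1
$ rabbitmqadmin --vhost=SAC declare exchange name=ex-Gateway-1 \
    type=x-message-deduplication durable=true \
    arguments='{"x-cache-size": 100000, "x-cache-ttl": 1000}'
# bindings are dropped together with the exchange and must be re-created, e.g.:
$ rabbitmqadmin --vhost=SAC declare binding source=ex-Gateway-1 destination=my-queue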

@fish68
Author

fish68 commented Feb 9, 2022

Hi,
thank you for the very quick reply. Of course I can try to increase the size, but those 9000 messages were published over a period of 2 hours; I assumed the cache would evict the oldest entries and keep the newest ones.
All my duplicated messages are published within a few milliseconds of each other.

My fear is that increasing the cache to a 10x value would solve the issue for the moment, but that I could hit a similar issue again after a few days of processing.

@noxdafox
Owner

noxdafox commented Feb 9, 2022

Never mind what I just said above. I now recall the x-cache-ttl is in milliseconds, not seconds. Hence your entries have a 1-second TTL, not ~20 minutes.

If I understand correctly, the timestamps above are generated on the publisher side. Am I correct?

If so, are you sure you are not affected by network-congestion-induced delays?

  1. Publisher A creates message A1 and timestamps it.
  2. Publisher A publishes the message immediately.
  3. Publisher B creates message B1 and timestamps it.
  4. Publisher B tries to publish the message, but the network is congested and there is a delay.
  5. RMQ receives message A1, forwards it and adds it to the cache.
  6. The A1 cache entry expires.
  7. Finally B1 arrives; there is no cache entry for it, so the message goes through.

One way you could try to debug this is by using the timestamp exchange to record the message timestamp when it reaches the broker. That way you could spot whether these are the cases affecting you. Obviously, there might still be a delay between the moment the broker receives the message and the moment it actually checks it against the duplicates cache, for example on a very busy broker.
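
A minimal sketch of inspecting what the broker actually saw, using the management API's "get messages" endpoint on a queue bound to the exchange; the queue name here is an assumption (substitute your own), and the exact header added by the timestamp exchange may vary with the plugin version:

$ curl -s -u guest:guest -H 'content-type: application/json' \
       -d '{"count": 5, "ackmode": "ack_requeue_true", "encoding": "auto"}' \
       http://localhost:15672/api/queues/SAC/my-queue/get \
    | jq '.[] | {headers: .properties.headers, timestamp: .properties.timestamp}'

Comparing those broker-side values with the publisher-side log timestamps would show whether a duplicate was simply delayed past the TTL on its way in.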

@fish68
Author

fish68 commented Feb 9, 2022

I don't think that's it; I'm also logging on the consumer side. The messages are published at the following timestamps:

2022-02-08 14:28:37.550
2022-02-08 14:28:37.587
and received at:

2022-02-08 14:28:37.633
2022-02-08 14:28:37.700

so less than the 1000-millisecond TTL elapses between publishing and consuming.

I fear the problem could be related to the LB sending publishers to different nodes of the cluster; is there any way to log that?

@noxdafox
Owner

noxdafox commented Feb 9, 2022

The cache is shared across nodes and it's transaction-based. It's not local to each node.

If it were your LB sending clients to different nodes, you would notice a much higher number of duplicates, I guess?
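
If you still want to verify where the LB sends each client, a minimal sketch using the management API (assuming default guest credentials); each connection entry reports the cluster node that accepted it:

$ curl -s -u guest:guest http://localhost:15672/api/connections \
    | jq '.[] | {user, peer_host, node}'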

@fish68
Author

fish68 commented Feb 9, 2022

OK,
then I have no clue... according to the documentation and to what you wrote above it should be working... and it actually is fine most of the time.

@noxdafox
Owner

noxdafox commented Feb 9, 2022

I will try to reproduce this locally and perform some large-scale testing. I have never observed such behaviour in my local tests.

Unfortunately, I am quite swamped at the moment. I hope I get some spare time at the beginning of March for this.

Are you sure the client is not overriding the TTL and the cache is not overflowing? You can check the cache size by querying the exchange info.

Example:

$ curl -s -u guest:guest http://localhost:15672/api/exchanges | jq '.[] | select(.name=="exchange-name")'
{
  "arguments": {
    "x-cache-size": 1000,
    "x-cache-ttl": 100000
  },
  "cache_info": {
    "bytes": 3224,
    "entries": 9,
    "size": 1000
  }
  ...
}

@fish68
Author

fish68 commented Feb 9, 2022

OK, thank you... I will query the exchange to double-check those params.

@fish68
Author

fish68 commented Feb 9, 2022

I have checked the params and they look correct:

{
  "arguments": {
    "x-cache-size": 10000,
    "x-cache-ttl": "1000"
  },
  "auto_delete": false,
  "cache_info": {
    "bytes": 2472,
    "entries": 0,
    "size": 10000
  },
  "durable": true,
  "internal": false,
  "message_stats": {
    "publish_in": 4956,
    "publish_in_details": {
      "rate": 0.0
    },
    "publish_out": 2500,
    "publish_out_details": {
      "rate": 0.0
    }
  },
  "name": "ex-Gateway-1",
  "type": "x-message-deduplication",
  "user_who_performed_action": "myuser",
  "vhost": "SAC"
}

I will recheck them while stressing the system, to see if the number of entries in the cache gets too high.
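
A minimal sketch of how that monitoring could be scripted, assuming default guest credentials and the exchange/vhost shown above (ex-Gateway-1 in vhost SAC):

# poll the exchange's cache_info every 5 seconds and log it with a timestamp
while true; do
  curl -s -u guest:guest http://localhost:15672/api/exchanges/SAC/ex-Gateway-1 \
    | jq -c '{time: (now | todate), cache: .cache_info}'
  sleep 5
done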

@noxdafox
Owner

noxdafox commented Feb 9, 2022

It would be great to get those numbers, especially close to a moment when we see duplicates occurring.

On my side, I'll try spinning up a docker-compose environment and bombarding the system, but I can only start in March.

@fish68
Author

fish68 commented Feb 9, 2022

Yes,
I think I can do it... next time I start a massive test I will keep monitoring those values.

@neelkamath

I'm facing this issue too.

I have an exchange named deduplicator:

[screenshot: deduplicator exchange]

I've set the relevant arguments such as x-cache-persistence: disk, and bound it to a queue named temp (usually it's just bound to msgs but I added temp here for the demo):
[screenshot: temp queue bound to the deduplicator exchange]

The temp queue received 16 messages having the same deduplication header as shown via "find in page":
[screenshot: 16 messages with the same deduplication header in the temp queue]
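
For anyone trying to reproduce this, a minimal sketch that publishes the same deduplication key twice in quick succession through the management API; the vhost (%2F here) and routing key are assumptions, and the plugin's x-deduplication-header is used as the key, so when deduplication works only the first publish should be reported as routed:

$ BODY='{"routing_key": "", "payload": "test", "payload_encoding": "string",
         "properties": {"headers": {"x-deduplication-header": "same-key"}}}'
$ curl -s -u guest:guest -H 'content-type: application/json' -d "$BODY" \
       http://localhost:15672/api/exchanges/%2F/deduplicator/publish
$ curl -s -u guest:guest -H 'content-type: application/json' -d "$BODY" \
       http://localhost:15672/api/exchanges/%2F/deduplicator/publish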
