Adding TTL to proxied messages #276

nash8114 · 2024-09-13T15:21:11Z

A message proxied to another node by the DynamicSupervisor can be proxied again. This is clearly evident when using a large number of nodes and picking Horde.UniformRandomDistribution as a distribution strategy. Pushing your luck with that strategy 🍀

This commit contains an implementation which limits the amount of times a message can be proxied before expiring, by adding a TTL which works similar to the TTL on IP packets.

The default TTL is :infinity, which means the implementation is backwards compatible. Message with TTL :infinity can bounce around between nodes forever. The max TTL can be set to any integer via the new :proxy_message_ttl option. Each "hop" decreases the TTL for a message by 1 one. When a message with a TTL of zero needs to be proxied, an error will be returned to the reply_to process.

With a distribution strategy of Horde.UniformDistribution the issue of passing messages around forever is unlikely, as nodes tend to agree on the outcome of choose_node. However, a recent incident whilst upgrading to OTP 27 brought this issue to light, as the underlying algorithm for choosing nodes was broken on our already upgraded nodes. This causes message to be proxied between our nodes infinitely.

Please feel free to pass any form of judgement on the implementation. This is just a jumping-off platform and it can only go uphill from here 👍

With TTL of 2:

graph TD;
p[process A] -- call start_child() --> A[node A];
A -- proxy start_child(ttl: 2) --> B[node B];
B -- proxy start_child(ttl: 1) --> C[node C];
C -- reply {:error, :proxy_operation_ttl_expired, ...} --> p;

A message proxied to another node can be proxied again. This is clearly evident when picking `Horde.UniformRandomDistribution` as a distribution strategy, especially in combination with a large number of nodes. This commit contains an implementaiton which limits the amount of times a message can be proxied before expiring, by adding a TTL which works similar to the TTL on IP packets.

As indicated to me by a colleague after a code review, the implementation will cause issues when upgrading from the current version to the new version. The tuple size for :proxy_operation changed from 3 to 4, which means nodes with old and new versions won't be able to proxy to eachother. This commit mitigates that to a degree. The new implementation supports both tuple size 3 and 4. And when :proxy_message_ttl is set to :infinity (the default), `proxy_to_node` will use tuple size 3. This means that upgrade is possible if during the upgrade `:proxy_message_ttl` option is not set, or set to :infinity. If it is set to an integer value, the upgrade will cause proxy messeage from upgraded nodes to old nodes to fail. The CHANGELOG would need to reflect this as an upgrade risk.

nash8114 · 2024-09-16T15:21:01Z

I've pushed a new commit. The pushed changes should help fix an issue upgrading to this new version.
The implementation now supports both tuple size 3 and 4 for :proxy_operation. When :proxy_message_ttl is set to :infinity (the default), proxy_to_node will use tuple size 3. This means nodes running the new version and the old version can still proxy messages to each other. If during the upgrade :proxy_message_ttl would be set to an integer value, the upgrade will cause :proxy_operation messeages from upgraded nodes to old nodes to fail. The CHANGELOG would need to reflect this as an upgrade risk.

derekkraan · 2024-09-16T15:22:42Z

Yes, please do add an entry to the changelog for this.

derekkraan · 2024-09-18T06:04:28Z

The approach you are taking in this PR is to send an error when TTL goes to 0.

But could we also just start the process when TTL is 0?

What do you think?

nash8114 · 2024-09-18T09:52:06Z

That sounds like a great idea.

I've avoided it so far as it means that proxy_to_node would need to be able to make a distinction between different messages. Currently it is completely message agnostic. The only two message types it currently handles are :start_child and :terminate_child. Performing :start_child locally when TTL runs out makes a lot of sense. The same is not true for terminate_child.

So if you're OK with proxy_to_node being message specific, I can build in an exception for :start_child. But how to handle an expired TTL for :terminate_child? This should not occur of course, except when the TTL is set to 0 or there is some corruption in the state (where two nodes disagree on who owns a process)

nash8114 · 2024-09-18T14:09:41Z

Pushed an update which changes the start_child logic to perform add_child locally if the TTL has expired. Tests have been updated accordingly.

On a another note, I am considering returning an error when starting a Horde.DynamicSupervisor with a :proxy_message_ttl set to zero, as obviously no one should ever set it to zero, and it would cause issues for terminate_child.

There is also still the issue that :proxy_message_ttl defaults to :infinity, which allows for upgrading seamlessly from a previous version to this version. But at what point would the :proxy_message_ttl default to something else? Or should the README indicate that setting the proxy TTL (after upgrade) is recommended?

With TTL of 2:

terminate_child is not likely to loop, as the node on which a process is running is known. If it does loop for some odd reason, the message would expire.

graph TD;
pa[process A] -- call start_child() --> A[node A];
A -- proxy start_child(ttl: 2) --> B[node B];
B -- proxy start_child(ttl: 1) --> C[node C];
C -- TTL expired, perform add_child and reply {:ok,&nbsp;pid} --> pa;

pb[process B] -- call terminate_child() --> Ab[node A];
Ab -- proxy terminiate_child(ttl:&nbsp;2) --> Bb[node B];
Bb -- reply :ok --> pb;

derekkraan · 2024-09-18T15:27:27Z

If we document it in the documentation and in the changelog, that should be enough. I'll emphasise it when doing the usual tweets so that hopefully more people get the message. But maybe it's also something most people will only discover if they also have the issue in question.

nash8114 · 2024-09-18T18:29:24Z

I'll write something about the option in the readme.

In normal cases the TTL should not be necessary, but there are some cases which are improved by this. Most notably:

TTL allows a seamless migration to a different distribution algorithm. Today migrating to a different algorithm would likely cause loops. We are considering changing algorithm, and that is difficult to do unless you're really strategic about it.
TTL prevents the UniformRandomDistribution algorithm from passing messages too often (image 100 nodes with UniformRandomDistribution, each having a 1% chance of actually handling a start_child, and 99% change of playing hot potato)
TTL prevents issues when the algorithm fails, like we experienced in our application due to an Erlang upgrade.

…option

Han Bol added 2 commits September 13, 2024 16:53

Proxied message for start_child are handled locally when TTL has expired

19e88e7

Updated documentation of Horde.DynamicSupervisor to indicate the new …

38dccad

…option

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding TTL to proxied messages #276

Adding TTL to proxied messages #276

nash8114 commented Sep 13, 2024

nash8114 commented Sep 16, 2024

derekkraan commented Sep 16, 2024

derekkraan commented Sep 18, 2024

nash8114 commented Sep 18, 2024

nash8114 commented Sep 18, 2024 •

edited

Loading

derekkraan commented Sep 18, 2024

nash8114 commented Sep 18, 2024

Adding TTL to proxied messages #276

Are you sure you want to change the base?

Adding TTL to proxied messages #276

Conversation

nash8114 commented Sep 13, 2024

With TTL of 2:

nash8114 commented Sep 16, 2024

derekkraan commented Sep 16, 2024

derekkraan commented Sep 18, 2024

nash8114 commented Sep 18, 2024

nash8114 commented Sep 18, 2024 • edited Loading

With TTL of 2:

derekkraan commented Sep 18, 2024

nash8114 commented Sep 18, 2024

nash8114 commented Sep 18, 2024 •

edited

Loading