chore: enhanced error propagation from publish + enable fast path send even observeres are added #1243

NagyZoltanPeter · 2025-02-06T09:42:52Z

Description

Nwaku new lightpush protocol enriches the information sent back to its clients in case a problem occurs during relaying messages. This implies enhancing pubsub/gossipsub publish method.
Along with the latest exception indication added in 1.8.0, I decided while keeping the original interface any errors will be propagated via exception to the caller.
Hence new exceptions are defined.

Disclaimer about the interface change.

To avoid breaking change in users of pubsub.publish method - that usually only expects a non-zero return value in case of success and zero in case of any problem occurs.
Therefore old publish interface will digest all exception and translate it zero return.
New publishEx interface is added to be used by waku and for any other heavy user that needs more comprehensive feedback from the operation.

Optimizing message observers

Another shortcoming found and enhanced within this PR. While using any kind of observer on pubsub message relaying forces messages to be encoded as many peers it sends to. It is totally useless if the observer does not want to modify the message. While keeping the option to add an invasive observer before message-send, it is extended to having a view-only after-send observer. This allows fast path send whenever possible.

pubsub/gossipsub publish will raise specific exception according to problem occured during processing the message. pubsub observers are modified with distinch onBeforeSend and onAfterSent callbacks - this helps keep sending messages on fast path. (This is needed for nwaku - that has no intent on modifying message before forwarding but still observs them)

…riginal behavior and separate new exception based error propagation under publishEx interface

github-actions · 2025-02-06T09:43:14Z

Commits must follow the Conventional Commits specification
Please fix these commit messages:

Fix infinite recursive calls in publish, extend observer tests
non-virtual-interface pattern applied for publish interface to keep original behavior and separate new exception based error propagation under publishEx interface
Adapt pubsub test to changed obeservers

arnetheduck · 2025-02-06T09:54:43Z

the original sin here is that publish allows raising errors at all - publish is a "broadcast" operation, ie "fire-and-forget" - its nature is such that it cannot fail by definition and thus shouldn't be raising anything at all - if the higher-level logic considers some of these things an error, this function can return a richer set of statistics than "number of peers that wanted the message".

arnetheduck · 2025-02-06T09:55:10Z

libp2p/protocols/pubsub/errors.nim

+type
+  PublishingError* = object of LPError
+
+  NoTopicSpecifiedError* = object of PublishingError


this is an API violation, should not be a catchable error

I don't get your point here.

the caller is responsible for verifying that the topic is not empty - this is not something to raise an exception about (it could potentially be an assert)

In pubsub, a "" string is allowed as there's no explicit validation rejecting it even if it semantically makes no sense. In https://github.com/libp2p/specs/blob/master/pubsub/README.md#the-topic-descriptor it is specified names can be descriptive or random or anything that the creator chooses.

Not allowing empty topics should be a concern for the application using libp2p, and we should not enforce this in nim-libp2p.

arnetheduck · 2025-02-06T09:55:44Z

libp2p/protocols/pubsub/errors.nim

+  PublishingError* = object of LPError
+
+  NoTopicSpecifiedError* = object of PublishingError
+  PayloadIsEmptyError* = object of PublishingError


this is either an api violation or allowed (I don't remember if empty payloads are allowed by libp2p)

https://github.com/libp2p/specs/tree/master/pubsub#the-message suggests that at least from libp2p spec perspective t's allowed:

The data (optional) field is an opaque blob of data representing the payload. It can contain any data that the publisher wants it to.

We will skip publishing anyway, hance this feedback.

A couple of questions:

Is this not basically another logic error in the calling side? If the caller doesn't want to send empty messages, and the caller presumably typically has some framing protocol itself whereby even say "empty" chat messages (or other Waku messages without Waku-application-level-payloads) would from a libp2p perspective have some content; and probably more importantly;

it's inexpensively detectable at the caller end, and creates a new exception where one doesn't current exist by involving libp2p in this detection process.

So it's basically another logic error.

I agree with @tersec. The caller is the one having the context to determine whether a message with empty data is valid or not. Adding this validation in publish would mean we're no longer being compliant with the spec.

arnetheduck · 2025-02-06T09:56:01Z

libp2p/protocols/pubsub/errors.nim

+
+  NoTopicSpecifiedError* = object of PublishingError
+  PayloadIsEmptyError* = object of PublishingError
+  DuplicateMessageError* = object of PublishingError


Publishing an "already known" message is not an error in distributed systems in general - it is expected that it can happen as a consequence of the chosen message identity function because messages can be generated in arbitrary order and sometimes end up being the same (imagine two nodes existing for redundancy and publishing the same data to the network from different entry points into the mesh).

You're right, but it is still information for the caller why the message is not relayed, so it might not need to re-try with this peer, maybe not with other peers.

You're right, but it is still information for the caller why the message is not relayed, so it might not need to re-try with this peer, maybe not with other peers.

When the message is in seen, the message has been relayed - either directly, indirectly or topped up with IHAVE/IWANT.

You can think of publish as "put this message on the pubsub network where it will stay for a period of time as agreed by the seen/cache timeouts" - after that's been done once by any peer for a particular message (as identified by the id function), the network, and not the caller, is responsible for ensuring that all peers get to observe it - it would be incorrect for lightpush to try again and the API of publish should reflect this point so that users of libp2p don't make this mistake in their code.

arnetheduck · 2025-02-06T09:56:15Z

libp2p/protocols/pubsub/errors.nim

+  NoTopicSpecifiedError* = object of PublishingError
+  PayloadIsEmptyError* = object of PublishingError
+  DuplicateMessageError* = object of PublishingError
+  NotSubscribedToTopicError* = object of PublishingError


You don't have to be subscribed to publish messages

arnetheduck · 2025-02-06T09:56:28Z

libp2p/protocols/pubsub/errors.nim

+  PayloadIsEmptyError* = object of PublishingError
+  DuplicateMessageError* = object of PublishingError
+  NotSubscribedToTopicError* = object of PublishingError
+  NoPeersToPublishError* = object of PublishingError


this is covered by the return value

Yes, but I choose the way to treat as a failure as your message has not been tried to be relayed at all.
Still there is the possiblity it won't be relayed due to various reasons. That uncertainty is another story to tackle within waku.

This is not failure from the point of view of pubsub - the message has been put in the pubsub message cache and when a peer connects it will eventually get broadcast. Peers come and go and the system as a whole is designed to handle this particular case - any protocol using pubsub (such as lightpush) must be robust according to these semantics, ie handle them gracefully.

arnetheduck · 2025-02-06T09:57:18Z

libp2p/protocols/pubsub/errors.nim

+  DuplicateMessageError* = object of PublishingError
+  NotSubscribedToTopicError* = object of PublishingError
+  NoPeersToPublishError* = object of PublishingError
+  GeneratingMessageIdError* = object of PublishingError


this is a programming error - you should not be attempting to publish messages that are invalid by your own message id generation function - the default id generation function cannot fail

Well, I used a dumb approach not questioning the valid occurrence of an error just followed the code where it can return log and error and return 0.
I think it is there as message id generator can be replaced and there is no guarantee not getting error from it.

This seems like a buggy or badly-specified message ID function, then.

tersec · 2025-02-06T09:57:29Z

libp2p/protocols/pubsub/pubsubpeer.nim

+  p.observers = observers
+
+proc hasObservers*(p: PubSubPeer): bool =
+  p.observers != nil and anyIt(p.observers[], it != nil)


Why is it allowed to set nil observers within the list? It creates checking for this basically logic-error condition throughout the PR.

Good catch. Obviously not as it's being initialize with the pubsub instance.
It just had been there from loooong time.

I wonder why observers needs to be ref?

tersec · 2025-02-06T09:59:09Z

libp2p/protocols/pubsub/pubsub.nim

+): Future[int] {.async: (raises: [LPError]), public.} =
+  try:
+    return await p.doPublish(topic, data)
+  except LPError as ex:


As a general smaller (not-major-design-point), Nim's code generation can/does sometimes create more efficient code when the as ex part of this is excluded, as it simply never materializes the exception as a variable.

So if it's truly not used, better not to create it.

Yes, absolutely.

NagyZoltanPeter · 2025-02-06T10:30:38Z

@arnetheduck, @tersec : First of all thank you guys looking after it, I really appreciate all and any comments.
Will answer where I can.

The initial story behind is waku lightpush as a non-relay protocol access to the relay network needed some more detailed responses towards client side (aka edge node).
As since last year waku team has been heavily working on message delivery reliability features, this part should allow client side (SDK) to decide upon actions in case a message is not delivered. for some reason. We found it is not enough to say you have got zero peers your message delivered.
Unfortunately, many errors are hidden behind the curtains of the pub-sub interface—even those relating to configuration or client-side errors with the message itself.

There were two restrictions I put first - not to change the interface (...as I checked other libp2p implementations having exact same publish interface) and try not break other clients.
First hand with libp2p v1.8.0 as Richard added raises pragma it turned out that it is possible the publish can raise exception, this gave the option to propagate errors via exceptions. But I also found no-one expects exceptions by using publish(...) ... so it can easily break other users of it.
Hence the try separate publish and publishEx - for more comprehensive use.
(Also to notice I planned other performance enablers for waku on publishEx interface - but not in this PR)

NagyZoltanPeter · 2025-02-06T10:57:56Z

cc: @arnetheduck , @Ivansete-status, @richard-ramos:
I would appreciate a special review on the last commit:
aa5651b

This tries to utilize a polymorphic approach to separate the interface from internal implementation down the inheritance chain.

But. It turns to a seg-fault while running specific pubsub tests.
Like:
nim c -d:debug --styleCheck:usages --styleCheck:error --verbosity:0 --hints:off --skipUserCfg -f --threads:on --opt:speed -d:libp2p_pubsub_sign=true -d:libp2p_pubsub_verify=true -r tests/pubsub/testgossipsub

running it result an immediate segfault right at start without any stack trace. I may overlooked something however I tried out this solution in a dummy code, where it worked. So I may overlook something more complicated initialization within pubsub/floodsub/gossipsub.

Thank you.

update: I found the issue that caused an infinite recursion... my bad.

tersec · 2025-02-06T10:59:47Z

In general, #950 discusses this as well.

arnetheduck · 2025-02-06T11:04:34Z

many errors are hidden behind the curtains of the pub-sub interface

I think the content of this PR may reflect some conceptual misunderstandings with respect to what publish does as specified by libp2p - in particular the points about publishing various forms of invalid messages which suggest bugs in waku (if they are happening) as well as publishing things already in the seen cache which conceptually is not an error.

I see two outcomes from a PR like this:

the conceptual points should be explained in the function documentation, for example preconditions like "the message id function must be able to generate a message id for the message you're trying to publish" and semantic points of note ("a message with an id that has already been observed recently on the network will not be republished" and "you do not have to be subscribed to a topic to publish there").
the return value does indeed not distinguish between "no peers in mesh" and "message seen already" - this distinction could be worthy of a change / extension to the API because it indeed may represent an interesting point to make
- as an alternative to changing the API, we can change the return value in the case that the message is in "seen": instead of returning 0, we can return the number of peers in the mesh as an indication of how many peers "reasonably" know about the message (msg in seen indicates that all your neighbours reasonably can be assumed to have observed the message, just like after sending it to them).

richard-ramos · 2025-02-06T12:13:55Z

@arnetheduck: the original sin here is that publish allows raising errors at all - publish is a "broadcast" operation, ie "fire-and-forget" - its nature is such that it cannot fail by definition and thus shouldn't be raising anything at all

There's currently an scenario in which publish can raise an LPError, if anonimize is false and we enter this branch of code:

nim-libp2p/libp2p/protocols/pubsub/rpc/message.nim

Lines 85 to 86 in c5aa373

    
           if sign: 
        
             raise (ref LPError)(msg: "Cannot sign message without peer info")

This can happen if passing a nil peerInfo when calling newSwitch. I identified this while adding the raises annotations in the pubsub module.

If SwitchBuilder is used, this scenario will not happen, because build will populate the peerInfo. So maybe we can instead of raising the LPError, to doAssert(false, "Cannot sign message without peer info"), as running into this is likely a misconfiguration?

arnetheduck · 2025-02-06T12:25:23Z

as running into this is likely a misconfiguration?

yeah - that's probably reasonable - or we can log it, or we can maybe check for it during startup ..

The risk we want to avoid is that libp2p crashes in potentially "uncommon" scenarios when it already looks like it's "up and running" - if we add an assert here, we should also add a "regular" error somewhere else so that it (ideally) becomes hard or impossible to reach the point of trying to publish messages like this.

arnetheduck · 2025-02-06T12:26:58Z

ie the intent in past discussions about publish was that it should never raise anything - that you discovered a raise is testament to the efficacy of adding raises annotations to enforce things ;)

Ivansete-status

Thanks for it! 🙌
Some comments so far

Ivansete-status · 2025-02-06T10:39:33Z

libp2p/protocols/pubsub/gossipsub.nim

    g: GossipSub, topic: string, data: seq[byte]
 ): Future[int] {.async: (raises: [LPError]).} =
  logScope:
    topic

  if topic.len <= 0: # data could be 0/empty
    debug "Empty topic, skipping publish"
-    return 0
+    raise newException(NoTopicSpecifiedError, "Empty topic, skipping publish")


I'd suggest using Result instead of exceptions. Applies elsewhere.
I also think we don't need new error types.

I'd rather go for something much simpler like:

Future[Result[int, string]]

I don't care the type of error. To me, it is only interesting that an error happened and I can get the detail directly from the underlying string.

Yeah we had this debate and we can argue about it different ways.
It is clear the Future is designed to be able to handle different failure cases.
One is that is capable to catch exceptions and report it to the listener on the other end as error.
This is because async task might not get waited where it has been launched. So exception handling might detach.
If you do await it, it will be raised so you need to catch it, while if you just let it go you will be able to examine fail state of the future when it completes.

Such a solution implies that a future might fail in two different ways. If there is an exception. than f.failed() will be true. While if future is ok you still need to check result for isErr?... I'm not sure how it's gonna make code more clean in general.

For this particular case, of course, if we decide that publish should never raise an exception, that is ok to sign somehow that something went wrong you need to deal with it.

Ivansete-status · 2025-02-06T14:41:45Z

libp2p/protocols/pubsub/pubsubpeer.nim

+  p.observers = observers
+
+proc hasObservers*(p: PubSubPeer): bool =
+  p.observers != nil and anyIt(p.observers[], it != nil)


I wonder why observers needs to be ref?

Ivansete-status · 2025-02-06T14:44:20Z

libp2p/protocols/pubsub/pubsubpeer.nim

      var mm = msg
      # trigger send hooks
-      p.sendObservers(mm)
+      p.beforeSendObservers(mm)


What is the benefit/need of having beforeSendObservers? Wouldn't it suffice by just having the sendObservers moved after the actual sendEncoded call?

Well it all comes from the original implementation where sendObservers were able to modify the messages before sending - peers by peer. Thus we went to slow path even if we did not intended to change anything internal to messages - I can imagine some use cases for some specific use of libp2p but not waku obviously.
So to keep the original option to influence messages so the sendObserver has became two. So user has the freedom to choose, and will definitely allow fast path sending if afterSentObserver is used.

Ivansete-status · 2025-02-06T14:46:01Z

libp2p/protocols/pubsub/pubsubpeer.nim

@@ -495,6 +510,8 @@ proc send*(
    trace "sending msg to peer", peer = p, rpcMsg = shortLog(msg)
    asyncSpawn p.sendEncoded(encoded, isHighPriority)

+  p.afterSentObservers(msg)


I don't know if we can ensure 100% that the msg has been sent at this point. Notice that the sendEncoded spawns an async task.

Maybe better to invoke the afterSentObservers(msg) inside sendEncoded?

richard-ramos · 2025-02-06T15:44:13Z

@NagyZoltanPeter in #1244 I removed the exception being raised in publish as it can't possible be raised unless one of us nim-libp2p devs introduce a bug :)

NagyZoltanPeter · 2025-02-06T18:06:11Z

@NagyZoltanPeter in #1244 I removed the exception being raised in publish as it can't possible be raised unless one of us nim-libp2p devs introduce a bug :)

Yeah, thanks! I will revert my solution to the original design that details errors in Result.

NagyZoltanPeter added 3 commits February 4, 2025 11:11

Adapt pubsub test to changed obeservers

9663163

non-virtual-interface pattern applied for publish interface to keep o…

aa5651b

…riginal behavior and separate new exception based error propagation under publishEx interface

NagyZoltanPeter requested review from richard-ramos, gabrielmer and Ivansete-status February 6, 2025 09:42

NagyZoltanPeter self-assigned this Feb 6, 2025

NagyZoltanPeter mentioned this pull request Feb 6, 2025

feat: lightpush v2 waku-org/nwaku#3279

Draft

3 tasks

arnetheduck reviewed Feb 6, 2025

View reviewed changes

tersec reviewed Feb 6, 2025

View reviewed changes

Ivansete-status reviewed Feb 6, 2025

View reviewed changes

Fix infinite recursive calls in publish, extend observer tests

12cbd26

richard-ramos mentioned this pull request Feb 6, 2025

refactor(pubsub): do not raise exceptions on publish #1244

Merged

chore: enhanced error propagation from publish + enable fast path send even observeres are added #1243

Are you sure you want to change the base?

chore: enhanced error propagation from publish + enable fast path send even observeres are added #1243

Conversation

NagyZoltanPeter commented Feb 6, 2025 • edited Loading

Description

Disclaimer about the interface change.

Optimizing message observers

github-actions bot commented Feb 6, 2025 • edited Loading

arnetheduck commented Feb 6, 2025

Choose a reason for hiding this comment

NagyZoltanPeter Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arnetheduck Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NagyZoltanPeter commented Feb 6, 2025

NagyZoltanPeter commented Feb 6, 2025 • edited Loading

update: I found the issue that caused an infinite recursion... my bad.

tersec commented Feb 6, 2025

arnetheduck commented Feb 6, 2025

richard-ramos commented Feb 6, 2025

arnetheduck commented Feb 6, 2025

arnetheduck commented Feb 6, 2025

Ivansete-status left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

richard-ramos commented Feb 6, 2025

NagyZoltanPeter commented Feb 6, 2025

NagyZoltanPeter commented Feb 6, 2025 •

edited

Loading

github-actions bot commented Feb 6, 2025 •

edited

Loading

NagyZoltanPeter Feb 6, 2025 •

edited

Loading

arnetheduck Feb 6, 2025 •

edited

Loading

NagyZoltanPeter commented Feb 6, 2025 •

edited

Loading