Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: enhanced error propagation from publish + enable fast path send even observeres are added #1243

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

NagyZoltanPeter
Copy link
Contributor

@NagyZoltanPeter NagyZoltanPeter commented Feb 6, 2025

Description

Nwaku new lightpush protocol enriches the information sent back to its clients in case a problem occurs during relaying messages. This implies enhancing pubsub/gossipsub publish method.
Along with the latest exception indication added in 1.8.0, I decided while keeping the original interface any errors will be propagated via exception to the caller.
Hence new exceptions are defined.

Disclaimer about the interface change.

To avoid breaking change in users of pubsub.publish method - that usually only expects a non-zero return value in case of success and zero in case of any problem occurs.
Therefore old publish interface will digest all exception and translate it zero return.
New publishEx interface is added to be used by waku and for any other heavy user that needs more comprehensive feedback from the operation.

Optimizing message observers

Another shortcoming found and enhanced within this PR. While using any kind of observer on pubsub message relaying forces messages to be encoded as many peers it sends to. It is totally useless if the observer does not want to modify the message. While keeping the option to add an invasive observer before message-send, it is extended to having a view-only after-send observer. This allows fast path send whenever possible.

pubsub/gossipsub publish will raise specific exception according to problem occured during processing the message.
pubsub observers are modified with distinch onBeforeSend and onAfterSent callbacks - this helps keep sending messages on fast path.
(This is needed for nwaku - that has no intent on modifying message before forwarding but still observs them)
…riginal behavior and separate new exception based error propagation under publishEx interface
Copy link

github-actions bot commented Feb 6, 2025

Commits must follow the Conventional Commits specification
Please fix these commit messages:

Fix infinite recursive calls in publish, extend observer tests
non-virtual-interface pattern applied for publish interface to keep original behavior and separate new exception based error propagation under publishEx interface
Adapt pubsub test to changed obeservers

@arnetheduck
Copy link
Contributor

the original sin here is that publish allows raising errors at all - publish is a "broadcast" operation, ie "fire-and-forget" - its nature is such that it cannot fail by definition and thus shouldn't be raising anything at all - if the higher-level logic considers some of these things an error, this function can return a richer set of statistics than "number of peers that wanted the message".

type
PublishingError* = object of LPError

NoTopicSpecifiedError* = object of PublishingError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an API violation, should not be a catchable error

Copy link
Contributor Author

@NagyZoltanPeter NagyZoltanPeter Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get your point here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the caller is responsible for verifying that the topic is not empty - this is not something to raise an exception about (it could potentially be an assert)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pubsub, a "" string is allowed as there's no explicit validation rejecting it even if it semantically makes no sense. In https://github.com/libp2p/specs/blob/master/pubsub/README.md#the-topic-descriptor it is specified names can be descriptive or random or anything that the creator chooses.

Not allowing empty topics should be a concern for the application using libp2p, and we should not enforce this in nim-libp2p.

PublishingError* = object of LPError

NoTopicSpecifiedError* = object of PublishingError
PayloadIsEmptyError* = object of PublishingError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is either an api violation or allowed (I don't remember if empty payloads are allowed by libp2p)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/libp2p/specs/tree/master/pubsub#the-message suggests that at least from libp2p spec perspective t's allowed:

The data (optional) field is an opaque blob of data representing the payload. It can contain any data that the publisher wants it to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will skip publishing anyway, hance this feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions:

  • Is this not basically another logic error in the calling side? If the caller doesn't want to send empty messages, and the caller presumably typically has some framing protocol itself whereby even say "empty" chat messages (or other Waku messages without Waku-application-level-payloads) would from a libp2p perspective have some content; and probably more importantly;

  • it's inexpensively detectable at the caller end, and creates a new exception where one doesn't current exist by involving libp2p in this detection process.

So it's basically another logic error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tersec. The caller is the one having the context to determine whether a message with empty data is valid or not. Adding this validation in publish would mean we're no longer being compliant with the spec.


NoTopicSpecifiedError* = object of PublishingError
PayloadIsEmptyError* = object of PublishingError
DuplicateMessageError* = object of PublishingError
Copy link
Contributor

@arnetheduck arnetheduck Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Publishing an "already known" message is not an error in distributed systems in general - it is expected that it can happen as a consequence of the chosen message identity function because messages can be generated in arbitrary order and sometimes end up being the same (imagine two nodes existing for redundancy and publishing the same data to the network from different entry points into the mesh).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but it is still information for the caller why the message is not relayed, so it might not need to re-try with this peer, maybe not with other peers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but it is still information for the caller why the message is not relayed, so it might not need to re-try with this peer, maybe not with other peers.

When the message is in seen, the message has been relayed - either directly, indirectly or topped up with IHAVE/IWANT.

You can think of publish as "put this message on the pubsub network where it will stay for a period of time as agreed by the seen/cache timeouts" - after that's been done once by any peer for a particular message (as identified by the id function), the network, and not the caller, is responsible for ensuring that all peers get to observe it - it would be incorrect for lightpush to try again and the API of publish should reflect this point so that users of libp2p don't make this mistake in their code.

NoTopicSpecifiedError* = object of PublishingError
PayloadIsEmptyError* = object of PublishingError
DuplicateMessageError* = object of PublishingError
NotSubscribedToTopicError* = object of PublishingError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to be subscribed to publish messages

PayloadIsEmptyError* = object of PublishingError
DuplicateMessageError* = object of PublishingError
NotSubscribedToTopicError* = object of PublishingError
NoPeersToPublishError* = object of PublishingError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is covered by the return value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I choose the way to treat as a failure as your message has not been tried to be relayed at all.
Still there is the possiblity it won't be relayed due to various reasons. That uncertainty is another story to tackle within waku.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not failure from the point of view of pubsub - the message has been put in the pubsub message cache and when a peer connects it will eventually get broadcast. Peers come and go and the system as a whole is designed to handle this particular case - any protocol using pubsub (such as lightpush) must be robust according to these semantics, ie handle them gracefully.

DuplicateMessageError* = object of PublishingError
NotSubscribedToTopicError* = object of PublishingError
NoPeersToPublishError* = object of PublishingError
GeneratingMessageIdError* = object of PublishingError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a programming error - you should not be attempting to publish messages that are invalid by your own message id generation function - the default id generation function cannot fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I used a dumb approach not questioning the valid occurrence of an error just followed the code where it can return log and error and return 0.
I think it is there as message id generator can be replaced and there is no guarantee not getting error from it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a buggy or badly-specified message ID function, then.

p.observers = observers

proc hasObservers*(p: PubSubPeer): bool =
p.observers != nil and anyIt(p.observers[], it != nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it allowed to set nil observers within the list? It creates checking for this basically logic-error condition throughout the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Obviously not as it's being initialize with the pubsub instance.
It just had been there from loooong time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why observers needs to be ref?

): Future[int] {.async: (raises: [LPError]), public.} =
try:
return await p.doPublish(topic, data)
except LPError as ex:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general smaller (not-major-design-point), Nim's code generation can/does sometimes create more efficient code when the as ex part of this is excluded, as it simply never materializes the exception as a variable.

So if it's truly not used, better not to create it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely.

@NagyZoltanPeter
Copy link
Contributor Author

@arnetheduck, @tersec : First of all thank you guys looking after it, I really appreciate all and any comments.
Will answer where I can.

The initial story behind is waku lightpush as a non-relay protocol access to the relay network needed some more detailed responses towards client side (aka edge node).
As since last year waku team has been heavily working on message delivery reliability features, this part should allow client side (SDK) to decide upon actions in case a message is not delivered. for some reason. We found it is not enough to say you have got zero peers your message delivered.
Unfortunately, many errors are hidden behind the curtains of the pub-sub interface—even those relating to configuration or client-side errors with the message itself.

There were two restrictions I put first - not to change the interface (...as I checked other libp2p implementations having exact same publish interface) and try not break other clients.
First hand with libp2p v1.8.0 as Richard added raises pragma it turned out that it is possible the publish can raise exception, this gave the option to propagate errors via exceptions. But I also found no-one expects exceptions by using publish(...) ... so it can easily break other users of it.
Hence the try separate publish and publishEx - for more comprehensive use.
(Also to notice I planned other performance enablers for waku on publishEx interface - but not in this PR)

@NagyZoltanPeter
Copy link
Contributor Author

NagyZoltanPeter commented Feb 6, 2025

cc: @arnetheduck , @Ivansete-status, @richard-ramos:
I would appreciate a special review on the last commit:
aa5651b

This tries to utilize a polymorphic approach to separate the interface from internal implementation down the inheritance chain.

But. It turns to a seg-fault while running specific pubsub tests.
Like:
nim c -d:debug --styleCheck:usages --styleCheck:error --verbosity:0 --hints:off --skipUserCfg -f --threads:on --opt:speed -d:libp2p_pubsub_sign=true -d:libp2p_pubsub_verify=true -r tests/pubsub/testgossipsub

running it result an immediate segfault right at start without any stack trace. I may overlooked something however I tried out this solution in a dummy code, where it worked. So I may overlook something more complicated initialization within pubsub/floodsub/gossipsub.

Thank you.

update: I found the issue that caused an infinite recursion... my bad.

@tersec
Copy link
Contributor

tersec commented Feb 6, 2025

In general, #950 discusses this as well.

@arnetheduck
Copy link
Contributor

many errors are hidden behind the curtains of the pub-sub interface

I think the content of this PR may reflect some conceptual misunderstandings with respect to what publish does as specified by libp2p - in particular the points about publishing various forms of invalid messages which suggest bugs in waku (if they are happening) as well as publishing things already in the seen cache which conceptually is not an error.

I see two outcomes from a PR like this:

  • the conceptual points should be explained in the function documentation, for example preconditions like "the message id function must be able to generate a message id for the message you're trying to publish" and semantic points of note ("a message with an id that has already been observed recently on the network will not be republished" and "you do not have to be subscribed to a topic to publish there").
  • the return value does indeed not distinguish between "no peers in mesh" and "message seen already" - this distinction could be worthy of a change / extension to the API because it indeed may represent an interesting point to make
    • as an alternative to changing the API, we can change the return value in the case that the message is in "seen": instead of returning 0, we can return the number of peers in the mesh as an indication of how many peers "reasonably" know about the message (msg in seen indicates that all your neighbours reasonably can be assumed to have observed the message, just like after sending it to them).

@richard-ramos
Copy link
Member

@arnetheduck: the original sin here is that publish allows raising errors at all - publish is a "broadcast" operation, ie "fire-and-forget" - its nature is such that it cannot fail by definition and thus shouldn't be raising anything at all

There's currently an scenario in which publish can raise an LPError, if anonimize is false and we enter this branch of code:

if sign:
raise (ref LPError)(msg: "Cannot sign message without peer info")
This can happen if passing a nil peerInfo when calling newSwitch. I identified this while adding the raises annotations in the pubsub module.

If SwitchBuilder is used, this scenario will not happen, because build will populate the peerInfo. So maybe we can instead of raising the LPError, to doAssert(false, "Cannot sign message without peer info"), as running into this is likely a misconfiguration?

@arnetheduck
Copy link
Contributor

as running into this is likely a misconfiguration?

yeah - that's probably reasonable - or we can log it, or we can maybe check for it during startup ..

The risk we want to avoid is that libp2p crashes in potentially "uncommon" scenarios when it already looks like it's "up and running" - if we add an assert here, we should also add a "regular" error somewhere else so that it (ideally) becomes hard or impossible to reach the point of trying to publish messages like this.

@arnetheduck
Copy link
Contributor

ie the intent in past discussions about publish was that it should never raise anything - that you discovered a raise is testament to the efficacy of adding raises annotations to enforce things ;)

Copy link
Contributor

@Ivansete-status Ivansete-status left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for it! 🙌
Some comments so far

g: GossipSub, topic: string, data: seq[byte]
): Future[int] {.async: (raises: [LPError]).} =
logScope:
topic

if topic.len <= 0: # data could be 0/empty
debug "Empty topic, skipping publish"
return 0
raise newException(NoTopicSpecifiedError, "Empty topic, skipping publish")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using Result instead of exceptions. Applies elsewhere.
I also think we don't need new error types.

I'd rather go for something much simpler like:

Future[Result[int, string]]

I don't care the type of error. To me, it is only interesting that an error happened and I can get the detail directly from the underlying string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we had this debate and we can argue about it different ways.
It is clear the Future is designed to be able to handle different failure cases.
One is that is capable to catch exceptions and report it to the listener on the other end as error.
This is because async task might not get waited where it has been launched. So exception handling might detach.
If you do await it, it will be raised so you need to catch it, while if you just let it go you will be able to examine fail state of the future when it completes.

Such a solution implies that a future might fail in two different ways. If there is an exception. than f.failed() will be true. While if future is ok you still need to check result for isErr?... I'm not sure how it's gonna make code more clean in general.

For this particular case, of course, if we decide that publish should never raise an exception, that is ok to sign somehow that something went wrong you need to deal with it.

p.observers = observers

proc hasObservers*(p: PubSubPeer): bool =
p.observers != nil and anyIt(p.observers[], it != nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why observers needs to be ref?

var mm = msg
# trigger send hooks
p.sendObservers(mm)
p.beforeSendObservers(mm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit/need of having beforeSendObservers? Wouldn't it suffice by just having the sendObservers moved after the actual sendEncoded call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it all comes from the original implementation where sendObservers were able to modify the messages before sending - peers by peer. Thus we went to slow path even if we did not intended to change anything internal to messages - I can imagine some use cases for some specific use of libp2p but not waku obviously.
So to keep the original option to influence messages so the sendObserver has became two. So user has the freedom to choose, and will definitely allow fast path sending if afterSentObserver is used.

@@ -495,6 +510,8 @@ proc send*(
trace "sending msg to peer", peer = p, rpcMsg = shortLog(msg)
asyncSpawn p.sendEncoded(encoded, isHighPriority)

p.afterSentObservers(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we can ensure 100% that the msg has been sent at this point. Notice that the sendEncoded spawns an async task.

Maybe better to invoke the afterSentObservers(msg) inside sendEncoded?

@richard-ramos
Copy link
Member

@NagyZoltanPeter in #1244 I removed the exception being raised in publish as it can't possible be raised unless one of us nim-libp2p devs introduce a bug :)

@NagyZoltanPeter
Copy link
Contributor Author

@NagyZoltanPeter in #1244 I removed the exception being raised in publish as it can't possible be raised unless one of us nim-libp2p devs introduce a bug :)

Yeah, thanks! I will revert my solution to the original design that details errors in Result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: new
Development

Successfully merging this pull request may close these issues.

5 participants