Bug: count in defined activity produces count-1 events (one less than expected) #222


Open
alexzaitsev opened this issue Feb 24, 2025 · 16 comments · May be fixed by #235

@alexzaitsev
Contributor

alexzaitsev commented Feb 24, 2025

Describe the bug

count in a defined activity produces one event fewer than expected. For example, this activity transfers 75 satoshi instead of the expected 100:

        {
            "source": "alice",
            "destination": "grace",
            "interval_secs": 1,
            "amount_msat": 25000,
            "count": 4
        }

To Reproduce

Steps to reproduce the behavior:

  1. Checking balance on grace (receiver): "balance": "2625".
  2. Running sim-cli (with more logs in produce_events, no functionality changed):
2025-02-24T00:44:31.404Z INFO  [sim_cli] Connected to alice - Node ID: 020abed80867adb94d6d48748d8c9b22d64401b458f93d65644d60e4c9a2a361ba.
2025-02-24T00:44:31.427Z INFO  [sim_cli] Connected to judy - Node ID: 0367c0f31a850862bdeb5e6c41d6bda200831022d60841bc79a5bc0b5ed62c9c63.
2025-02-24T00:44:31.451Z INFO  [sim_cli] Connected to carol - Node ID: 03b7a5cafdbd65eb3a05eeb1740fa1773c3a2bd697f03af93e14996aa75b142dcb.
2025-02-24T00:44:31.475Z INFO  [sim_cli] Connected to erin - Node ID: 02f59712b098d1ca2342d6da800e756e3e97bda93e6a59bc5a9fc25ce8d09d0dc5.
2025-02-24T00:44:31.498Z INFO  [sim_cli] Connected to ivan - Node ID: 03029b147528c3efd367985bf6f601d04374d843bf808ac2e21919955b8e5edec2.
2025-02-24T00:44:31.521Z INFO  [sim_cli] Connected to grace - Node ID: 02a6da494f563088e7e498aed58197cd342496710e2b85d1240054f1caffcf1708.
2025-02-24T00:44:31.521Z INFO  [simln_lib] Running the simulation forever.
2025-02-24T00:44:31.636Z INFO  [simln_lib] Simulation is running on regtest.
2025-02-24T00:44:31.636Z INFO  [simln_lib] Simulating 1 activity on 6 nodes.
2025-02-24T00:44:31.636Z INFO  [simln_lib] Summary of results will be reported every 60s.
2025-02-24T00:44:31.636Z INFO  [simln_lib] Starting activity producer for alice(020abe...a361ba): static payment of 25000 to grace(02a6da...cf1708) every 1s.
===current_count=0
===target_count=4
Generated payment: alice(020abe...a361ba) -> grace(02a6da...cf1708): 25000 msat.
send result: Ok(())
===current_count=1
===target_count=4
Generated payment: alice(020abe...a361ba) -> grace(02a6da...cf1708): 25000 msat.
send result: Ok(())
===current_count=2
===target_count=4
Generated payment: alice(020abe...a361ba) -> grace(02a6da...cf1708): 25000 msat.
send result: Ok(())
===current_count=3
===target_count=4
Generated payment: alice(020abe...a361ba) -> grace(02a6da...cf1708): 25000 msat.
send result: Ok(())
===current_count=4
===target_count=4
2025-02-24T00:44:35.645Z INFO  [simln_lib] Payment count has been met for alice(020abe...a361ba): 4 payments. Stopping the activity.
2025-02-24T00:44:35.645Z INFO  [simln_lib] All producers finished. Shutting down.
  3. Checking the balance again: "balance": "2700".
    As we can see, the balance is 2700 but the expected balance is 2725.
  4. Checking the result report:
source,destination,amount_msat,hash,dispatch_time,htlc_count,payment_outcome
020abed80867adb94d6d48748d8c9b22d64401b458f93d65644d60e4c9a2a361ba,02a6da494f563088e7e498aed58197cd342496710e2b85d1240054f1caffcf1708,25000,f1b9a8fe1548fbc98ecf0e5fe01861607f5e26ba47e6992235a141d4db599291,1740358196162,1,Success
020abed80867adb94d6d48748d8c9b22d64401b458f93d65644d60e4c9a2a361ba,02a6da494f563088e7e498aed58197cd342496710e2b85d1240054f1caffcf1708,25000,2579008af17feb6b4cc7af439e1e594739ca42a83c840ac45f5a78c903bb4ca4,1740358197163,1,Success
020abed80867adb94d6d48748d8c9b22d64401b458f93d65644d60e4c9a2a361ba,02a6da494f563088e7e498aed58197cd342496710e2b85d1240054f1caffcf1708,25000,2d20a3cf8f9d0aa519dec58d9dcbe378554c9fa72317fccef0d5221a130ec3ff,1740358198165,1,Success

There are 3 events (4 expected).

Config

{
    "nodes": [
        {
            "id": "alice",
            "address": "127.0.0.1:10001",
            "cert": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/alice/tls.cert",
            "macaroon": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/alice/data/chain/bitcoin/regtest/admin.macaroon"
        },
        {
            "id": "judy",
            "address": "127.0.0.1:10010",
            "cert": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/judy/tls.cert",
            "macaroon": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/judy/data/chain/bitcoin/regtest/admin.macaroon"
        },
        {
            "id": "carol",
            "address": "127.0.0.1:10003",
            "cert": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/carol/tls.cert",
            "macaroon": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/carol/data/chain/bitcoin/regtest/admin.macaroon"
        },
        {
            "id": "erin",
            "address": "127.0.0.1:10005",
            "cert": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/erin/tls.cert",
            "macaroon": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/erin/data/chain/bitcoin/regtest/admin.macaroon"
        },
        {
            "id": "ivan",
            "address": "127.0.0.1:10009",
            "cert": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/ivan/tls.cert",
            "macaroon": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/ivan/data/chain/bitcoin/regtest/admin.macaroon"
        },
        {
            "id": "grace",
            "address": "127.0.0.1:10007",
            "cert": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/grace/tls.cert",
            "macaroon": "/Users/p3ngu1n/.polar/networks/1/volumes/lnd/grace/data/chain/bitcoin/regtest/admin.macaroon"
        }
    ],
    "activity": [
        {
            "source": "alice",
            "destination": "grace",
            "interval_secs": 1,
            "amount_msat": 25000,
            "count": 4
        }
    ]
}

I inspected the produce_events function but didn't see anything unusual.

@alexzaitsev alexzaitsev added the bug Something isn't working label Feb 24, 2025
@alexzaitsev alexzaitsev changed the title count in defined activity produces count-1 events (one less than expected) Bug: count in defined activity produces count-1 events (one less than expected) Feb 27, 2025
@carlaKC
Contributor

carlaKC commented Mar 3, 2025

@bjohnson5 do you have time to take a look at this?

@bjohnson5
Collaborator

@bjohnson5 do you have time to take a look at this?

Sure, I think I have also noticed this before. I will take a look.

@bjohnson5 bjohnson5 self-assigned this Mar 3, 2025
@bjohnson5
Collaborator

I think this is happening because the produce_events function returns (and therefore simln shuts down) immediately when the payment count is met. This causes the consume_events function to shut down before it processes the last payment event.

In theory, this could also happen with the simulation time feature. The producer could send an event, then the time limit could hit and cause a shutdown before the consumer processes the event.

NOTE: If you have more than one payment activity defined, only the last running activity loses one payment. This further confirms that it is a race condition between the shutdown trigger and the events consumer.
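
Here is a minimal standalone sketch of the race (not the sim-ln code; it only assumes the tokio and triggered crates as dependencies): the producer fires the shutdown trigger right after its last send, and a consumer that select!s between the shutdown listener and its receiver can see the trigger win before it has read the buffered event.

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(16);
    let (trigger, listener) = triggered::trigger();

    // Producer: send `count` events, then trigger shutdown straight away.
    let producer = tokio::spawn(async move {
        for i in 0..4u32 {
            tx.send(i).await.unwrap();
        }
        trigger.trigger(); // shutdown fires while the last event may still be queued
    });

    // Consumer: races the shutdown listener against the receiver.
    let consumer = tokio::spawn(async move {
        let mut processed = 0u32;
        loop {
            tokio::select! {
                _ = listener.clone() => break, // can win before the final recv
                maybe_event = rx.recv() => match maybe_event {
                    Some(_) => processed += 1,
                    None => break,
                },
            }
        }
        println!("processed {processed} of 4 events"); // sometimes prints 3
    });

    producer.await.unwrap();
    consumer.await.unwrap();
}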

@carlaKC Any thoughts on an approach to avoid this situation? Should the consumers make sure to empty their queue of events before shutting down?

@carlaKC
Contributor

carlaKC commented Mar 4, 2025

Should the consumers make sure to empty their queue of events before shutting down?

This sounds like a good approach, I believe it's best practice for channels to drain the receiver anyway so worth looking into!
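
Roughly the shape I have in mind, as a sketch only (placeholder names, not a drop-in for consume_events): when the shutdown listener fires, drain whatever is already buffered before returning instead of breaking straight out of the select!.

async fn consume_until_drained(
    mut rx: tokio::sync::mpsc::Receiver<u32>,
    listener: triggered::Listener,
) {
    loop {
        tokio::select! {
            _ = listener.clone() => {
                // Shutdown requested: drain anything already buffered first.
                while let Ok(event) = rx.try_recv() {
                    handle(event).await;
                }
                return;
            }
            maybe_event = rx.recv() => match maybe_event {
                Some(event) => handle(event).await,
                None => return, // all senders dropped
            },
        }
    }
}

// Placeholder for whatever the consumer actually does with an event.
async fn handle(event: u32) {
    println!("dispatching event {event}");
}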

@bjohnson5
Collaborator

Should the consumers make sure to empty their queue of events before shutting down?

This sounds like a good approach, I believe it's best practice for channels to drain the receiver anyway so worth looking into!

So it looks like it is slightly more complicated than just draining all the receivers because of how the different channels are chained together. consume_events receives events from produce_events and then sends the outcome to produce_simulation_results.

produce_events -> consume_events -> produce_simulation_results -> consume_simulation_results

If we make consume_events, produce_simulation_results, and consume_simulation_results drain their receivers on shutdown... consume_events will process the last event but produce_simulation_results and consume_simulation_results will not have any pending events (because consume_events is still processing the last event) and will therefore still shutdown.

I think we might need a way to track "in progress" payments and make sure to not shutdown until all "in progress" payments have been completed and recorded in the simulation results.

@carlaKC
Contributor

carlaKC commented Mar 6, 2025

If we make consume_events, produce_simulation_results, and consume_simulation_results drain their receivers on shutdown... consume_events will process the last event but produce_simulation_results and consume_simulation_results will not have any pending events (because consume_events is still processing the last event) and will therefore still shutdown

Hm interesting. I think that the fundamental issue here is the combination of using triggered and channels at the same time. The correct way of doing things would be to rely on channel semantics (error on close) so that each of the consumers exits only once its channel has been closed and drained. We have the combination because we were rushing when we built this out in a hackathon.

I think that something like this could work:

  • All tasks have a trigger to signal when they get an unexpected error
  • Only produce_events consumes on listen, and exits when it finished with its count or an unexpected error occurs elsewhere
  • When produce_events exits, all senders are dropped, so the consume_events receiver should error out (once it's been drained, receiving on a closed channel will fail) and the impact will cascade down the tasks, each only exiting once it has fully completed.

This can certainly get hairy (which is why we cheated and added triggered everywhere originally), so a change would need to come with some tests to make sure we're shutting down properly. One issue that we could hit is that payments could take a while to complete (discussed here). Contrary to my opinion expressed in that issue, I think we can proceed with this approach and assume that payments will resolve fairly quickly and if it becomes an issue we can add a --exit-on-error option that doesn't wait and just shuts down immediately.

I think we might need a way to track "in progress" payments and make sure to not shutdown until all "in progress" payments have been completed and recorded in the simulation results.

I'm hesitant to introduce global state in a channel-oriented environment - usually a recipe for deadlock soup.

Thanks for the report @alexzaitsev! Good to be reminded of these larger architectural improvements that we can make.
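
To make the cascade concrete, here is a small sketch of the shape I'm describing (illustrative names and tokio only, not the real internal_run wiring): no receiving task looks at a shutdown listener; each stage loops until recv() returns None, which only happens once the upstream sender has been dropped, so completion flows down the chain without losing the final event.

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (ev_tx, mut ev_rx) = mpsc::channel::<u32>(16);
    let (res_tx, mut res_rx) = mpsc::channel::<u32>(16);

    // produce_events-like task: exits after its count, dropping ev_tx.
    let producer = tokio::spawn(async move {
        for i in 0..4u32 {
            ev_tx.send(i).await.unwrap();
        }
        // ev_tx dropped here: closes the events channel.
    });

    // consume_events-like task: runs until the channel is closed and drained.
    let consumer = tokio::spawn(async move {
        while let Some(event) = ev_rx.recv().await {
            res_tx.send(event).await.unwrap(); // forward an "outcome"
        }
        // res_tx dropped here: closes the results channel.
    });

    // consume_simulation_results-like task: same pattern one level down.
    let results = tokio::spawn(async move {
        let mut n = 0u32;
        while let Some(_outcome) = res_rx.recv().await {
            n += 1;
        }
        println!("recorded {n} results"); // prints 4, nothing is lost
    });

    producer.await.unwrap();
    consumer.await.unwrap();
    results.await.unwrap();
}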

@bjohnson5
Collaborator

If we make consume_events, produce_simulation_results, and consume_simulation_results drain their receivers on shutdown... consume_events will process the last event but produce_simulation_results and consume_simulation_results will not have any pending events (because consume_events is still processing the last event) and will therefore still shutdown

Hm interesting. I think that the fundamental issue here is the combination of using triggered and channels at the same time. The correct way of doing things would be to rely on channel semantics (error on close) so that each of the consumers exits only once its channel has been closed and drained. We have the combination because we were rushing when we built this out in a hackathon.

I think that something like this could work:

  • All tasks have a trigger to signal when they get an unexpected error
  • Only produce_events consumes on listen, and exits when it finished with its count or an unexpected error occurs elsewhere
  • When produce_events exits, all senders are dropped, so the consume_events receiver should error out (once it's been drained, receiving on a closed channel will fail) and the impact will cascade down the tasks, each only exiting once it has fully completed.

This can certainly get hairy (which is why we cheated and added triggered everywhere originally), so a change would need to come with some tests to make sure we're shutting down properly. One issue that we could hit is that payments could take a while to complete (discussed here). Contrary to my opinion expressed in that issue, I think we can proceed with this approach and assume that payments will resolve fairly quickly and if it becomes an issue we can add a --exit-on-error option that doesn't wait and just shuts down immediately.

I think we might need a way to track "in progress" payments and make sure to not shutdown until all "in progress" payments have been completed and recorded in the simulation results.

I'm hesitant to introduce global state in a channel-oriented environment - usually a recipe for deadlock soup.

Thanks for the report @alexzaitsev! Good to be reminded of these larger architectural improvements that we can make.

Ok yes, this makes sense. It will require some careful changes though. I will start working through it and report back!

@bjohnson5
Collaborator

@carlaKC I have been thinking about how we could handle the situation that you brought up about payments that take a long time to complete... The --exit-on-error would work, but it is possible that a user did not know before run time that the payment would get stuck. In that case they have to figure out how to kill the process, deal with any clean up, and then try again with the --exit-on-error option.

What if we just introduce an abort functionality along with a normal shutdown function? The produce_events task would be the only task listening for the shutdown trigger and all other tasks listen for an abort trigger.

If the payment count is met or the time limit is hit, a shutdown is triggered. This causes the produce_events tasks to stop and all the channel receivers in the other tasks get cleanly stopped after draining. However, if at this point we are stuck tracking a long running payment status the user could use ctrl-c to abort and all the tasks would stop.

This essentially keeps the functionality the exact same as it is right now for ctrl-c (kills everything at once, which guarantees the shutdown but not all payments completing), but for a payment count or time limit shutdown it tries to let everything finish using the channel semantics (guaranteeing that all payments complete, and if not the ctrl-c option is there to use as a fail-safe).

@carlaKC @alexzaitsev Let me know what you think of this approach. I put up a draft PR (#230) but I still need to test and clean it up.

bjohnson5 added a commit to bjohnson5/sim-ln that referenced this issue Mar 17, 2025
bjohnson5 added a commit to bjohnson5/sim-ln that referenced this issue Mar 17, 2025
@carlaKC
Contributor

carlaKC commented Mar 17, 2025

The --exit-on-error would work, but it is possible that a user did not know before run time that the payment would get stuck. In that case they have to figure out how to kill the process, deal with any clean up, and then try again with the --exit-on-error option.

Good point.

What if we just introduce an abort functionality along with a normal shutdown function? The produce_events task would be the only task listening for the shutdown trigger and all other tasks listen for an abort trigger.

I'm hesitant to add a second trigger, because our shutdown conditions are quite complicated as-is. If we're going to do a big refactor here, I think it's worth trying to do things "canonically" for rust channels.

Only produce_events consumes on listen, and exits when it finished with its count or an unexpected error occurs elsewhere

My statement here is wrong, I think. We need to listen for our shutdown everywhere we have a Sender channel type and we're not also consuming from a Receiver. Intuitively this makes sense to me, since that's the "top" of our channel pipeline.

From a quick scan of the codebase, this means that we need to listen for shutdown on listener in:

  • produce_events
  • track_payment_result

I believe this works in the case of a payment taking a long time, because track_payment_result would just exit on shutdown rather than waiting for track_payment.await. It also seems more aligned with the shutdown advice in the mpsc docs.
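
Sketching what that sender-side listening could look like (hypothetical signature and names, not the real produce_events): the loop races each tick against the shutdown listener, and when it returns its Sender is dropped, so everything downstream drains and exits off the back of the closed channel.

async fn produce(
    tx: tokio::sync::mpsc::Sender<u32>,
    listener: triggered::Listener,
    count: u32,
    interval: std::time::Duration,
) {
    let mut ticker = tokio::time::interval(interval);
    for event in 0..count {
        tokio::select! {
            _ = listener.clone() => return, // an error was signalled elsewhere
            _ = ticker.tick() => {
                if tx.send(event).await.is_err() {
                    return; // receiver gone, nothing left to do
                }
            }
        }
    }
    // tx is dropped on return; downstream receivers see the channel close.
}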

@bjohnson5
Collaborator

The --exit-on-error would work, but it is possible that a user did not know before run time that the payment would get stuck. In that case they have to figure out how to kill the process, deal with any clean up, and then try again with the --exit-on-error option.

Good point.

What if we just introduce an abort functionality along with a normal shutdown function? The produce_events task would be the only task listening for the shutdown trigger and all other tasks listen for an abort trigger.

I'm hesitant to add a second trigger, because our shutdown conditions are quite complicated as-is. If we're going to do a big refactor here, I think it's worth trying to do things "canonically" for rust channels.

Only produce_events consumes on listen, and exits when it finished with its count or an unexpected error occurs elsewhere

My statement here is wrong, I think. We need to listen for our shutdown everywhere we have a Sender channel type and we're not also consuming from a Receiver. Intuitively this makes sense to me, since that's the "top" of our channel pipeline.

From a quick scan of the codebase, this means that we need to listen for shutdown on listener in:

  • produce_events
  • track_payment_result

I believe this works in the case of a payment taking a long time, because track_payment_result would just exit on shutdown rather than waiting for track_payment.await. It also seems more aligned with the shutdown advice in the mpsc docs.

I think the fact that we are listening for shutdown in track_payment_result is one reason the count is always one off. produce_events completes after sending the last payment and then triggers shutdown, so track_payment_result quits immediately and does not process the last payment. However, we could stop triggering the shutdown from produce_events and let the channel semantics close everything down. Then, to manually stop a long-running sim, ctrl-c would trigger the shutdown.
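
For the manual stop, something along these lines (a sketch only, assuming tokio's signal feature and the triggered crate; the real wiring in sim-cli would differ): ctrl-c fires the shared trigger, and only the producer-side work races against it.

use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let (shutdown_trigger, shutdown_listener) = triggered::trigger();

    // Fire the shared trigger when the user presses ctrl-c.
    tokio::spawn(async move {
        if tokio::signal::ctrl_c().await.is_ok() {
            shutdown_trigger.trigger();
        }
    });

    // A producer-style task: runs until its work is done or shutdown fires.
    tokio::select! {
        _ = shutdown_listener => println!("shutdown requested, stopping producer"),
        _ = sleep(Duration::from_secs(3600)) => println!("work finished"),
    }
}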

Is that what you had in mind?

@carlaKC
Contributor

carlaKC commented Mar 20, 2025

I think the fact that we are listening for shutdown in track_payment_result is one reason the count is always one off.

Isn't the problem that the consume_events loop didn't pick it up and dispatch it on the lightning node? Per your comment here.

Rather than quit immediately, track_payment_result can just send a TrackPaymentFailed PaymentOutcome if we trigger the listener and then exit cleanly. This way the rest of the pipeline still gets a result and just reports that it failed, and all of the Sender channels get dropped.

@bjohnson5
Collaborator

I think the fact that we are listening for shutdown in track_payment_result is one reason the count is always one off.

Isn't the problem that the consume_events loop didn't pick it up and dispatch it on the lightning node? Per your comment here.

They are both part of the problem. Even if consume_events picks it up and sends it you also need track_payment_result to stay alive long enough to track the payment and record the result.

Rather than quit immediately, track_payment_result can just send a TrackPaymentFailed PaymentOutcome if we trigger the listener and then exit cleanly. This way the rest of the pipeline still gets a result and just reports that it failed, and all of the Sender channels get dropped.

Yes, but what if the payment that was sent actually was successful (the nodes balances change). Then we would have a results report that shows a failed payment but the underlying nodes would have seen a successful payment.

There are several different cases that all need to be handled:

  1. A shutdown because the payment count was met
  2. A shutdown because the time limit was met
  3. A shutdown caused by ctrlc

In case 1, the producers will stop after the correct number of payments, causing all of the channel receivers to drain and quit. We do still need a way to stop the results logger thread, though.

Case 2 needs a way to shut down the producers, but not the consumers.

Case 3 is the simplest: just exit everything without a guarantee that the last payment will be correct.

@carlaKC
Contributor

carlaKC commented Mar 20, 2025

They are both part of the problem. Even if consume_events picks it up and sends it you also need track_payment_result to stay alive long enough to track the payment and record the result.

Yes ofc, thanks for the walkthrough of options!

Yes, but what if the payment that was sent actually was successful (the nodes balances change). Then we would have a results report that shows a failed payment but the underlying nodes would have seen a successful payment.

Hm yup, that's tricky. What if we add a sane (+ configurable) time limit to track_payment_result?
Eg, once track_payment_result hears a signal from listener it gives the payment 10 seconds to complete and if it doesn't it just gives a ResultNotTracked error instead of waiting for it to finish?

We could probably account for this in our result logging, saying - hey we tracked 4 payments and then the 5th one took too long so we don't know, update this wait parameter if you need to know this.

There are several different cases that all need to be handled:

Putting that all together (mostly for my own good to understand if it works)

  • produce_events exits or gets a shutdown trigger from listener, dropping its sender channel
  • shutdown is triggered because we have no more producers left
  • consume_events drains its receiver then errors out because produce_events(sender) being dropped closes the receiver
  • consume_events drops its sender when it exits
  • produce_simulation_results will read everything from event_receiver and then exit, because consume_events(sender) dropped
  • track_payment_result learns about shutdown from listener and starts a timeout for the payment, sending a ResultNotTracked error if it doesn't hear back in time
  • Once all the track_payment_result tasks and produce_simulation_results have exited, consume_simulation_results will error out because its receiver channel has no senders left

I think this covers all of our cases?
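
A sketch of that timeout step (ResultNotTracked and the grace period are names/values from this thread, not anything in the codebase yet): once the listener fires, the in-flight tracking future gets a bounded window to settle before we give up on recording it.

use std::time::Duration;

enum TrackOutcome {
    Settled,          // the node reported a final payment state in time
    ResultNotTracked, // shutdown fired and the grace period expired first
}

async fn track_with_grace_period(
    tracking: impl std::future::Future<Output = ()>,
    listener: triggered::Listener,
    grace: Duration,
) -> TrackOutcome {
    tokio::pin!(tracking);
    tokio::select! {
        _ = &mut tracking => TrackOutcome::Settled,
        _ = listener => {
            // Shutdown requested: give the payment a bounded amount of time
            // to settle, then report it as untracked rather than blocking.
            match tokio::time::timeout(grace, &mut tracking).await {
                Ok(()) => TrackOutcome::Settled,
                Err(_elapsed) => TrackOutcome::ResultNotTracked,
            }
        }
    }
}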

@bjohnson5
Collaborator

They are both part of the problem. Even if consume_events picks it up and sends it you also need track_payment_result to stay alive long enough to track the payment and record the result.

Yes ofc, thanks for the walkthrough of options!

Yes, but what if the payment that was sent actually was successful (the nodes balances change). Then we would have a results report that shows a failed payment but the underlying nodes would have seen a successful payment.

Hm yup, that's tricky. What if we add a sane (+ configurable) time limit to track_payment_result? Eg, once track_payment_result hears a signal from listener it gives the payment 10 seconds to complete and if it doesn't it just gives a ResultNotTracked error instead of waiting for it to finish?

We could probably account for this in our result logging, saying - hey we tracked 4 payments and then the 5th one took too long so we don't know, update this wait parameter if you need to know this.

There are several different cases that all need to be handled:

Putting that all together (mostly for my own good to understand if it works)

  • produce_events exits or gets a shutdown trigger from listener, dropping its sender channel
  • shutdown is triggered because we have no more producers left
  • consume_events drains its receiver then errors out because produce_events(sender) being dropped closes the receiver
  • consume_events drops its sender when it exits
  • produce_simulation_results will read everything from event_receiver and then exit, because consume_events(sender) dropped
  • track_payment_result learns about shutdown from listener and starts a timeout for the payment, sending a ResultNotTracked error if it doesn't hear back in time
  • Once all the track_payment_result tasks and produce_simulation_results have exited, consume_simulation_results will error out because its receiver channel has no senders left

I think this covers all of our cases?

Yeah, I think that would work! The only question is where the best place to put that timer is. The track_payment_result function does not really do the listening itself; it passes the listener to the cln.rs and lnd.rs track_payment functions. I don't love making each implementation handle that timeout after a shutdown individually.

@carlaKC
Contributor

carlaKC commented Mar 20, 2025

don't love making each implementation handle that timeout after a shutdown individually

No definitely not, but we can make this a select! to avoid that?

One (related) thing that I haven't thought about is how those track_payment calls on LND (or other impl) get cancelled, but I assume that the gRPC context is cancelled when the client is dropped 🤔

@bjohnson5
Collaborator

Ok new PR! I closed the old PR #230 and opened PR #235. This one does the following:

  • Adds a scope block around the section of internal_run that contains the event senders, so that they will be dropped when they go out of scope. This is necessary for all the receivers to drop and exit (a sketch of this pattern is included at the end of this comment).

  • Removed the tokio::select! statement and shutdown listener from consume_simulation_results because this is a channel receiver and will close when all of its corresponding senders are dropped.

  • Removed the tokio::select! statement and shutdown listener from consume_events because this is a channel receiver and will close when all of its corresponding senders are dropped.

  • Removed the tokio::select! statement and shutdown listener from produce_simulation_results because this is a channel receiver and will close when all of its corresponding senders are dropped.

  • Refactored track_payment_result to use a timer to wait a given amount of time for current tracking to complete. This function is also now responsible for shutting down the node implementations that are tracking payments (node.track_payment) with a local trigger/listener.

The new shutdown flow looks like this:

  1. The shutdown is triggered by either A.) The producer tasks finish due to payment count being met, B.) The total time has been met, or C.) The user presses ctrlc
  2. The run_results_logger gets the shutdown signal and returns.
  3. The produce_events function gets the shutdown signal (if it was not the one to trigger it) and returns. This drops its sender.
  4. If simln is in the middle of tracking a payment, the track_payment_result function gets the shutdown signal and starts a timer.
  5. The consume_events function quits its recv (because the produce_events dropped its sender), and then drops the consume_events sender.
  6. The produce_simulation_results quits its recv (because consume_events dropped its sender), and then drops the produce_simulation_results sender.
  7. The consume_simulation_results quits its recv because produce_simulation_results dropped its sender.
  8. If simln is in the middle of tracking a payment, either the timer expires and it shuts down or the payment finishes getting tracked and it shuts down.
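
For reference, here is the scope-block idea from the first bullet reduced to a toy example (illustrative names only, not the real internal_run): confining the original Sender to an inner block means no copy of it outlives the spawn, so recv() returns None as soon as the producing task is done and the consumer exits on its own.

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (mut event_rx, producer) = {
        // Everything that owns a Sender lives inside this block.
        let (event_tx, event_rx) = mpsc::channel::<u32>(16);
        let producer = tokio::spawn(async move {
            for event in 0..4u32 {
                let _ = event_tx.send(event).await;
            }
            // event_tx is dropped when this task finishes.
        });
        (event_rx, producer)
        // No Sender survives past this brace.
    };

    // Consumer: exits naturally once the channel is closed and drained.
    while let Some(event) = event_rx.recv().await {
        println!("dispatching event {event}");
    }
    producer.await.unwrap();
}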
