Fix nagle, other minor optimizations #8222

Open: rustyrussell wants to merge 17 commits into master
Conversation

rustyrussell (Contributor)

@rustyrussell rustyrussell commented Apr 9, 2025

I noticed we were not going as fast as we should in a 100-payment test. It turns out our Nagle-suppressing code was wrong!

Then I continued with minor optimizations until our profile basically contained only sqlite operations, resulting in an 8% speedup in bouncing payments (when eatmydata was used).

(I dropped the patch which speeds up CI using eatmydata, as it broke too much!)

@rustyrussell rustyrussell added this to the v25.05 milestone Apr 9, 2025
@rustyrussell rustyrussell changed the title Fix nagle, other minor optimizations Fix nagle, other minor optimizations, speed CI Apr 29, 2025
@rustyrussell rustyrussell changed the title Fix nagle, other minor optimizations, speed CI Fix nagle, other minor optimizations May 1, 2025
@rustyrussell rustyrussell force-pushed the guilt/fix-tcp-cork branch 8 times, most recently from aca755f to f9feb66 Compare May 6, 2025 04:28
Once we speed things up, we can *receive* commitment_signed before we disconnect
due to sending ours: in this case, we *will* have a scratch tx.

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the guilt/fix-tcp-cork branch from f9feb66 to 5ce79fb Compare May 6, 2025 05:29
Once we speed things up, this can receive COMMITMENT_SIGNED, and hence succeed.  Let's
drop on receive, not send!

```
2025-05-06T06:07:51.5997333Z         # Peers try RBF, break on initial COMMITMENT_SIGNED
2025-05-06T06:07:51.5997855Z         bump = l1.rpc.openchannel_bump(chan_id, chan_amount, initpsbt['psbt'])
2025-05-06T06:07:51.5998214Z >       with pytest.raises(RpcError):
2025-05-06T06:07:51.5998542Z E       Failed: DID NOT RAISE <class 'pyln.client.lightning.RpcError'>
```

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the guilt/fix-tcp-cork branch 2 times, most recently from ddf0fb1 to 82a2375 Compare May 6, 2025 06:56
Presumably we want to wait for all htlcs to be resolved here.  With the speedup,
this failed under Postgres:

```
>       wait_for(lambda: only_one(l2.rpc.listpeerchannels(l3.info['id'])['channels'])['htlcs'] != [])

tests/test_closing.py:4191:
```

Signed-off-by: Rusty Russell <[email protected]>
Our CORK logic was wrong, and it's better to use Nagle anyway:
we disable Nagle just before sending timing-critical messages.

Time for 100 (failed) payments:

Before:
	148.8573575

After:
	10.7356977

Note this revealed a timing problem in test_reject_invalid_payload: we would
miss the cause of the sendonion failure, and waitsendpay would be called
*after* it had failed, so it simply returned "Payment failure reason unknown".

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: Protocol: Removed 200ms latency from sending commit/revoke messages.
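The approach described above, leaving Nagle enabled for bulk traffic and setting TCP_NODELAY only around latency-critical writes, can be sketched generically. The helper name `flush_now` is illustrative, not Core Lightning's actual API:

```c
/* Sketch: Nagle stays on for bulk traffic; TCP_NODELAY is toggled
 * around a single latency-critical write.  flush_now() is a
 * hypothetical helper, not Core Lightning's actual API. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

static ssize_t flush_now(int fd, const void *msg, size_t len)
{
	int on = 1, off = 0;
	ssize_t ret;

	/* Disable Nagle: this write must go out immediately,
	 * not wait for an ACK or a full segment. */
	setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
	ret = write(fd, msg, len);
	/* Restore Nagle so subsequent bulk writes can coalesce. */
	setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off));
	return ret;
}
```

On Linux, enabling TCP_NODELAY also pushes out any data the Nagle timer was holding back, which is why toggling it just before the commit/revoke messages removes the latency.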
To run, use:
	VALGRIND=0 TEST_BENCH=1 eatmydata pytest tests/test_connection.py::test_bench

Example of running on my laptop (without --enable-debugbuild, over 10 runs):

	FAILED tests/test_connection.py::test_bench - assert 47.403406-49.191765(48.343+/-0.48) == 0

With eatmydata:

	FAILED tests/test_connection.py::test_bench - assert 25.867860-27.549878(26.5762+/-0.5) == 0

You can also run perf on l1 once it's running:

	perf record --call-graph dwarf -q -p $(cat /tmp/ltests-*/test_bench_1/lightning-1/lightningd-regtest.pid)

Then ^C after 10 seconds and run "perf report".

Things which stood out:

1. Tracing in db_exec_prepared_v2 (fixed in another PR)

   31.12%     0.04%  lightningd  lightningd            [.] db_exec_prepared_v2
   - 31.08% db_exec_prepared_v2
      + 22.96% db_sqlite3_exec
      + 4.46% trace_span_end
      + 1.77% trace_span_start
      + 1.11% trace_span_tag
      + 0.72% tal_free

2. Logging:

   - 16.03% logv
      - 8.15% maybe_print
         - log_to_files
            + 4.51% __GI__IO_fflush (inlined)
            + 1.97% tal_fmt_
            + 0.51% __GI___strftime_l (inlined)

3. Notification (when nothing is listening) in notify_log:

      - 6.84% maybe_notify_log
         - notify_log
            + 3.37% notify_send
            + 1.75% notify_start
            + 1.71% log_notification_serialize
        0.56% new_log_entry

Signed-off-by: Rusty Russell <[email protected]>
…hecking.

Before:
92.26076173782349
91.9576666355133
90.92732524871826

After:
90.5989830493927
88.10309219360352
90.07689118385315
If nobody is subscribed, have notify_start return NULL and the caller
can skip serialization.  This is particularly useful for the "log"
notification which can get called a lot.

Signed-off-by: Rusty Russell <[email protected]>
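A minimal sketch of this pattern, using hypothetical names and signatures (the real notify_start differs in detail): returning NULL early lets the caller skip building a notification nobody will receive.

```c
/* Sketch with hypothetical names: notify_start() returns NULL when
 * no plugin subscribes to the topic, so callers skip serialization
 * entirely.  The real Core Lightning API differs in detail. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct notification {
	char buf[256];
	size_t len;
};

static bool log_subscribers;	/* set when any plugin subscribes to "log" */

static struct notification *notify_start(struct notification *n)
{
	if (!log_subscribers)
		return NULL;	/* fast path: nobody is listening */
	n->len = 0;
	return n;
}

static void maybe_notify_log(const char *msg)
{
	struct notification storage;
	struct notification *n = notify_start(&storage);

	if (!n)
		return;		/* no serialization work at all */
	n->len = snprintf(n->buf, sizeof(n->buf), "{\"log\":\"%s\"}", msg);
	/* ...delivery to subscribed plugins would happen here... */
}
```

This matters most for high-frequency notifications like "log", where the serialization cost is paid on every log line otherwise.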
This means we don't have to iterate through all plugins, making
our "do we even have to construct this notification" optimization
much more efficient.

Signed-off-by: Rusty Russell <[email protected]>
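One way to avoid the plugin iteration, sketched here with hypothetical names, is a per-topic subscriber count kept in sync at plugin (un)registration, making the "is anyone listening?" check O(1):

```c
/* Sketch with hypothetical names: a per-topic subscriber count,
 * updated when plugins register or unregister, replaces a walk
 * over every plugin's subscription list. */
#include <assert.h>
#include <stdbool.h>

enum notify_topic { TOPIC_LOG, TOPIC_FORWARD_EVENT, NUM_TOPICS };

static unsigned subscriber_count[NUM_TOPICS];

static void plugin_subscribe(enum notify_topic t)
{
	subscriber_count[t]++;
}

static void plugin_unsubscribe(enum notify_topic t)
{
	assert(subscriber_count[t] > 0);
	subscriber_count[t]--;
}

/* O(1): the check the previous commit added now costs almost nothing. */
static bool topic_has_subscribers(enum notify_topic t)
{
	return subscriber_count[t] != 0;
}
```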
This speeds logging slightly, but most of the time is actually
spent in fflush() (which we need).

Result (10 runs, eatmydata):
	FAILED tests/test_connection.py::test_bench - assert 24.086638-25.347475(24.6901+/-0.4) == 0

This is an 8% performance improvement (over the entire series), which is not bad.

Signed-off-by: Rusty Russell <[email protected]>
If there are pending channel announcements, they'll look like a leak unless we scan into
the maps.

```
lightningd-2 2025-05-01T07:27:03.922Z **BROKEN** gossipd: MEMLEAK: 0x60d000000478
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:   label=gossipd/gossmap_manage.c:595:struct pending_cannounce
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:   alloc:
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/tal/tal.c:488 (tal_alloc_)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossmap_manage.c:595 (gossmap_manage_channel_announcement)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossipd.c:205 (handle_recv_gossip)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossipd.c:300 (connectd_req)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/common/daemon_conn.c:35 (handle_read)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:60 (next_plan)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:422 (do_plan)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:439 (io_ready)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/poll.c:455 (io_loop)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossipd.c:660 (main)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main)
lightningd-2 2025-05-01T07:27:03.924Z **BROKEN** gossipd:     ../csu/libc-start.c:392 (__libc_start_main_impl)
lightningd-2 2025-05-01T07:27:03.924Z **BROKEN** gossipd:   parents:
lightningd-2 2025-05-01T07:27:03.924Z **BROKEN** gossipd:     gossipd/gossmap_manage.c:475:struct gossmap_manage
```

Signed-off-by: Rusty Russell <[email protected]>
We need to make sure the lease starts at the block we expect:

```
        # Note that l3 has the whole lease delay (minus blocks already mined)
        _, _, l3blocks = l3.wait_for_onchaind_tx('OUR_DELAYED_RETURN_TO_WALLET',
                                                 'OUR_UNILATERAL/DELAYED_OUTPUT_TO_US')
>       assert l3blocks == 4032 - 6 - 2 - 1
E       assert 4022 == (((4032 - 6) - 2) - 1)

tests/test_closing.py:985: AssertionError
```

Signed-off-by: Rusty Russell <[email protected]>
This significantly slows down the cycle when we have flakes (it's good if
the tree is completely broken though!).

Signed-off-by: Rusty Russell <[email protected]>
It's disabled.

Signed-off-by: Rusty Russell <[email protected]>
We don't care about data persistence, and this suppresses all sync calls (speeding
up sqlite3 specifically).

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the guilt/fix-tcp-cork branch from 82a2375 to db68c47 Compare May 6, 2025 09:48