-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reapply 8644 on 9260 #9313
base: yy-beat-itest-optimize
Are you sure you want to change the base?
Reapply 8644 on 9260 #9313
Conversation
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
81e4188
to
9452cf8
Compare
Looks like a postgres itest failed. Looking into it. Doesn't appear to be related to the btcwallet deadlock fixed recently, so that's good! |
It looks like the shutdown came too early and the sweeper in This seems to happen rarely enough that I can't seem to reproduce it locally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good! I think we just need some comments and perhaps some direct coverage for the batch
package, then it's good to go!
// sqldb retry and still re-execute the | ||
// failing request individually. | ||
dbErr := sqldb.MapSQLError(err) | ||
if !sqldb.IsSerializationError(dbErr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It'd be nice to cover batch with a simple unit test to make sure the serialization errors are correctly handled and we don't regress later.
Makefile
Outdated
# each can run concurrently. Note that many of the settings here are | ||
# specifically for integration testing and are not fit for running | ||
# production nodes. | ||
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres -p 6432:5432 -d postgres:13-alpine -N 1500 -c max_pred_locks_per_transaction=1024 -c max_locks_per_transaction=128 -c jit=off -c work_mem=8MB -c checkpoint_timeout=10min -c enable_seqscan=off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does settings work_mem=8MB
and jit=off
add to the test case stability? If yes could you please add some comments why these were changed (along with other params)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got those settings from @djkazic's suggestions. I think it's likely they're unnecessary for the itests because the databases they're working with are pretty small. Will try running without them to see how it goes, and add comments for the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, for context those are the settings I'm using in my production postgres backed LND. DB size is ~5GB. jit=off
should help in all scenarios as JIT'ing mostly benefits long-running queries whereas postgres_kvdb tends to spawn many small and fast-executing queries. work_mem=8MB
on the other hand benefits larger DBs as it gives postgres more breathing room for each individual query to do sorts etc.
9452cf8
to
a94aa1c
Compare
Updated to address comments, add release notes, and do a CI run. I'm still running this locally to see if the new DB settings give me any trouble. |
Got a postgres unit test failure. Looking into it... Error is here. Looks similar to errors (chain notification-related) that happened on both postgres and sqlite unit test runs on #9260, which doesn't have parallel DB transactions re-enabled. I believe this is actually related to a flake in chainntnfs where a certain series of events makes the notifier attempt to send one more event than the (unread) channel has room for. The easiest fix is to add 2 to the channel allocation here, which has eliminated similar errors from all of our (Lightspark) CI runs in our private fork. This is in lieu of actually tracking down the flake further and fixing it "properly." I'll push it as soon as this CI run is complete. The updated DB settings seem to be working OK locally on my machine as well. |
Thanks a lot for digging into the unit test flake! That certainly was an easy way to fix and was originally used in #9258. Then I realized there's another place we need the dedup check which was fixed in 4632044. Since this PR is based on #9260 and that PR has this commit in its upstream, I'd expect it to be fixed properly there but it seems not? The other related fix is #9309 which fixed another flake. Moving forward, I think we can focus on getting the postgres-related merged, then open new PRs to fix the unit test flakes. |
I'm off until Monday but will remove the latest commit then. Thanks! |
a94aa1c
to
fb50cec
Compare
I tried to take off the latest commit and force-push but GitHub got stuck on |
The only test that failed looks irrelevant to the DB (protofsm unit-race), woohoo! We did get some coverage reductions, but the workflow isn't tracking coverage with itests that use the postgres backend. Maybe we should turn it on? |
5811026
to
a0cefe4
Compare
Could you do a rebase on |
Yep, I'll rebase this one, and then backport #9242/rebase it on current master. Thanks! |
fb50cec
to
0755e15
Compare
I've rebased this PR. Once CI runs, I'll backport/rebase #9242 to get it ready for merge. Thanks! |
The cross compilation test failed due to lack of disk space, which is also happening on #9260 CI runs. Looking into the postgres itest that failed. |
In the itest that failed, it looks like maybe the funding manager is deadlocked. Going to run flakehunter locally to see if I can reproduce and get a goroutine dump from the node that deadlocked (Dave in this case, I think). Looking at just the logs without a goroutine dump, it looks like the underlying deadlock might be in |
Also breifly looked at the logs, think the below 90s is weird, From Bob's log,
From Carol's log,
From Eve's log,
Looks like they are all blocked by |
5974d3e
to
88d6c4a
Compare
These are really similar symptoms to when we had the |
0755e15
to
3366773
Compare
No postgres itest errors this run, only linter complaints about line length in the temporary debug commit. I'm fixing the lint errors and pushing again to try to get an itest failure. |
3366773
to
eca7d32
Compare
I think the |
This time, the error looks to be due to the fact that the payment attempts were sent while the channel between Carol and Dave was still re-establishing, causing an insufficient balance/no route error because there was not yet an active channel between the nodes. |
d716d2e
to
df5d749
Compare
This reverts commit 67419a7.
To make this itest work reliably with multiple parallel SQL transactions, we need to count both the settle and final HTLC events. Otherwise, sometimes the final events from earlier forwards are counted before the forward events from later forwards, causing a miscount of the settle events. If we expect both the settle and final event for each forward, we don't miscount.
0d521f7
to
8acd2d5
Compare
Added handling for |
8acd2d5
to
307e059
Compare
Rebase of #9242 on #9260. Now includes btcsuite/btcwallet#967.