Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make test_esplora_syncs more robust #2033

Conversation

tnull
Copy link
Contributor

@tnull tnull commented Feb 15, 2023

Fixes #2032.

The test generally works fine and the calls in question should never fail. However, they did break eventually in CI.

I'm guessing that in the case of #2032 the electrsd for some reason hadn't finished launching yet when we tried to to access its interface. Therefore this PR adds some changes that should give the bitcoind/electrsd server instances some more time to fully get setup and generally some more leeway.

@tnull tnull force-pushed the 2023-02-make-esplora-sync-test-more-robust branch from ae0002c to 6d8dff9 Compare February 15, 2023 01:04
@tnull tnull marked this pull request as draft February 15, 2023 20:15
@tnull
Copy link
Contributor Author

tnull commented Feb 15, 2023

Just got another report of a WouldBlock socket error:

thread 'test_esplora_syncs' panicked at 'called `Result::unwrap()` on an `Err` value: JsonRpc(Transport(SocketError(Os { code: 35, kind: WouldBlock, message: "Resource temporarily unavailable" })))', lightning-transaction-sync/tests/integration_tests.rs:58:89

In this case the error source seems to be bitcoincore-rpc. Will investigate how to mitigate these, marked draft for now.

@TheBlueMatt
Copy link
Collaborator

That shounds like an underlying socket implementation is buggy. WouldBlock aka EAGAIN means you should just...go again.

@tnull
Copy link
Contributor Author

tnull commented Feb 15, 2023

That shounds like an underlying socket implementation is buggy. WouldBlock aka EAGAIN means you should just...go again.

Yeah, got similar reports before, but so far the hypothesis was that some process (virus scanner or similar) would have blocked/reset/messed with the RPC calls. While retrying might work for us in this instance (although not fully, as it still means we mine more blocks than necessary), this probably should be reported and fixed somewhere upstream. Will see if I can reliably reproduce the behavior on a really low-resource machine.

@TheBlueMatt
Copy link
Collaborator

TheBlueMatt commented Feb 15, 2023

Right, it should be fixed in whatever is doing the actual socket reads - that should retry on the socket. IIRC read calls are allowed to spuriously EAGAIN/EWOULDBLOCK (which are the same on Linux), so we always have to try again to be standards conformant.

@tnull
Copy link
Contributor Author

tnull commented Feb 15, 2023

Good point. Seems like a pretty common issue in the space. Source is likely rust-jsonrpc, possibly apoelstra/rust-jsonrpc#67, which should be fixed by apoelstra/rust-jsonrpc#72, i.e., with jsonrpc 0.14.0. A PR to update rust-bitcoincore-rpc to 0.14.0 this is on the way, which hopefully eventually arrive in rust-bitcoind. Generally these kind of issues are likely only fixed for good after transitioning away from jsonrpc's SimpleHTTP.

TLDR: might be a while until upstream is fixed. I wonder if I should add another simpler test case somehow, and enable the error prone one for now, if it keeps failing our CI?

@TheBlueMatt
Copy link
Collaborator

Let's wait and see what kind of failure rate we see. So far I've only seen it once so it may not be worth worrying about.

@tnull
Copy link
Contributor Author

tnull commented Feb 15, 2023

Let's wait and see what kind of failure rate we see. So far I've only seen it once so it may not be worth worrying about.

Alright. As we don't use the result anyways and we still wait for enough blocks to arrive, I for now just removed the unwrap() in question.

@tnull tnull marked this pull request as ready for review February 15, 2023 22:05
@valentinewallace
Copy link
Contributor

I'm getting this error at https://github.com/lightningdevkit/rust-lightning/actions/runs/4198096993/jobs/7281278685:

error: failed to run custom build command for `bitcoind v0.28.1`

Caused by:
  process didn't exit successfully: `/home/runner/work/rust-lightning/rust-lightning/target/debug/build/bitcoind-db625c734508c201/build-script-build` (exit status: 101)
  --- stdout
  filename:bitcoin-23.0-x86_64-linux-gnu.tar.gz version:23.0 hash:2cca490c1f2842884a3c5b0606f179f9f937177da4eadd628e3f7fd7e25d26d0
  url:https://bitcoincore.org/bin/bitcoin-core-23.0/bitcoin-23.0-x86_64-linux-gnu.tar.gz

  --- stderr
  thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Transport(Transport { kind: Io, message: None, url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("bitcoincore.org")), port: None, path: "/bin/bitcoin-core-23.0/bitcoin-23.0-x86_64-linux-gnu.tar.gz", query: None, fragment: None }), source: Some(Custom { kind: TimedOut, error: "timed out reading response" }) })', /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/bitcoind-0.28.1/build.rs:106:43

Should this be fixed by this PR? I think it's unrelated to the linked PR

@tnull
Copy link
Contributor Author

tnull commented Feb 18, 2023

I'm getting this error at https://github.com/lightningdevkit/rust-lightning/actions/runs/4198096993/jobs/7281278685:

error: failed to run custom build command for `bitcoind v0.28.1`

Caused by:
  process didn't exit successfully: `/home/runner/work/rust-lightning/rust-lightning/target/debug/build/bitcoind-db625c734508c201/build-script-build` (exit status: 101)
  --- stdout
  filename:bitcoin-23.0-x86_64-linux-gnu.tar.gz version:23.0 hash:2cca490c1f2842884a3c5b0606f179f9f937177da4eadd628e3f7fd7e25d26d0
  url:https://bitcoincore.org/bin/bitcoin-core-23.0/bitcoin-23.0-x86_64-linux-gnu.tar.gz

  --- stderr
  thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Transport(Transport { kind: Io, message: None, url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("bitcoincore.org")), port: None, path: "/bin/bitcoin-core-23.0/bitcoin-23.0-x86_64-linux-gnu.tar.gz", query: None, fragment: None }), source: Some(Custom { kind: TimedOut, error: "timed out reading response" }) })', /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/bitcoind-0.28.1/build.rs:106:43

Should this be fixed by this PR? I think it's unrelated to the linked PR

Oh wow, this is a new one. Looks like this is not coming from on our side, but while building the used bitcoind crate. Will have a closer look and possibly report upstream. This seems to be happening because Github CI counldn't reach bitcoincore.org for some reason.

@tnull
Copy link
Contributor Author

tnull commented Feb 18, 2023

Oh wow, this is a new one. Looks like this is not coming from on our side, but while building the used bitcoind crate. Will have a closer look and possibly report upstream. This seems to be happening because Github CI counldn't reach bitcoincore.org for some reason.

Yeah, seems this is simply an error coming from Github failing to reach bitcoincore.org. I think nothing we or upstream can really do about this, maybe mod a prettier error message. If this happens more often we may want to investigate further why the host isn't reachable, and as a last resort look into hosting our own mirror with better availability. Now opened an issue to suggest mirror-configurability upstream: rust-bitcoin/bitcoind#106

let cur_height = get_bitcoind().client.get_block_count().expect("failed to get current block height");
let address = get_bitcoind().client.get_new_address(Some("test"), Some(AddressType::Legacy)).expect("failed to get new address");
// TODO: expect this Result once the WouldBlock issue is resolved upstream.
let _block_hashes_res = get_bitcoind().client.generate_to_address(num as u64, &address);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the blocks actually being mined here such that not unwrapping the error actually moves things forward? Otherwise, won't the test just fail in wait_for_block instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, from the user reports I got it indeed seems to be the case that bitcoind keeps generating the blocks in the background. So we should pick them up eventually via wait_for_block.

@wpaulino
Copy link
Contributor

Feel free to squash. IMO it doesn't really hurt to merge this since it's just an extra second delay (though hard to tell yet if that's enough), even if we're not seeing failures often.

The test generally works and the parts in question *should* never fail.
However, they did, so we give the test server instances some leeway.
@tnull tnull force-pushed the 2023-02-make-esplora-sync-test-more-robust branch from 1058150 to e878f75 Compare February 24, 2023 02:05
@tnull
Copy link
Contributor Author

tnull commented Feb 24, 2023

Feel free to squash. IMO it doesn't really hurt to merge this since it's just an extra second delay (though hard to tell yet if that's enough), even if we're not seeing failures often.

Squashed without further changes. I think probably the biggest improvement in terms of robustness here is the retrying of failing calls. Also agree regarding the added delays: 1 second is really arbitrarily chosen and could go both ways: it might be to little in some cases, but also likely is an unnecessary delay in most cases. We could also leave them out for now and see if most issues are already addressed by the other fixes?

@TheBlueMatt TheBlueMatt merged commit 596ef3b into lightningdevkit:main Feb 27, 2023
tnull added a commit to tnull/ldk-node that referenced this pull request Mar 24, 2023
`OnceCell` doesn't call `drop`, which makes the spawned
`bitcoind`/`electrsd` instances linger around after our tests have
finished. To fix this, we move them out of `OnceCell` and let every test
that needs them spawn their own instances. This additional let us drop
the `OnceCell` dev dependency.

Additionally, we improve the test robustness by applying most of the
changes from
lightningdevkit/rust-lightning#2033.
tnull added a commit to tnull/ldk-node that referenced this pull request Mar 30, 2023
`OnceCell` doesn't call `drop`, which makes the spawned
`bitcoind`/`electrsd` instances linger around after our tests have
finished. To fix this, we move them out of `OnceCell` and let every test
that needs them spawn their own instances. This additional let us drop
the `OnceCell` dev dependency.

Additionally, we improve the test robustness by applying most of the
changes from
lightningdevkit/rust-lightning#2033.
tnull added a commit to tnull/ldk-node that referenced this pull request Apr 7, 2023
`OnceCell` doesn't call `drop`, which makes the spawned
`bitcoind`/`electrsd` instances linger around after our tests have
finished. To fix this, we move them out of `OnceCell` and let every test
that needs them spawn their own instances. This additional let us drop
the `OnceCell` dev dependency.

Additionally, we improve the test robustness by applying most of the
changes from
lightningdevkit/rust-lightning#2033.
tnull added a commit to lightningdevkit/ldk-node that referenced this pull request Apr 17, 2023
* Add inital implementation of persisted payment store

* Implement `KVStoreUnpersister`

* Reorganize IO things into `io` submodules

* Reorganize test folder for consistency

* Move test `bitcoind`/`electrsd` out of `OnceCell`

`OnceCell` doesn't call `drop`, which makes the spawned
`bitcoind`/`electrsd` instances linger around after our tests have
finished. To fix this, we move them out of `OnceCell` and let every test
that needs them spawn their own instances. This additional let us drop
the `OnceCell` dev dependency.

Additionally, we improve the test robustness by applying most of the
changes from
lightningdevkit/rust-lightning#2033.

* Introduce `KVStore` trait and `FilesystemStore` impl

Rather than further relying on the upstream
`KVStorePersister`/`KVStoreUnpersister`, we here implement a general
`KVStore` trait that allows access to `Read`s/`TransactionalWrite`s
which may be used to deserialize/serialize data via the
`Readable`/`Writeable` implementations.

Notably `TransactionalWrite` is a `Write` for which the written data
needs to be explictly `commit`ed, asserting that we always persist
either the whole new change or no change at all.

Additionally, we avoid the `Info` umbrella term but opt to introduce `PaymentDetails`
to align naming with upcoming `PeerDetails` and `ChannelDetails`.

* Unpin BDK/esplora-client
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_esplora_sync is flaky
4 participants