Skip to content
This repository has been archived by the owner on Nov 15, 2019. It is now read-only.

Failing of connection establishment and missing messages on rust-utp socket test #439

Closed
maqi opened this issue Dec 2, 2015 · 7 comments
Assignees

Comments

@maqi
Copy link
Contributor

maqi commented Dec 2, 2015

As detailed described in the Rust-UTP PR maidsafe-archive/rust-utp#8

the test proves a failing rate at around : 0.005% when socket without timeout and 0.25% when socket with timeout.

according to the work of Crust PR #437 , the test proves the crust level failing rate is around the same as the rust-utp when socket with timeout.

@vinipsmaker
Copy link
Contributor

I'm suspecting this is being caused by the congestion control algorithm (and the changes brought by the timeout branch). I'm going to investigate it in detail.

@vinipsmaker
Copy link
Contributor

I based my work on top of @maqi's test_network_utp branch.

After some attempts, debugging hours, observations and so on, I'm suspecting the problem is on the logic of the network. I believe the problem is on the congestion control algorithm from rust-utp. Missing packets make uTP unreliable. uTP isn't working properly under unreliable connections (which is exactly where UDP is supposed to be used). On the tests done, it was observed that the use of custom user timeout -- which is a feature used by crust to make concurrent reads/writes possible -- would increase the failure/hang rate.

My newest try was to change CHECK_FOR_NEW_WRITES_INTERVAL_MS to support my claim/suspicion. Making the check more "granular" would increase the chance to not be reading when the packet was sent, which would cause a "lost packet". The "lost packet" is exactly the reason why I believe the test is failing.

Under the usual 50i64, the rates I got from my machine were:

ok err
6 15

When I changed the value to 200i64, the stats were:

ok err
11 10

With my suspicion confirmed (actually, I'd need to run more tests, but they take too long to complete and I already have something to change and test to see if the problem goes away), the next step is to study the congestion control algorithm of rust-utp/uTP and propose a solution to make it more reliable (and maybe give better guarantees on connection timed out). Until a definitive solution is implemented, crust could change the CHECK_FOR_NEW_WRITES_INTERVAL_MS value to decrease the rate failure and use ping/pong messages to detect connection loss itself.

Also, we could implement solution #2 from meqif/rust-utp#4 (comment), which would make the design more reliable.

@vinipsmaker
Copy link
Contributor

So, I found out that one PR of ours broke µTP LEDBAT.

Previously rust-utp would do a read without a timeout and would block until this timeout would be reached. So the algorithm was coded in such a way that, when the timeout was reached, it'd resend the packet at the correct time, as specified per LEDBAT. And one of the properties from LEDBAT is that it'd "back off" and slowdown exactly when the network usage is high, which is exactly what we're doing the expose the failure. Also, the maximum number of retries would be reached way sooner with small timeouts.

So, my idea to fix the problem is to refactor the code in such a way that we'd compute the passed time ourselves. This means storing the "time point" the last packet was resent (instead of relying on the wrong assumption that enough-time/the-right-amount-of-time has passed on the line following) and moving lines around such that if we're just coming back after a user timeout happened, we'll only do a read (and nothing else!) to wait for the packet for more time.

@vinipsmaker
Copy link
Contributor

Kind of my changes are ready for some time already, but when I try to test against test_network_utp from modified (with changes for utp_wrapper reverted) #442, no messages are received at all (actual error being thread 'Service::listen' panicked at 'Error { repr: Custom(Custom { kind: InvalidData, error: StringError("Failed to decode CBOR") }) }', src/state.rs:165 and src/state.rs:165 being trans.receiver.receive_handshake().unwrap()).

I've been debugging what's wrong during this time. Actually I somehow might have broken the connect function from rust-utp (which calls recv, which is the function I needed to change).

@vinipsmaker
Copy link
Contributor

I've been debugging what's wrong during this time. Actually I somehow might have broken the connect function from rust-utp (which calls recv, which is the function I needed to change).

It should be fixed now with the latest version of the set-read-timeout patch:

Upstream project's CI will timeout, but if you let the test run a little longer, tests will pass.

@vinipsmaker
Copy link
Contributor

It's merged on master now. I'll create a proper PR with the Windows build fix that was suggested earlier among our core devs and then:

  1. Update the network test.
  2. Confirm the Windows hanging issue.
  3. Find an alternative fix that doesn't break current behaviour/usage-patterns.

@vinipsmaker
Copy link
Contributor

Gonna close this issue as the rust-utp reliability problem was solved. We should open another issue to track Windows hanging issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants