
Progress halt with a certain future composition on the grammers-client API user side #282

Open
MOZGIII opened this issue Oct 25, 2024 · 3 comments



MOZGIII commented Oct 25, 2024

This is a very peculiar issue, but fortunately I have managed to come up with a relatively small example that reproduces this bug: https://github.com/MOZGIII/grammers-halt

The issue manifests like this:

[screenshot: log output stalling after main::reconciler: before iter next]

If the code were working as intended, the log line main::reconciler: before iter next would have been followed by something else - either an error or main::reconciler: after iter next. Instead, the code stalls and makes no further progress.

My investigation showed:

  • There are no blocking (synchronous) locks held by any thread that would block further execution, and no async locks are held besides the single lock on the grammers_mtsender::Sender taken under the hood by the grammers_client::Client call to next_raw_update - which is intended. In other words, this is not a deadlock in the general sense.
    I verified this via tokio console for the async locks and lldb for the sync mutexes.

  • The issue is not with tokio::sync::Mutex specifically: swapping the mutex implementation in question for async_mutex::Mutex produced the same outcome - the app still stalled.

  • The select chain at

    match select(select(sleep, recv_req), select(recv_data, send_data)).await {
        Either::Left((Either::Left(_), _)) => Sel::Sleep,
        Either::Left((Either::Right((request, _)), _)) => Sel::Request(request),
        Either::Right((Either::Left((n, _)), _)) => Sel::Read(n),
        Either::Right((Either::Right((n, _)), _)) => Sel::Write(n),
    }
    does not progress, even though the conditions for the individual futures to make progress are fulfilled. I don't mean that the select itself is the issue; the reason the futures (or, rather, the combined select future) do not progress is that they are not being polled at all - the task is simply not scheduled - not that they are polled and get stuck.
    I managed to test this with the following change:

    [screenshot of the patched code]

    Sorry for the screenshot - I didn't commit the change, and this is all I have now.

    I wrapped the

    let sleep = pin!(async { sleep_until(self.next_ping).await });

    with tokio::spawn and added logging inside that spawned future after the sleep returns, then made the select wait on the JoinHandle of that spawned task. After a minute I saw the log message - meaning the timer worked - but the select that was polling the JoinHandle did not return - meaning the task executing the select was not scheduled when the JoinHandle completed. (A minimal sketch of this phenomenon is included right after this list.)
    Rather odd, yet maybe I'm missing something.
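
Here is a minimal, self-contained sketch of that phenomenon (these are stand-in futures, not the actual grammers-mtsender code, and the shape is only an approximation of the experiment above): a spawned timer task completes and logs on its own, but the future awaiting its JoinHandle never observes the completion, because once the other select branch wins, that future is no longer polled.

    use std::future::pending;
    use std::pin::pin;
    use tokio::time::{sleep, Duration};

    #[tokio::main]
    async fn main() {
        // The spawned timer task is driven by the runtime on its own,
        // so this log line always appears after one second.
        let timer = tokio::spawn(async {
            sleep(Duration::from_secs(1)).await;
            println!("spawned timer fired");
        });

        // A future waiting on the JoinHandle, standing in for the
        // select-over-JoinHandle from the experiment above.
        let mut joined = pin!(async {
            timer.await.expect("timer task panicked");
            println!("observed the JoinHandle completing");
        });

        tokio::select! {
            _ = &mut joined => {}
            // This branch wins immediately; while execution stays inside its
            // body, `joined` is never polled again, so the second log line
            // never appears even though the spawned task finished long ago.
            _ = async {} => {
                pending::<()>().await;
            }
        }
    }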

I don't know what's going on there, but I suspect a tokio bug.


I have managed to create a workaround - essentially the same logic, but executed as two independently spawned tasks instead of structured concurrency. It can be found at https://github.com/MOZGIII/grammers-halt/tree/working1.

See MOZGIII/grammers-halt@master...working1 for the changes that make the code work. This does not solve the underlying issue, but it might be useful for anyone who needs to work around the bug in the meantime. A rough sketch of the general pattern follows.
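
For reference, here is a rough sketch of that general pattern (the Update type and everything else here are made-up stand-ins, not the contents of the working1 branch; the real code drives the grammers client instead): the update stream is driven by its own spawned task and forwarded over a channel, so however long the consumer waits inside a select branch, the producer keeps getting polled by the runtime.

    use tokio::sync::mpsc;
    use tokio::time::{sleep, Duration};

    // Hypothetical stand-in for a raw update.
    #[derive(Debug)]
    struct Update(u64);

    #[tokio::main]
    async fn main() {
        let (tx, mut rx) = mpsc::channel::<Update>(16);

        // Task 1: drives the update stream independently, so it is always
        // polled by the runtime regardless of what the consumer is doing.
        // In the real code this loop would poll the client for raw updates.
        let producer = tokio::spawn(async move {
            for i in 0u64.. {
                sleep(Duration::from_millis(500)).await; // stand-in for waiting on the network
                if tx.send(Update(i)).await.is_err() {
                    break; // consumer is gone
                }
            }
        });

        // Task 2 (here, the main task): reacts to updates and timers.
        // A long await inside a branch no longer starves the producer.
        for _ in 0..5 {
            tokio::select! {
                Some(update) = rx.recv() => {
                    println!("got {update:?}");
                }
                _ = sleep(Duration::from_secs(1)) => {
                    // Stand-in for invoking a request; the producer task
                    // keeps running while we wait here.
                }
            }
        }

        producer.abort();
    }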


MOZGIII commented Oct 26, 2024

The root cause boils down to how the locking inside invoke interacts with the locking in next_raw_update when, as in the example, next_raw_update is selected together with something else (in our case a timer) and invoke is called inside that other select branch.
This leads to a deadlock: invoke tries to lock the sender mutex and waits indefinitely. Normally this would not be a problem, because the current holder of the lock would make progress and relieve invoke of the need to acquire the sender lock by simply returning the result. In our case, however, the sender is locked by next_raw_update, and we are not inside the select statement (macro) itself but inside one of its branches - which means the next_raw_update future is no longer being polled, despite actually being ready to progress. That is the ultimate reason sender.step isn't polled and the sender can't make progress; couple that with invoke's attempt to lock the sender, and we get a deadlock (sketched below, after the note).

Note that even if invoke didn't try to lock the sender and instead just waited for rx to become ready, we would still get the same deadlock, since the future currently driving the sender can't progress.
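
To make this concrete, here is a minimal, self-contained sketch of the problematic shape using stand-in types (this is not the real grammers API; Sender, next_raw_update, and invoke below are simplified models of the behaviour described above). One future holds the sender lock and must keep being polled to make progress, while a sibling select branch tries to take the same lock; once execution enters that branch, the lock holder is never polled again and the program deadlocks.

    use std::pin::pin;
    use std::sync::Arc;
    use tokio::sync::Mutex;
    use tokio::time::{sleep, Duration};

    // Stand-in for grammers_mtsender::Sender: it only makes progress
    // while someone holds its lock and keeps polling `step`.
    struct Sender;

    impl Sender {
        async fn step(&mut self) {
            sleep(Duration::from_millis(100)).await; // pretend network work
        }
    }

    // Stand-in for next_raw_update: locks the sender and drives it.
    async fn next_raw_update(sender: &Arc<Mutex<Sender>>) {
        let mut guard = sender.lock().await;
        loop {
            guard.step().await;
        }
    }

    // Stand-in for invoke: it needs the same lock (or, equivalently, it
    // needs the current lock holder to make progress on its behalf).
    async fn invoke(sender: &Arc<Mutex<Sender>>) {
        let _guard = sender.lock().await;
    }

    #[tokio::main]
    async fn main() {
        let sender = Arc::new(Mutex::new(Sender));
        let mut updates = pin!(next_raw_update(&sender));

        tokio::select! {
            _ = &mut updates => {}
            _ = sleep(Duration::from_secs(1)) => {
                // While we await here, `updates` - the only thing that could
                // release the sender lock - is not polled anymore, so this
                // await never completes: deadlock.
                invoke(&sender).await;
                println!("never reached");
            }
        }
    }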

There are no issues with tokio - which is, really, as expected. In fact, tokio can't do anything at that point, because the composed future state machine simply does not allow that sender.send code branch to be polled.

I'd argue that the grammers-client implementation should be altered to make it impossible to end up in a situation where the client can't progress - especially since it was clearly designed to allow progress in exactly this kind of situation - which is why I'd consider this a bug.

MOZGIII changed the title from "Async runtime task halt in grammers-client" to "Progress halt with a certain future composition on the grammers-client API user side" on Oct 26, 2024

MOZGIII commented Oct 26, 2024

@Lonami how would you like to proceed? I can work on a fix for this when I get back to it, but at the moment I'm busy elsewhere.


Lonami commented Oct 26, 2024

I still need enough brain power to read through and truly understand the problem. I don't know when that'll be.

I'm not under any time pressure to see this fixed. So if you have a clear path forward in mind, that'd be best.
