We have some weird cases of tasks getting stuck mid-execution. These tasks keep getting
re-executed, since RabbitMQ eventually times out (after 30m) waiting on an ACK and
terminates the connection with the client, nacking all the messages it had in flight. The task is then
re-executed as if nothing had happened [1].
[1] This is another bug that we should address: the task should just fail if it had already tried
running before and disappeared, which we can already tell from the metadata in the API. This
is not the root cause though, so we still need to investigate and fix the stuck tasks.
There are no logs that indicate what is wrong, but I have a light suspicion of one of:
- the "progress reporting" logic
- the "piping" logic in the import task, which sends a stream both to ffprobe and to the storage
- the S3 upload client
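If the piping logic turns out to be involved, one common way such a tee can hang is when one of the two consumers stops reading and the producer blocks forever on a full buffer. The sketch below only illustrates that failure mode and a per-write deadline that turns the hang into a visible error. It is not the task-runner's actual code: `tee_chunks`, the queue-based sinks, and `put_timeout_s` are hypothetical names chosen for the example.

```python
import queue


def tee_chunks(chunks, sinks, put_timeout_s):
    """Copy every chunk to every sink queue, failing loudly on a stalled sink.

    `sinks` are bounded queue.Queue objects, each standing in for one consumer
    (for example the ffprobe input and the storage upload). A consumer that
    stops reading leaves its queue full, so put() raises queue.Full after
    `put_timeout_s` seconds instead of blocking forever.
    """
    for chunk in chunks:
        for sink in sinks:
            sink.put(chunk, timeout=put_timeout_s)
    # A None sentinel tells each consumer that the stream has ended.
    for sink in sinks:
        sink.put(None, timeout=put_timeout_s)
```

With a blocking `put` and no timeout, the same stalled consumer would hang the whole task without producing a log line, which would match the symptom described here.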
The first tasks where I found this error were importing large stream recordings, which take
12+ minutes to download even on a good connection due to the on-demand MP4 generation bottleneck.
That was already weird, since we have a hard timeout of 10 minutes, so the task runner
should have just failed the task instead of going silent.
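For reference, a hard timeout that cannot go silent has to be enforced from outside the task handler, so that it fires even when the handler is blocked and never checks a deadline itself. A minimal sketch of that idea, assuming a thread-based runner (`run_with_hard_timeout` and `handler` are hypothetical names, not the task-runner's API):

```python
import concurrent.futures


def run_with_hard_timeout(handler, timeout_s):
    """Run handler() in a worker thread and fail if it exceeds timeout_s.

    The caller gets a TimeoutError even if the handler is stuck, because the
    deadline is enforced by the waiting side rather than by the handler.
    The stuck thread is not killed; it is only abandoned.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(handler)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"task exceeded hard timeout of {timeout_s}s") from None
    finally:
        # Do not wait for a possibly stuck worker when leaving this function.
        executor.shutdown(wait=False)
```

A timeout that instead depends on the handler cooperating (for example by checking a cancellation flag between steps) would stay silent if the handler is blocked inside a single I/O call, which is one possible explanation for the behaviour above.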
I just found an even weirder case though. It was a regular "import" task, which was not
importing a recording but just another asset that the user was importing as a test. This is the
task:
The asset is around 5GB and takes less than a minute to download on a good connection, so there's
no clear reason why the task-runner is getting stuck.
We have some tasks that are being re-executed
over and over again since they get stuck in the
task-runner logic. We should fix the root cause
of those, but to avoid the problem from getting
worse we should also avoid re-running these tasks
over and over again.
This fixes that by not even starting tasks that we
find out had already been started (phase=running).
This is related to #19
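The guard described in this commit message could look roughly like the sketch below. It assumes the task status can be fetched from the API before execution. `TaskStatus`, `fetch_status`, `run_task`, and the phase values other than `running` are hypothetical names for illustration, not the actual task-runner code.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    phase: str  # "running" comes from the commit message; other values are assumed


def should_start(status):
    """A task already in phase=running was started before and then disappeared."""
    return status.phase != "running"


def handle_delivery(task_id, fetch_status, run_task):
    """Process one redelivered message, refusing to re-run a started task."""
    status = fetch_status(task_id)
    if not should_start(status):
        # Skip instead of re-executing, so a stuck task cannot loop forever.
        return "skipped"
    run_task(task_id)
    return "started"
```

For example, `handle_delivery("t1", lambda _id: TaskStatus("running"), run)` returns `"skipped"` without calling `run`.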