Import tasks getting stuck mid execution #19

victorges · 2022-04-29T22:36:46Z

We have some weird cases of tasks getting stuck mid-execution. These tasks keep getting
re-executed, since RabbitMQ eventually times out (after 30m) waiting on an ACK and
terminates the connection with the client, nacking all messages it had in flight. The task is then
re-executed as if nothing happened [1].

[1] This is a separate bug that we should also address. The task should just fail if it had already started
running before and then disappeared, which we can already tell from the metadata in the API. This
is not the root cause though, so we still need to investigate and fix the stuck tasks.

There are no logs that indicate what is wrong, but I have a slight suspicion about one of:

the "progress reporting" logic
the "piping" logic in the import task which sends a stream both to ffprobe and to the storage
the S3 upload client

The first tasks where I found this error were importing large stream recordings, which take
12+ minutes to download on a good connection due to the on-demand MP4 generation bottleneck.
That was already weird, since we have a hard timeout of 10 minutes, so the task runner
should have just failed the task instead of going silent.

Right now I just found an even weirder case though. It was from a regular "import" task, which is not
importing a recording but just another asset, as a test that the user was making. This is the
task:

{
    "id": "51ea2a1e-618e-452d-a024-7c5a0ace266f",
    "type": "import",
    "params": {
        "import": {
            "url": "https://livepeercdn.com/asset/REDACTED/video"
        }
    },
    "status": {
        "phase": "running",
        "progress": 0.649,
        "updatedAt": 1651269956139
    },
    "userId": "REDACTED",
    "createdAt": 1650886712179,
    "outputAssetId": "4582de3b-ead3-4ffe-8b6d-b130f61290a1"
}

The asset is around 5GB and takes less than a minute to download on a good connection, so there's
no clear reason why the task-runner is getting stuck.


We have some tasks that are being re-executed over and over again because they get stuck in the task-runner logic. We should fix the root cause of those, but to keep the problem from getting worse we should also stop re-running them. This fixes that by not even starting tasks that we find had already been started (phase=running). This is related to #19

victorges mentioned this issue Apr 29, 2022

task/runner: Avoid re-running poison tasks #20

Merged

victorges mentioned this issue May 3, 2022

Stuck VOD tasks livepeer/studio#1103

Open

This was referenced Jun 11, 2022

segmenter: Fix error handling livepeer/stream-tester#164

Merged

task: Update go-api-client and stream-tester with fixes #43

Merged
