
Unexpected invocation timeout notifications after terminating the .NET Isolated worker process #10580

Open
AnatoliB opened this issue Oct 30, 2024 · 0 comments


When a .NET Isolated worker runs multiple function invocations concurrently and one of them times out, the .NET language worker is restarted (expected), which interrupts the other in-flight invocations. However, the host apparently keeps monitoring these interrupted invocations and considers each of them timed out once the period configured by functionTimeout elapses, even though they never had a chance to use all this time, as they were interrupted early. This leads to at least two problems:

  1. Confusing and factually incorrect log messages (there is neither a timeout nor an extra language worker restart at this point):
    • Timeout value of <Value> exceeded by function '<FunctionName>' (Id: '<InvocationId>'). Initiating cancellation.
    • Microsoft.Azure.WebJobs.Host: Timeout value of <Value> was exceeded by function: <FunctionName>.
    • A function timeout has occurred. Restarting worker process executing invocationId '<InvocationId>'.
    • Restart of language worker process(es) completed.
  2. These timeout notifications are sent to middleware that may act on them destructively. Specifically, if the function happens to be a Durable activity, the Durable Functions middleware marks the activity invocation as failed in the orchestration history and deletes the activity message from the queue. As a result, the activity is never retried, and the timeout exception is surfaced to the orchestrator code, which can cause the orchestration to fail. This is the root cause of Azure/azure-functions-durable-extension#2939 (Activities fail with timeouts when the .NET worker process crashes).

I haven't tested these, but:

  • the same may happen when the .NET worker process crashes for other reasons;
  • other out-of-proc languages/stacks may be impacted as well.

Repro steps

This can easily be reproduced locally with Core Tools 4.0.6280 (the logs below reflect a functionTimeout of 00:00:30 in host.json):

  1. Create and start a .NET Isolated Function app locally.
  2. Invoke a function that will take longer than functionTimeout to complete.
  3. While the first invocation is in progress, invoke the same long-running function again.
  4. Observe the logs.

You can use this repro app: AnatoliB/af-dotnet-isolated-delay. With this app, steps 2 and 3 can be performed with the following PowerShell command:

# Start two 60-second invocations 20 seconds apart (sleep = Start-Sleep, irm = Invoke-RestMethod)
0..1 | ForEach-Object -Parallel { sleep ($_ * 20); irm "http://localhost:7071/api/Delay?name=$_&duration=60" }

Here is a typical result:

[2024-10-30T03:18:15.579Z] Executing 'Functions.Delay' (Reason='This function was programmatically called via the host APIs.', Id=6d2f2ee4-d1d4-4910-9821-b48986e8e3f9)
[2024-10-30T03:18:15.752Z] 0: 0 (PID: 7404)
[2024-10-30T03:18:15.752Z] 0: Delay duration: 60 seconds
[2024-10-30T03:18:16.761Z] 0: 1 (PID: 7404)
[2024-10-30T03:18:17.764Z] 0: 2 (PID: 7404)
[2024-10-30T03:18:18.778Z] 0: 3 (PID: 7404)
...
[2024-10-30T03:18:33.904Z] 0: 18 (PID: 7404)
[2024-10-30T03:18:34.908Z] 0: 19 (PID: 7404)
[2024-10-30T03:18:35.459Z] Executing 'Functions.Delay' (Reason='This function was programmatically called via the host APIs.', Id=5bfc6e84-8f82-47c2-b0d7-e4794f532836)
[2024-10-30T03:18:35.466Z] 1: Delay duration: 60 seconds
[2024-10-30T03:18:35.466Z] 1: 0 (PID: 7404)
[2024-10-30T03:18:35.909Z] 0: 20 (PID: 7404)
[2024-10-30T03:18:36.478Z] 1: 1 (PID: 7404)
[2024-10-30T03:18:36.924Z] 0: 21 (PID: 7404)
...
[2024-10-30T03:18:45.003Z] 0: 29 (PID: 7404)
[2024-10-30T03:18:45.555Z] 1: 10 (PID: 7404)
[2024-10-30T03:18:45.595Z] Timeout value of 00:00:30 exceeded by function 'Functions.Delay' (Id: '6d2f2ee4-d1d4-4910-9821-b48986e8e3f9'). Initiating cancellation.
[2024-10-30T03:18:45.618Z] Executed 'Functions.Delay' (Failed, Id=6d2f2ee4-d1d4-4910-9821-b48986e8e3f9, Duration=30062ms)
[2024-10-30T03:18:45.619Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:00:30 was exceeded by function: Functions.Delay.
[2024-10-30T03:18:46.016Z] 0: 30 (PID: 7404)
[2024-10-30T03:18:46.564Z] 1: 11 (PID: 7404)
[2024-10-30T03:18:47.018Z] 0: 31 (PID: 7404)
[2024-10-30T03:18:47.569Z] 1: 12 (PID: 7404)
[2024-10-30T03:18:47.651Z] A function timeout has occurred. Restarting worker process executing invocationId '6d2f2ee4-d1d4-4910-9821-b48986e8e3f9'.
[2024-10-30T03:18:48.032Z] 0: 32 (PID: 7404)
[2024-10-30T03:18:48.213Z] Worker process started and initialized.
[2024-10-30T03:18:48.579Z] 1: 13 (PID: 7404)
[2024-10-30T03:18:49.033Z] 0: 33 (PID: 7404)
...
[2024-10-30T03:18:52.063Z] 0: 36 (PID: 7404)
[2024-10-30T03:18:52.617Z] 1: 17 (PID: 7404)
[2024-10-30T03:18:52.686Z] Failed to proxy request to the worker. Retrying in 50ms. Attempt 1 of 3.
[2024-10-30T03:18:52.686Z] Failed to proxy request to the worker. Retrying in 50ms. Attempt 1 of 3.
[2024-10-30T03:18:56.798Z] Failed to proxy request to the worker. Retrying in 100ms. Attempt 2 of 3.
[2024-10-30T03:18:56.798Z] Failed to proxy request to the worker. Retrying in 100ms. Attempt 2 of 3.
[2024-10-30T03:18:58.233Z] Restart of language worker process(es) completed.
[2024-10-30T03:19:05.465Z] Timeout value of 00:00:30 exceeded by function 'Functions.Delay' (Id: '5bfc6e84-8f82-47c2-b0d7-e4794f532836'). Initiating cancellation.
[2024-10-30T03:19:05.467Z] Executed 'Functions.Delay' (Failed, Id=5bfc6e84-8f82-47c2-b0d7-e4794f532836, Duration=30008ms)
[2024-10-30T03:19:05.468Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:00:30 was exceeded by function: Functions.Delay.
[2024-10-30T03:19:07.472Z] A function timeout has occurred. Restarting worker process executing invocationId '5bfc6e84-8f82-47c2-b0d7-e4794f532836'.
[2024-10-30T03:19:07.474Z] Restart of language worker process(es) completed.
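
To isolate the key host messages in output like the above, you can tee the Core Tools output to a file and filter it (a convenience sketch, not part of the original repro; the patterns are just substrings of the log lines shown here):

# Terminal 1: run the host and capture its output to a file
func start *>&1 | Tee-Object -FilePath func.log

# Terminal 2: show only the timeout/restart messages
Select-String -Path func.log -Pattern 'Timeout value|function timeout|Restart'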

Expected behavior

  • When the first invocation (6d2f2ee4-d1d4-4910-9821-b48986e8e3f9) times out, I expect:
    • notifications about that specific invocation timing out and the language worker process being restarted, such as:
      • Microsoft.Azure.WebJobs.Host: Timeout value of 00:00:30 was exceeded by function: Functions.Delay.
      • Executed 'Functions.Delay' (Failed, Id=6d2f2ee4-d1d4-4910-9821-b48986e8e3f9, Duration=30062ms)
      • A function timeout has occurred. Restarting worker process executing invocationId '6d2f2ee4-d1d4-4910-9821-b48986e8e3f9'
      • Worker process started and initialized.
    • additional notifications for the invocations that were interrupted by the worker restart (here 5bfc6e84-8f82-47c2-b0d7-e4794f532836); these notifications should not indicate a timeout: the log messages should be different, and the notification sent to the middleware should allow the middleware to handle this situation differently.
  • I don't expect any further timeout notifications for the other invocations (such as 5bfc6e84-8f82-47c2-b0d7-e4794f532836), i.e. none of the messages after 03:19:05 in the log above.

Actual behavior

  • When the first invocation (6d2f2ee4-d1d4-4910-9821-b48986e8e3f9) times out:
    • the host correctly sends the expected notifications and restarts the language worker (03:18:45 in the log above) - so far so good;
    • the host does not provide any notification about the other invocations that happened to be running in the same worker at the same time - problem 1.
  • Later (one functionTimeout period after the second invocation started):
    • the host sends a surprising notification about the other invocation (5bfc6e84-8f82-47c2-b0d7-e4794f532836) timing out, and this notification is sent to middleware - problem 2;
    • the host logs messages suggesting that a language worker is restarted (fortunately this is not true: the worker is not restarted again, but the messages are confusing) - problem 3:
      • A function timeout has occurred. Restarting worker process executing invocationId '5bfc6e84-8f82-47c2-b0d7-e4794f532836'.
      • Restart of language worker process(es) completed.

Known workarounds

There is no known generic workaround. Depending on the context, the user may be able to tolerate this behavior (e.g. by retrying invocations if needed).
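
For the HTTP repro app above, for example, a simple client-side retry can tolerate the spurious failure (a minimal sketch against the repro app's local endpoint; the name/duration values are illustrative, and a real app would also need the function to be idempotent):

# Retry a Delay invocation a few times if the host fails it
$maxAttempts = 3
for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    try {
        irm "http://localhost:7071/api/Delay?name=retry&duration=10"
        break  # success, stop retrying
    }
    catch {
        Write-Warning "Attempt $attempt failed: $($_.Exception.Message)"
        if ($attempt -eq $maxAttempts) { throw }
        Start-Sleep -Seconds 5
    }
}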
