When a .NET Isolated worker runs multiple function invocations concurrently, one of these invocations times out, and the .NET language worker is restarted (expected), the other invocations are obviously interrupted. However, these other invocations seem to be still monitored by the host, and the host considers these invocations timed out after the time period configured by functionTimeout, even though these invocations have not had a chance to use all this time, as they were interrupted early. This leads to at least two problems:
Confusing and factually incorrect log messages (there is no timeout at this point, and there is no extra language worker restart at this point):
Timeout value of <Value> exceeded by function '<FunctionName>' (Id: '<InvocationId>'). Initiating cancellation.
Microsoft.Azure.WebJobs.Host: Timeout value of <Value> was exceeded by function: <FunctionName>.
A function timeout has occurred. Restarting worker process executing invocationId '<InvocationId>'.
Restart of language worker process(es) completed.
These timeout notifications are sent to middleware, which may react destructively. Specifically, if the function is a Durable activity, the Durable Functions middleware marks the activity invocation as failed in the orchestration history and deletes the activity message from the queue. As a result, the activity is never retried and the timeout exception is surfaced to the orchestrator code, which can cause the orchestration to fail. This is the root cause of Azure/azure-functions-durable-extension issue #2939, "Activities fail with timeouts when the .NET worker process crashes".
I haven't tested these, but:
the same may happen when the .NET worker process crashes for other reasons;
other out-of-proc languages/stacks may be impacted as well.
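The sequence described above can be illustrated with a toy model. This is only a sketch of the observed behavior, not the host's actual implementation: the function name `simulate`, the event labels, and the assumption that every invocation runs longer than functionTimeout are all hypothetical. The model keeps one deadline per invocation and never cancels a deadline when the worker restarts, so an invocation that was interrupted by the restart is still reported as timed out later.

```python
def simulate(function_timeout, start_times):
    """Toy model: return (time, event, invocation_id) tuples.

    Assumptions (hypothetical, for illustration only):
    - every invocation would run longer than function_timeout;
    - the host keeps one deadline per invocation and does not cancel
      deadlines of invocations interrupted by a worker restart.
    """
    deadlines = sorted(
        (start + function_timeout, f"inv-{index}", start)
        for index, start in enumerate(start_times)
    )
    events = []
    last_restart = None  # time of the most recent worker restart, if any
    for deadline, invocation_id, start in deadlines:
        # An invocation that started before the last restart was running in
        # the old worker process, so the restart already interrupted it.
        interrupted = last_restart is not None and start < last_restart
        if interrupted:
            # The stale deadline still fires: this is the misleading report.
            events.append((deadline, "spurious-timeout", invocation_id))
        else:
            events.append((deadline, "timeout+worker-restart", invocation_id))
            last_restart = deadline
    return events


# Illustrative values: a 30-second timeout, second invocation started 20 s later.
for event in simulate(30, [0, 20]):
    print(event)
```

In this model the second invocation is reported as timed out at t=50 even though it stopped running at t=30, when the worker was restarted.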
Repro steps
Can be easily reproduced locally with Core Tools 4.0.6280:
1. Create and start a .NET Isolated Function app locally.
2. Invoke a function that will take longer than functionTimeout to complete.
3. While the first invocation is in progress, invoke a long-running function again.
4. Observe the logs.
You can use this repro app: AnatoliB/af-dotnet-isolated-delay. If you use this app, steps 2 and 3 can be performed by the following PowerShell command:
Here is a typical result:
Expected behavior
When the first invocation (6d2f2ee4-d1d4-4910-9821-b48986e8e3f9) times out, I expect:
the same logs as now:
Microsoft.Azure.WebJobs.Host: Timeout value of 00:00:30 was exceeded by function: Functions.Delay.
Executed 'Functions.Delay' (Failed, Id=6d2f2ee4-d1d4-4910-9821-b48986e8e3f9, Duration=30062ms)
A function timeout has occurred. Restarting worker process executing invocationId '6d2f2ee4-d1d4-4910-9821-b48986e8e3f9'
Worker process started and initialized.
additional notifications for the invocations that were interrupted by the worker restart (5bfc6e84-8f82-47c2-b0d7-e4794f532836); these notifications should not indicate a timeout: the logs should be different, and the notification sent to the middleware should let the middleware handle this situation differently.
I don't expect any further timeout notifications for the other invocations (such as 5bfc6e84-8f82-47c2-b0d7-e4794f532836), like all the messages after 03:19:05 in the log above.
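One way to picture the expected behavior is a tracker that, when a timeout forces a worker restart, reports the other in-flight invocations with a distinct reason and drops their deadlines so they never fire later. This is a minimal sketch under stated assumptions, not a proposal for the host's real design: the class `TimeoutTracker`, its methods, and the reason strings "timed-out" and "interrupted-by-worker-restart" are hypothetical names invented for illustration.

```python
class TimeoutTracker:
    """Hypothetical model of per-invocation timeout tracking."""

    def __init__(self, function_timeout):
        self.function_timeout = function_timeout
        self.deadlines = {}  # invocation id -> deadline (seconds)

    def start(self, invocation_id, now):
        """Record a new in-flight invocation."""
        self.deadlines[invocation_id] = now + self.function_timeout

    def advance(self, now):
        """Process deadlines due by `now`; return (reason, invocation_id) pairs."""
        notifications = []
        due = sorted(
            (deadline, invocation_id)
            for invocation_id, deadline in self.deadlines.items()
            if deadline <= now
        )
        for _, invocation_id in due:
            if invocation_id not in self.deadlines:
                continue  # already reported as interrupted by an earlier restart
            del self.deadlines[invocation_id]
            notifications.append(("timed-out", invocation_id))
            # The timeout restarts the worker, which interrupts every other
            # in-flight invocation. Report those with a distinct reason and
            # clear their deadlines so no stale timeout is raised later.
            for other in sorted(self.deadlines):
                notifications.append(("interrupted-by-worker-restart", other))
            self.deadlines.clear()
        return notifications


tracker = TimeoutTracker(function_timeout=30)
tracker.start("first", now=0)
tracker.start("second", now=20)
print(tracker.advance(now=30))
print(tracker.advance(now=50))
```

With a distinct reason available, middleware could choose a different reaction for interrupted invocations (for example, leaving a message eligible for retry) instead of treating them as timeouts.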
Actual behavior
When the first invocation (6d2f2ee4-d1d4-4910-9821-b48986e8e3f9) times out:
the host correctly sends the expected notifications and restarts the language worker (03:18:45 in the log above) - so far so good;
the host does not provide any notification about other invocations that happened to be running in the same worker at the same time - problem 1.
Later (functionTimeout after the second invocation started):
the host sends a surprising notification that the other invocation (5bfc6e84-8f82-47c2-b0d7-e4794f532836) has timed out, and this notification is sent to middleware - problem 2;
the host logs messages suggesting that a language worker is restarted (which is fortunately not true, the worker is not restarted again, but the messages are confusing) - problem 3:
A function timeout has occurred. Restarting worker process executing invocationId '5bfc6e84-8f82-47c2-b0d7-e4794f532836'.
Restart of language worker process(es) completed.
Known workarounds
There is no known generic workaround. Depending on the context, the user may be able to tolerate this behavior (e.g. retry invocations if needed).