
Intermittent Lambda Timeout #1259

Open
jrb331 opened this issue Jul 18, 2024 · 15 comments

Comments

@jrb331

jrb331 commented Jul 18, 2024

I have had the CDK stack running in production for roughly 3 months without issue. Starting earlier this week, we are seeing an issue where the lambdas time out after the 10-minute max. S3 objects are stuck with a scan-status tag of IN PROGRESS, or sometimes the scan hangs before they are even tagged. I saw this issue describing a similar problem, but updating the cdk-serverless-clamscan package to 2.6.230 did not seem to resolve the issue. I would appreciate any guidance you have here. I'm happy to provide any additional detail.
[Screenshot attached: 2024-07-18 3:11 PM]
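For reference, a rough helper along these lines can audit which objects are stuck with the scan-status values described above (just a sketch using the AWS SDK for JavaScript v3, not part of the construct; the bucket name is a placeholder):

```typescript
// audit-stuck-scans.ts — sketch: list objects whose scan-status tag is still
// "IN PROGRESS" or missing entirely. BUCKET_NAME is a placeholder for the
// scanned bucket; the tag key/value match the values described above.
import {
  S3Client,
  ListObjectsV2Command,
  GetObjectTaggingCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET_NAME = "my-scanned-bucket"; // placeholder

async function findStuckObjects(): Promise<void> {
  let token: string | undefined;
  do {
    const page = await s3.send(
      new ListObjectsV2Command({ Bucket: BUCKET_NAME, ContinuationToken: token })
    );
    for (const obj of page.Contents ?? []) {
      if (!obj.Key) continue;
      const tagging = await s3.send(
        new GetObjectTaggingCommand({ Bucket: BUCKET_NAME, Key: obj.Key })
      );
      const status = tagging.TagSet?.find((t) => t.Key === "scan-status")?.Value;
      if (status === undefined || status === "IN PROGRESS") {
        console.log(`${obj.Key}: ${status ?? "<no scan-status tag>"}`);
      }
    }
    token = page.NextContinuationToken;
  } while (token);
}

findStuckObjects().catch(console.error);
```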

@dontirun
Contributor

Are there any error messages in the CloudWatch Logs?

@jrb331
Author

jrb331 commented Jul 19, 2024

All logs show a "Task timed out after 600 seconds" error message. Below are a handful of them. Is there any way to increase the verbosity of the logs?
log-events-viewer-result.xlsx

@jrb331
Author

jrb331 commented Jul 19, 2024

Also potentially worth noting: some objects are stuck in an IN PROGRESS state (which occurs here) after lambda failure, while others time out before the object gets tagged at all. That is the only indication of where the issue might be happening that I've been able to discern thus far.

@jrb331
Author

jrb331 commented Jul 19, 2024

[Screenshot attached: metrics timeline, 2024-07-19 12:41 PM]

This timeline of metrics illustrates the problem well, I think. I began with small numbers of concurrent uploads (75 every 5 minutes) and saw no issue. I then increased the number of concurrent uploads, first to 150 and then to 300. As you can see, there are still no durations anywhere near long enough to cause a timeout, and 100% of the executions were successful. Once I bumped the number up to 1000, however, the problem resurfaced. You can see the execution duration balloon to the maximum configured for the lambda, errors surface in bulk, the 3 retries get attempted (unsuccessfully), and finally the majority of events get dropped. You can also see that I'm maxing out at a concurrency of ~800. I don't have any other lambdas in this account that fire frequently enough to push me over the 1,000 concurrent executions limit. Hopefully this additional context is helpful.
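As a sanity check on that concurrency ceiling, something along these lines (a sketch using the AWS SDK for JavaScript v3) prints the account-level concurrent-execution limit and how much of it is unreserved:

```typescript
// check-concurrency.ts — sketch: print the account's Lambda concurrency limits
// to confirm whether the ~800 ceiling comes from the account quota or from
// reserved capacity consumed elsewhere in the account.
import {
  LambdaClient,
  GetAccountSettingsCommand,
} from "@aws-sdk/client-lambda";

const client = new LambdaClient({});

async function main(): Promise<void> {
  const settings = await client.send(new GetAccountSettingsCommand({}));
  console.log("Concurrent executions limit:", settings.AccountLimit?.ConcurrentExecutions);
  console.log("Unreserved concurrency:", settings.AccountLimit?.UnreservedConcurrentExecutions);
}

main().catch(console.error);
```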

@jrb331
Author

jrb331 commented Jul 19, 2024

[Screenshots attached: log views correlating cold_start with execution duration, 2024-07-19 12:51 PM and 12:52 PM]

It seems like cold starts may be related. Some observations:

  • As you can see in the first photo, the logs where cold_start = 0 are correlated with low lambda execution times.
  • Similarly, the logs where cold_start = 1 are correlated with longer durations and timeouts. This of course makes sense because an execution will take longer when it first has to initialize an environment.
  • These correlations are true even when the lambda does not time out. For example, the jump in duration around the 11:30 mark corresponds with a spike in cold starts at the same time.
  • These cold starts also correspond to times when the number of concurrent requests increases. This also makes sense, because more requests mean additional environments need to be provisioned, and thus more cold starts.

The questions I have are the following: Why would an increased quantity of cold starts increase the amount of time each individual lambda takes to execute?

The above screenshots show the cold starts were required even for the batches that all executed successfully, so why does the task duration balloon when a large number of lambdas require a cold start?
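One way to put numbers on the cold-start/duration correlation is a CloudWatch Logs Insights query against the scan function's log group. Below is a sketch using the AWS SDK for JavaScript v3; the log group name is a placeholder, and the query keys off the Lambda platform REPORT fields (@initDuration only appears on cold starts) rather than the application-level cold_start field shown in the screenshots:

```typescript
// cold-start-durations.ts — sketch: compare durations of cold vs. warm
// invocations over the last hour. LOG_GROUP is a placeholder for the scan
// function's log group.
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({});
const LOG_GROUP = "/aws/lambda/<scan-function-name>"; // placeholder

const QUERY = `
filter @type = "REPORT"
| fields ispresent(@initDuration) as cold_start
| stats count(*) as invocations, avg(@duration) as avg_ms, max(@duration) as max_ms by cold_start
`;

async function main(): Promise<void> {
  const now = Math.floor(Date.now() / 1000);
  const { queryId } = await logs.send(
    new StartQueryCommand({
      logGroupName: LOG_GROUP,
      startTime: now - 3600,
      endTime: now,
      queryString: QUERY,
    })
  );

  // Poll until the query finishes, then print the aggregated rows.
  for (;;) {
    const res = await logs.send(new GetQueryResultsCommand({ queryId }));
    if (res.status === "Complete") {
      console.log(JSON.stringify(res.results, null, 2));
      return;
    }
    if (res.status === "Failed" || res.status === "Cancelled") {
      throw new Error(`Query ${res.status}`);
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}

main().catch(console.error);
```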

@dontirun
Contributor

The questions I have are the following: Why would an increased quantity of cold starts increase the amount of time each individual lambda takes to execute?
The above screenshots show the cold starts were required even for the batches that all executed successfully, so why does the task duration balloon when a large number of lambdas require a cold start?

From the perspective of the Lambda application code, a cold start means that the ClamAV and Freshclam processes both need to start up. Both of those take up CPU/memory. In addition, Freshclam will need to update the local virus database from the EFS filesystem.

As to why the cold starts are happening so frequently, that would need to be answered by AWS Support.

@jrb331
Author

jrb331 commented Jul 19, 2024

Is it possible to increase the verbosity of the logs? Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

@dontirun
Contributor

Is it possible to increase the verbosity of the logs

What specifically are you looking for? It's possible to add log statements as the Scanning Lambda function enters each "phase" of the process. The Lambda does log when a specific process throws an error, but timeouts wouldn't be logged in CloudWatch Logs as the execution halts.

@jrb331
Author

jrb331 commented Jul 19, 2024

I'd like to see if I can narrow in on where the execution is timing out. Adding log statements at each phase sounds like it would help achieve this. Could you help me understand how to enable this?

@dontirun
Contributor

Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

This is going to be use case specific (file type, file size, scan frequency). I haven't performed testing that would provide any real guidance. I have personally seen that larger configurations result in faster executions both in cold starts and overall.
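For completeness, if the installed version of the construct doesn't expose a memory option directly, a generic CDK escape hatch along these lines can raise the memory on the functions it creates. This is only a sketch using standard CDK APIs, not an officially supported option of the construct, and it touches every Lambda function under the construct, so in practice you would filter for the scan function's logical ID:

```typescript
// Sketch: raise memory via a CDK escape hatch. node.findAll() and CfnFunction
// are standard aws-cdk-lib APIs; the bucket and construct IDs are placeholders.
import { Stack, StackProps } from "aws-cdk-lib";
import { CfnFunction } from "aws-cdk-lib/aws-lambda";
import { Bucket } from "aws-cdk-lib/aws-s3";
import { Construct } from "constructs";
import { ServerlessClamscan } from "cdk-serverless-clamscan";

export class ScanStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const uploads = new Bucket(this, "UploadsBucket"); // placeholder bucket
    const scan = new ServerlessClamscan(this, "ClamScan", { buckets: [uploads] });

    // Bump every Lambda function the construct creates to 10,240 MB.
    // Lambda allocates CPU proportionally to memory, which also shortens the
    // ClamAV/Freshclam startup work done on cold starts.
    for (const child of scan.node.findAll()) {
      if (child instanceof CfnFunction) {
        child.memorySize = 10240;
      }
    }
  }
}
```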

@jrb331
Author

jrb331 commented Jul 19, 2024

Amazing. Thank you for that. Is it possible to pull these changes into my project without a new version of the cdk-serverless-clamscan package?

@dontirun
Contributor

Amazing. Thank you for that. Is it possible to pull these changes into my project without a new version of the cdk-serverless-clamscan package?

You would have to manually edit your local files to reflect the changes. I cannot guarantee that manual changes like this won't result in a failed deployment.

@jrb331
Author

jrb331 commented Jul 19, 2024

Apologies if I'm misunderstanding your recommendation, but I don't host the files of this repo locally. I import them into my node project via the cdk-serverless-clamscan package and instantiate a ServerlessClamscan object in my cdk code. Would I have to host my own custom fork of this repo in order to achieve enhanced logging?

@jrb331
Author

jrb331 commented Jul 19, 2024

Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

This is going to be use case specific (file type, file size, scan frequency). I haven't performed testing that would provide any real guidance. I have personally seen that larger configurations result in faster executions both in cold starts and overall.

Also, FWIW, I tried increasing the lambda memory to 10,240 MB and am still facing the same issue. I also validated that the files being sent to the construct have sizes on the order of KBs, which I've been able to handle in the past.

@dontirun
Contributor

Apologies if I'm misunderstanding your recommendation, but I don't host the files of this repo locally. I import them into my node project via the cdk-serverless-clamscan package and instantiate a ServerlessClamscan object in my cdk code. Would I have to host my own custom fork of this repo in order to achieve enhanced logging?

I recommend that you update to the latest version. If you don't want to (not recommended), you would need to directly modify the files in your node_modules folder and then deploy.
