
Intermittent Lambda Timeout #1259

Open
jrb331 opened this issue Jul 18, 2024 · 15 comments

Comments

@jrb331

jrb331 commented Jul 18, 2024

I have had the CDK stack running in production for roughly 3 months without issue. Starting earlier this week, we are seeing an issue where the lambdas time out after the 10-minute max. S3 objects are stuck with a scan-status tag of IN PROGRESS, or sometimes the scan hangs before they are even tagged. I saw this issue describing a similar problem, but updating the cdk-serverless-clamscan package to 2.6.230 did not seem to resolve the issue. I would appreciate any guidance you have here. I'm happy to provide any additional detail.
[Screenshot attached: 2024-07-18 3:11 PM]
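For reference, a rough helper along these lines can audit which objects are stuck with the scan-status values described above (just a sketch using the AWS SDK for JavaScript v3, not part of the construct; the bucket name is a placeholder):

```typescript
// audit-stuck-scans.ts — sketch: list objects whose scan-status tag is still
// "IN PROGRESS" or missing entirely. BUCKET_NAME is a placeholder for the
// scanned bucket; the tag key/value match the values described above.
import {
  S3Client,
  ListObjectsV2Command,
  GetObjectTaggingCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET_NAME = "my-scanned-bucket"; // placeholder

async function findStuckObjects(): Promise<void> {
  let token: string | undefined;
  do {
    const page = await s3.send(
      new ListObjectsV2Command({ Bucket: BUCKET_NAME, ContinuationToken: token })
    );
    for (const obj of page.Contents ?? []) {
      if (!obj.Key) continue;
      const tagging = await s3.send(
        new GetObjectTaggingCommand({ Bucket: BUCKET_NAME, Key: obj.Key })
      );
      const status = tagging.TagSet?.find((t) => t.Key === "scan-status")?.Value;
      if (status === undefined || status === "IN PROGRESS") {
        console.log(`${obj.Key}: ${status ?? "<no scan-status tag>"}`);
      }
    }
    token = page.NextContinuationToken;
  } while (token);
}

findStuckObjects().catch(console.error);
```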

@dontirun
Contributor

Are there any error messages in the CloudWatch Logs?

@jrb331
Author

jrb331 commented Jul 19, 2024

All logs show a "Task timed out after 600 seconds" error message. Below are a handful of them. Is there any way to increase the verbosity of the logs?
log-events-viewer-result.xlsx

@jrb331
Author

jrb331 commented Jul 19, 2024

Also potentially worth noting: some objects are stuck in an IN PROGRESS state (which occurs here) after lambda failure, while others time out before the object gets tagged at all. That is the only indication of where the issue might be happening that I've been able to discern thus far.

@jrb331
Author

jrb331 commented Jul 19, 2024

[Screenshot attached: metrics timeline, 2024-07-19 12:41 PM]

This timeline of metrics illustrates the problem well, I think. I began with small numbers of concurrent uploads (75 every 5 minutes) and saw no issue. I then increased the number of concurrent uploads, first to 150 and then to 300. As you can see, there are still no durations anywhere near long enough to cause a timeout, and 100% of the executions were successful. Once I bumped the number up to 1000, however, the problem resurfaced. You can see the execution duration balloon to the maximum configured for the lambda, errors surface in bulk, the 3 retries get attempted (unsuccessfully), and finally the majority of events get dropped. You can also see that I'm maxing out at a concurrency of ~800. I don't have any other lambdas in this account that fire frequently enough to push me over the 1,000 concurrent executions limit. Hopefully this additional context is helpful.
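As a sanity check on that concurrency ceiling, something along these lines (a sketch using the AWS SDK for JavaScript v3) prints the account-level concurrent-execution limit and how much of it is unreserved:

```typescript
// check-concurrency.ts — sketch: print the account's Lambda concurrency limits
// to confirm whether the ~800 ceiling comes from the account quota or from
// reserved capacity consumed elsewhere in the account.
import {
  LambdaClient,
  GetAccountSettingsCommand,
} from "@aws-sdk/client-lambda";

const client = new LambdaClient({});

async function main(): Promise<void> {
  const settings = await client.send(new GetAccountSettingsCommand({}));
  console.log("Concurrent executions limit:", settings.AccountLimit?.ConcurrentExecutions);
  console.log("Unreserved concurrency:", settings.AccountLimit?.UnreservedConcurrentExecutions);
}

main().catch(console.error);
```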

@jrb331
Author

jrb331 commented Jul 19, 2024

[Screenshots attached: log views correlating cold_start with execution duration, 2024-07-19 12:51 PM and 12:52 PM]

It seems like cold starts may be related. Some observations:

  • As you can see in the first photo, the logs where cold_start = 0 are correlated with low lambda execution times.
  • Similarly, the logs where cold_start = 1 are correlated with longer durations and timeouts. This of course makes sense because an execution will take longer when it first has to initialize an environment.
  • These correlations are true even when the lambda does not time out. For example, the jump in duration around the 11:30 mark corresponds with a spike in cold starts at the same time.
  • These cold starts also correspond to times when the number of concurrent requests increases. This also makes sense, because more requests mean additional environments need to be provisioned, and thus more cold starts.

The questions I have are the following: Why would an increased quantity of cold starts increase the amount of time each individual lambda takes to execute?

The above screenshots show the cold starts were required even for the batches that all executed successfully, so why does the task duration balloon when a large number of lambdas require a cold start?
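One way to put numbers on the cold-start/duration correlation is a CloudWatch Logs Insights query against the scan function's log group. Below is a sketch using the AWS SDK for JavaScript v3; the log group name is a placeholder, and the query keys off the Lambda platform REPORT fields (@initDuration only appears on cold starts) rather than the application-level cold_start field shown in the screenshots:

```typescript
// cold-start-durations.ts — sketch: compare durations of cold vs. warm
// invocations over the last hour. LOG_GROUP is a placeholder for the scan
// function's log group.
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({});
const LOG_GROUP = "/aws/lambda/<scan-function-name>"; // placeholder

const QUERY = `
filter @type = "REPORT"
| fields ispresent(@initDuration) as cold_start
| stats count(*) as invocations, avg(@duration) as avg_ms, max(@duration) as max_ms by cold_start
`;

async function main(): Promise<void> {
  const now = Math.floor(Date.now() / 1000);
  const { queryId } = await logs.send(
    new StartQueryCommand({
      logGroupName: LOG_GROUP,
      startTime: now - 3600,
      endTime: now,
      queryString: QUERY,
    })
  );

  // Poll until the query finishes, then print the aggregated rows.
  for (;;) {
    const res = await logs.send(new GetQueryResultsCommand({ queryId }));
    if (res.status === "Complete") {
      console.log(JSON.stringify(res.results, null, 2));
      return;
    }
    if (res.status === "Failed" || res.status === "Cancelled") {
      throw new Error(`Query ${res.status}`);
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}

main().catch(console.error);
```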

@dontirun
Contributor

The questions I have are the following: Why would an increased quantity of cold starts increase the amount of time each individual lambda takes to execute?
The above screenshots show the cold starts were required even for the batches that all executed successfully, so why does the task duration balloon when a large number of lambdas require a cold start?

From the perspective of the Lambda application code, a cold start means that the ClamAV and Freshclam processes both need to start up. Both of those take up CPU/memory. In addition, Freshclam will need to update the local virus database from the EFS filesystem.

As to why the cold starts are happening so frequently, that would need to be answered by AWS Support.

@jrb331
Author

jrb331 commented Jul 19, 2024

Is it possible to increase the verbosity of the logs? Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

@dontirun
Contributor

Is it possible to increase the verbosity of the logs

What specifically are you looking for? It's possible to add log statements as the Scanning Lambda function enters each "phase" of the process. The Lambda does log when a specific process throws an error, but timeouts wouldn't be logged in CloudWatch Logs as the execution halts.

@jrb331
Author

jrb331 commented Jul 19, 2024

I'd like to see if I can narrow in on where the execution is timing out. Adding log statements at each phase sounds like it would help achieve this. Could you help me understand how to enable this?

@dontirun
Contributor

Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

This is going to be use case specific (file type, file size, scan frequency). I haven't performed testing that would provide any real guidance. I have personally seen that larger configurations result in faster executions both in cold starts and overall.
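For completeness, if the installed version of the construct doesn't expose a memory option directly, a generic CDK escape hatch along these lines can raise the memory on the functions it creates. This is only a sketch using standard CDK APIs, not an officially supported option of the construct, and it touches every Lambda function under the construct, so in practice you would filter for the scan function's logical ID:

```typescript
// Sketch: raise memory via a CDK escape hatch. node.findAll() and CfnFunction
// are standard aws-cdk-lib APIs; the bucket and construct IDs are placeholders.
import { Stack, StackProps } from "aws-cdk-lib";
import { CfnFunction } from "aws-cdk-lib/aws-lambda";
import { Bucket } from "aws-cdk-lib/aws-s3";
import { Construct } from "constructs";
import { ServerlessClamscan } from "cdk-serverless-clamscan";

export class ScanStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const uploads = new Bucket(this, "UploadsBucket"); // placeholder bucket
    const scan = new ServerlessClamscan(this, "ClamScan", { buckets: [uploads] });

    // Bump every Lambda function the construct creates to 10,240 MB.
    // Lambda allocates CPU proportionally to memory, which also shortens the
    // ClamAV/Freshclam startup work done on cold starts.
    for (const child of scan.node.findAll()) {
      if (child instanceof CfnFunction) {
        child.memorySize = 10240;
      }
    }
  }
}
```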

@jrb331
Author

jrb331 commented Jul 19, 2024

Amazing. Thank you for that. Is it possible to pull these changes into my project without a new version of the cdk-serverless-clamscan package?

@dontirun
Contributor

Amazing. Thank you for that. Is it possible to pull these changes into my project without a new version of the cdk-serverless-clamscan package?

You would have to manually edit your local files to reflect the changes. I cannot guarantee that manual changes like this won't result in a failed deployment.

@jrb331
Author

jrb331 commented Jul 19, 2024

Apologies if I'm misunderstanding your recommendation, but I don't host the files of this repo locally. I import them into my node project via the cdk-serverless-clamscan package and instantiate a ServerlessClamscan object in my cdk code. Would I have to host my own custom fork of this repo in order to achieve enhanced logging?

@jrb331
Author

jrb331 commented Jul 19, 2024

Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

This is going to be use case specific (file type, file size, scan frequency). I haven't performed testing that would provide any real guidance. I have personally seen that larger configurations result in faster executions both in cold starts and overall.

Also, FWIW, I tried increasing the lambda memory to 10,240 MB and am still facing the same issue. I also validated that the files being sent to the construct have sizes on the order of KBs, which I've been able to handle in the past.

@dontirun
Contributor

Apologies if I'm misunderstanding your recommendation, but I don't host the files of this repo locally. I import them into my node project via the cdk-serverless-clamscan package and instantiate a ServerlessClamscan object in my cdk code. Would I have to host my own custom fork of this repo in order to achieve enhanced logging?

I recommend that you update to the latest version. If you don't want to (not recommended), you would need to directly modify the files in your node_modules folder and then deploy.
