Summary
A memory leak in cloudgene forces frequent restarts; the most recent deployment lasted roughly three weeks before reaching the danger threshold for intervention. Symptoms include decreased export performance, elevated 5xx error rates to users, and slow site responsiveness (90th-percentile response time > 10 seconds).
We have consistently seen evidence of memory leaks across our last several deploys. As usage increases, these leaks force restarts increasingly often, potentially more than once per month.
Details / evidence
Initial limited exploration suggests a memory leak in the S3 upload functionality, among other places. Below are the top 50 entries from jmap -histo <pid>, taken just before a recent restart. Note that there are 842k instances of com.amazonaws.services.s3.model.UploadPartRequest, even after the queue was completely drained and no jobs or uploads were in progress.
This appears to be putting pressure on the garbage collector. (When evaluating total GC run time, reported in seconds, keep in mind that the site was redeployed to fresh hardware roughly one month ago.)
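We have not yet traced the exact retention path in cloudgene, but the histogram is consistent with part requests being held in a long-lived collection that is never cleared once a multipart upload finishes. A minimal, purely hypothetical sketch of that kind of pattern (class and field names are ours, not cloudgene's):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;
import com.amazonaws.services.s3.model.UploadPartResult;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the suspected leak shape, not cloudgene's actual code.
// A long-lived (e.g. per-service) collection that accumulates part requests across
// uploads and is never cleared would retain every UploadPartRequest, which would
// match the histogram above.
public class LeakySketchUploader {

    // Retention point: this list outlives individual uploads and is never cleared.
    private final List<UploadPartRequest> allRequests = new ArrayList<>();

    public List<PartETag> uploadParts(AmazonS3 s3, String bucket, String key,
                                      String uploadId, File file, long partSize) {
        List<PartETag> etags = new ArrayList<>();
        long contentLength = file.length();
        long position = 0;
        for (int partNumber = 1; position < contentLength; partNumber++) {
            long size = Math.min(partSize, contentLength - position);
            UploadPartRequest request = new UploadPartRequest()
                    .withBucketName(bucket)
                    .withKey(key)
                    .withUploadId(uploadId)
                    .withPartNumber(partNumber)
                    .withFile(file)
                    .withFileOffset(position)
                    .withPartSize(size);
            allRequests.add(request);   // never removed: each request stays reachable
            UploadPartResult result = s3.uploadPart(request);
            etags.add(result.getPartETag());
            position += size;
        }
        return etags;
    }
}
```

If something like this is present, the fix would be to drop (or never collect) the UploadPartRequest references once the multipart upload has been completed or aborted, keeping only the PartETags needed to finish the upload.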
The cloudgene Java process is by far the largest memory user on the system.
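To track this more systematically than a single pre-restart snapshot, a small logging hook could report cumulative GC time and heap usage against process uptime. A sketch using the standard java.lang.management beans (not currently part of cloudgene; the class name and logging interval are placeholders):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Minimal sketch: periodically log cumulative GC time, GC count, and heap usage so
// growth can be plotted against deployment uptime instead of being checked only
// just before a restart.
public class GcPressureLogger {

    public static void logOnce() {
        long totalGcMillis = 0;
        long totalGcCount = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime()/getCollectionCount() return -1 if unsupported.
            totalGcMillis += Math.max(gc.getCollectionTime(), 0);
            totalGcCount += Math.max(gc.getCollectionCount(), 0);
        }
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();

        System.out.printf(
                "uptime=%ds gcTime=%ds gcCount=%d heapUsed=%dMB heapMax=%dMB%n",
                uptimeMillis / 1000,
                totalGcMillis / 1000,
                totalGcCount,
                heap.getUsed() / (1024 * 1024),
                heap.getMax() / (1024 * 1024));
    }

    public static void main(String[] args) throws InterruptedException {
        // Log every 10 minutes; in practice this would hook into the application's
        // existing scheduler or metrics pipeline instead of a standalone loop.
        while (true) {
            logOnce();
            Thread.sleep(10 * 60 * 1000L);
        }
    }
}
```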