Potential memory leak (in s3util?) #7

Open
abought opened this issue Mar 31, 2024 · 0 comments

abought commented Mar 31, 2024

Summary

A memory leak in cloudgene forces frequent restarts: the most recent deployment lasted ~3 weeks before reaching the danger threshold for intervention. Symptoms include degraded export performance, elevated 5xx error rates to users, and slow site responsiveness (90th percentile response time > 10 seconds).

We have consistently seen evidence of memory leaks and related issues across our last several deploys. As usage increases, these leaks force restarts more and more often, potentially more than once per month.

Details / evidence

Initial, limited exploration suggests there may be a memory leak in the S3 upload functionality, among other places. Below are the top 50 entries from jmap -histo <pid>, taken just before a recent restart. Note the 842k instances of com.amazonaws.services.s3.model.UploadPartRequest, even after the queue was completely drained and no jobs or uploads were in progress.

 num     #instances         #bytes  class name
----------------------------------------------
   1:     177150799     5668825568  java.util.Hashtable$Entry
   2:     173521201     5552678432  java.util.concurrent.ConcurrentHashMap$Node
   3:        370453     2292378272  [Ljava.util.concurrent.ConcurrentHashMap$Node;
   4:        280325     1744738896  [Ljava.util.Hashtable$Entry;
   5:      57184250     1376673256  [Ljava.lang.String;
   6:       5696954     1116222200  [B
   7:       9562157     1030570944  [C
   8:       1291734      913105408  [I
   9:       9087848      218108352  java.lang.String
  10:       2188009      110985472  [Ljava.lang.Object;
  11:        842030       94307360  com.amazonaws.services.s3.model.UploadPartRequest
  12:       1590783       76357584  java.util.HashMap
  13:        545638       65116480  java.lang.Class
  14:       1600795       38419080  com.google.protobuf.ByteString$LiteralByteString
  15:        842326       33693040  java.util.EnumMap
  16:        957209       30630688  java.util.HashMap$Node
  17:        930378       29772096  java.util.concurrent.FutureTask
  18:        461692       29548288  java.util.concurrent.ConcurrentHashMap
  19:        395224       28456128  org.apache.hadoop.yarn.proto.YarnProtos$ResourceInformationProto
  20:       1745499       27927984  java.lang.Object
  21:        132294       26458800  org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage
  22:       1075438       25810512  java.util.ArrayList
  23:        249083       24671232  [Ljava.util.HashMap$Node;
  24:        929288       22302912  com.amazonaws.services.s3.model.PartETag
  25:        842216       20213184  com.amazonaws.RequestClientOptions
  26:        841642       20199408  com.amazonaws.services.s3.transfer.internal.UploadPartCallable
  27:        840771       20178504  java.util.concurrent.LinkedBlockingQueue$Node
  28:       1032731       16523696  com.mysql.cj.result.StringValueFactory
  29:        263529       14757624  org.apache.hadoop.yarn.proto.YarnProtos$StringLongMapProto
  30:        127995       14335440  java.util.GregorianCalendar
  31:        144545       13886256  [[B
  32:        278233       13355184  java.util.Properties
  33:        197610       12647040  org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto
  34:        128288       12315648  sun.util.calendar.Gregorian$Date
  35:        479362       11504688  java.util.Collections$UnmodifiableRandomAccessList
  36:        286710       11468400  java.util.LinkedHashMap$Entry
  37:         65857       11063976  org.apache.hadoop.yarn.proto.YarnProtos$ApplicationReportProto
  38:        193168       10817408  org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto
  39:        125224       10017920  java.net.URI
  40:        138024        8833536  org.apache.hadoop.ipc.Client$ConnectionId
  41:        136587        8741568  cloudgene.mapred.core.User
  42:        264358        8459456  sun.nio.fs.UnixPath
  43:        173599        8332752  java.nio.HeapByteBuffer
  44:         65865        7376880  org.apache.hadoop.yarn.proto.YarnProtos$ApplicationResourceUsageReportProto
  45:        125767        7042952  java.util.Calendar$Builder
  46:        166697        6667880  java.lang.ref.Finalizer
  47:        138024        6625152  org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker
  48:        199409        6381088  java.util.LinkedList
  49:        173590        6182232  [J
  50:        128666        6175968  freemarker.core.Token
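
The counts of UploadPartRequest (842k), UploadPartCallable (841k), and LinkedBlockingQueue$Node (840k) track each other closely, which looks like part-upload tasks queued inside a TransferManager executor that were never run or released. One pattern that could produce this is a TransferManager that is created per upload and never shut down, so its queue stays reachable; I have not traced cloudgene's s3util yet, so treat the following only as a sketch of the AWS SDK v1 usage pattern that avoids retaining per-part state, not a diagnosis of the actual code. The class/method names are from the SDK; bucket/key/file and the helper names are placeholders.

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class S3UploadSketch {

    // One long-lived TransferManager for the whole process; it owns the thread
    // pool and internal queue that hold per-part upload state while transfers run.
    private static final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private static final TransferManager tm =
            TransferManagerBuilder.standard().withS3Client(s3).build();

    // Hypothetical upload helper; bucket/key/file are placeholders.
    static void upload(String bucket, String key, File file) throws InterruptedException {
        Upload upload = tm.upload(bucket, key, file);
        // Block until the multipart upload completes (or fails), so the SDK can
        // release the part requests it queued for this transfer.
        upload.waitForCompletion();
    }

    // Called once on service shutdown: stops the transfer threads and discards
    // anything still queued; 'true' also shuts down the wrapped S3 client.
    static void shutdown() {
        tm.shutdownNow(true);
    }
}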

This appears to be putting pressure on the garbage collector. (When evaluating the GC total run time, reported in seconds, keep in mind that the site was redeployed to fresh hardware roughly one month ago; a rough breakdown follows the jstat output below.)

[user@ ~]$ sudo jstat -gc pid
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 2048.0 896.0   0.0   2750464.0 1239804.9 20970496.0 19587020.8 3558952.0 3167623.3 444208.0 333253.0 446473 16369.459 1701  6032.675 22402.134

[user@ ~]$ sudo jstat -gccapacity pid
 NGCMN    NGCMX     NGC     S0C   S1C       EC      OGCMN      OGCMX       OGC         OC       MCMN     MCMX      MC     CCSMN    CCSMX     CCSC    YGC    FGC
698880.0 10484736.0 8955392.0 1024.0 2048.0 2750464.0  1398272.0 20970496.0 20970496.0 20970496.0      0.0 4163584.0 3558952.0      0.0 1048576.0 444208.0 446473  1701

[user@ ~]$ sudo jstat -gccause pid
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC
 87.50   0.00  46.19  93.40  89.00  75.02 446473 16369.459  1701 6032.675 22402.134 Allocation Failure   No GC

[user@ ~]$ sudo jstat -gcold pid
   MC       MU      CCSC     CCSU       OC          OU       YGC    FGC    FGCT     GCT
3558952.0 3167623.3 444208.0 333253.0  20970496.0  19587020.8 446473  1701 6032.675 22402.134

[user@ ~]$ sudo jstat -gcoldcapacity pid
   OGCMN       OGCMX        OGC         OC       YGC   FGC    FGCT     GCT
  1398272.0  20970496.0  20970496.0  20970496.0 446473  1701 6032.675 22402.134

[user@ ~]$ sudo jstat -gcutil pid
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
 87.50   0.00  46.20  93.40  89.00  75.02 446473 16369.459  1701 6032.675 22402.134
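
Rough back-of-the-envelope on those numbers, assuming ~30 days of uptime since the redeploy:

uptime     ≈ 30 days ≈ 2,592,000 s
GCT        = 22,402 s                      → roughly 0.9% of wall time spent in GC
FGC / FGCT = 1,701 full GCs / 6,033 s      → ~3.5 s per full GC
OU / OC    = 19,587,020.8 KB / 20,970,496 KB → ~93% old-gen occupancy

The total time spent in GC is modest, but the old generation staying ~93% full despite 1,701 full collections is consistent with the slowdown described above.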

The cloudgene server process is by far the largest memory user on the system:

[hadoop@ip-10-120-4-95 ~]$ ps aux --sort=%mem,%cpu | grep 20283
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user     pid  487 19.2 50717600 37628456 ?   Sl   Feb26 242354:41 java <redacted> cloudgene.mapred.CommandLineInterface server