[Bug]: L0 compaction cannot keep up with upsert; dataNode memory usage suddenly increases for no apparent reason #34258

Open
ThreadDao opened this issue Jun 28, 2024 · 0 comments
Labels: kind/bug, severity/critical, triage/accepted

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20240624-59d91032-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Deploy Milvus with the following dataNode resources and config:

    dataNode:
      replicas: 1
      resources:
        limits:
          cpu: "8" 
          memory: 16Gi
        requests:
          cpu: "4" 
          memory: 8Gi 
  config:
    log:
      level: debug
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1

test steps

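  1. create a collection with 1024 partitions (partition key enabled) and 1 shard
  2. create an index
  3. insert the 1b-128d dataset (1 billion 128-dimensional vectors), then flush
  4. build the index again
  5. run concurrent upserts with the client config shown in the screenshot
    image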
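
For anyone trying to reproduce this outside the Argo workflow, the test steps above can be approximated with pymilvus roughly as follows. This is a scaled-down sketch, not the actual test client. The URI, collection name, field names, index type and parameters, row counts, and partition-key values are all assumptions, and step 4 is interpreted here as calling `create_index` a second time. The real test upserts concurrently with a client config that is only visible in the screenshot, so the single-threaded loop below is a placeholder for it.

```python
import random
from typing import List

DIM = 128  # vector dimension from the test steps


def make_batch(start_id: int, size: int, dim: int = DIM) -> List[list]:
    """Build one column-ordered batch: primary keys, partition keys, vectors."""
    ids = list(range(start_id, start_id + size))
    keys = [i % 1024 for i in ids]  # hypothetical partition-key values
    vectors = [[random.random() for _ in range(dim)] for _ in ids]
    return [ids, keys, vectors]


def run(uri: str = "http://localhost:19530", rows: int = 10_000, batch: int = 1_000) -> None:
    """Scaled-down version of the five test steps (names and sizes are assumptions)."""
    # Imported lazily so make_batch() can be used without pymilvus installed.
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

    connections.connect(uri=uri)

    # Step 1: partition-key collection with 1024 partitions and 1 shard.
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema("key", DataType.INT64, is_partition_key=True),
        FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
    ])
    coll = Collection("l0_compaction_repro", schema, shards_num=1, num_partitions=1024)

    # Step 2: create index (index type and params are assumptions).
    index = {"index_type": "HNSW", "metric_type": "L2",
             "params": {"M": 8, "efConstruction": 200}}
    coll.create_index("vec", index)

    # Step 3: insert, then flush.
    for start in range(0, rows, batch):
        coll.insert(make_batch(start, batch))
    coll.flush()

    # Step 4: index again (interpreted as a second create_index call).
    coll.create_index("vec", index)

    # Step 5: upsert the same primary keys; the real test does this concurrently.
    for start in range(0, rows, batch):
        coll.upsert(make_batch(start, batch))
```

Calling `run()` needs a reachable Milvus instance. Upserting existing primary keys is what produces the delete records that end up in L0 segments, so the upsert loop reuses the ID range from the insert.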

result

  1. L0 compaction cannot keep up with upsert
    After about 5 hours, L0 compaction can no longer keep up with the upsert load, and the number of flushed L0 segments increases significantly.
    A possible reason is that the memory available to L0 compaction is relatively small while the upsert is running: once the upsert completes and more memory becomes available, the L0-compaction latency drops to 13 minutes.
    grafana link
    image

  2. L0 compaction after upsert completed
    grafana link
    image

  3. dataNode memory usage during L0-compaction
    After the upsert completes, the memory available to L0 compaction increases and the L0-compaction latency drops significantly. However, dataNode memory usage rises to 80% for no apparent reason.
    grafana link
    image
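
The backlog pattern in results 1 and 2 matches what a simple rate model would predict: while L0 segments are flushed faster than they are compacted, the backlog grows linearly, and it only drains once upsert stops and compaction throughput rises. The toy model below illustrates that shape. The rates are made-up numbers, not values measured in this test, and `simulate_l0_backlog` is a hypothetical helper, not part of Milvus.

```python
from typing import List, Tuple


def simulate_l0_backlog(phases: List[Tuple[int, float, float]]) -> List[float]:
    """Track the number of flushed L0 segments waiting for compaction.

    Each phase is (minutes, flushed_per_min, compacted_per_min).
    Returns the backlog after every simulated minute.
    """
    backlog = 0.0
    history = []
    for minutes, flushed_per_min, compacted_per_min in phases:
        for _ in range(minutes):
            # The backlog cannot go below zero: compaction idles when nothing is queued.
            backlog = max(0.0, backlog + flushed_per_min - compacted_per_min)
            history.append(backlog)
    return history


if __name__ == "__main__":
    history = simulate_l0_backlog([
        (300, 2.0, 1.5),  # upsert running: flush outpaces compaction (assumed rates)
        (60, 0.0, 3.0),   # upsert finished: no new L0 segments, faster compaction
    ])
    print("peak backlog:", max(history), "final backlog:", history[-1])
```

If the real flush and compaction rates from the grafana panels are plugged in, the same comparison shows whether the backlog is explained by throughput alone or whether something else (for example the memory limit suspected above) is throttling compaction.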

Expected Behavior

No response

Steps To Reproduce

1. https://argo-workflows.zilliz.cc/archived-workflows/qa/b45e3176-1bbc-4881-9e86-69d30b71af13?nodeId=compact-opt-1b-no-flush-1

2. https://argo-workflows.zilliz.cc/archived-workflows/qa/3337f476-2a62-4b1b-8d2a-2364a449942c?nodeId=compact-opt-1b-no-flush-1a

Milvus Log

compact-no-flush-1b1-etcd-0                                       1/1     Running                           0               3d18h   10.104.33.195   4am-node36   <none>           <none>
compact-no-flush-1b1-etcd-1                                       1/1     Running                           0               3d18h   10.104.32.62    4am-node39   <none>           <none>
compact-no-flush-1b1-etcd-2                                       0/1     Pending                           0               88m     <none>          <none>       <none>           <none>
compact-no-flush-1b1-milvus-datanode-6fb95d86b6-8l7l7             1/1     Running                           0               109m    10.104.19.166   4am-node28   <none>           <none>
compact-no-flush-1b1-milvus-datanode-6fb95d86b6-n5wdm             0/1     Completed                         25 (25h ago)    3d17h   10.104.18.77    4am-node25   <none>           <none>
compact-no-flush-1b1-milvus-indexnode-556f885dbd-hftd2            1/1     Running                           0               3d17h   10.104.19.8     4am-node28   <none>           <none>
compact-no-flush-1b1-milvus-indexnode-556f885dbd-qb4t5            1/1     Running                           0               3d17h   10.104.13.180   4am-node16   <none>           <none>
compact-no-flush-1b1-milvus-indexnode-556f885dbd-vfmsj            1/1     Running                           0               3d17h   10.104.1.31     4am-node10   <none>           <none>
compact-no-flush-1b1-milvus-mixcoord-58dcd87968-r5dk9             1/1     Running                           0               3d17h   10.104.5.71     4am-node12   <none>           <none>
compact-no-flush-1b1-milvus-proxy-7b79848cbc-mc647                1/1     Running                           0               3d17h   10.104.4.222    4am-node11   <none>           <none>
compact-no-flush-1b1-milvus-querynode-0-787ff474fd-7hg4r          1/1     Running                           54 (2d3h ago)   3d17h   10.104.14.186   4am-node18   <none>           <none>
compact-no-flush-1b1-minio-0                                      1/1     Running                           0               3d18h   10.104.33.196   4am-node36   <none>           <none>
compact-no-flush-1b1-minio-1                                      0/1     Pending                           0               87m     <none>          <none>       <none>           <none>
compact-no-flush-1b1-minio-2                                      1/1     Running                           0               3d18h   10.104.32.64    4am-node39   <none>           <none>
compact-no-flush-1b1-minio-3                                      1/1     Running                           0               3d18h   10.104.30.76    4am-node38   <none>           <none>
compact-no-flush-1b1-pulsar-bookie-0                              1/1     Running                           0               3d18h   10.104.33.198   4am-node36   <none>           <none>
compact-no-flush-1b1-pulsar-bookie-1                              1/1     Running                           0               3d18h   10.104.32.66    4am-node39   <none>           <none>
compact-no-flush-1b1-pulsar-bookie-2                              0/1     Pending                           0               91m     <none>          <none>       <none>           <none>
compact-no-flush-1b1-pulsar-broker-0                              1/1     Running                           0               3d18h   10.104.4.220    4am-node11   <none>           <none>
compact-no-flush-1b1-pulsar-proxy-0                               1/1     Running                           0               3d18h   10.104.1.29     4am-node10   <none>           <none>
compact-no-flush-1b1-pulsar-recovery-0                            1/1     Running                           0               3d18h   10.104.5.69     4am-node12   <none>           <none>
compact-no-flush-1b1-pulsar-zookeeper-0                           1/1     Running                           0               3d18h   10.104.33.197   4am-node36   <none>           <none>
compact-no-flush-1b1-pulsar-zookeeper-1                           1/1     Running                           0               3d18h   10.104.32.68    4am-node39   <none>           <none>
compact-no-flush-1b1-pulsar-zookeeper-2                           1/1     Running                           0               3d18h   10.104.17.96    4am-node23   <none>           <none>
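
The listing above already shows several pods that deserve a closer look: etcd-2, minio-1 and pulsar-bookie-2 are Pending, one dataNode pod is Completed after 25 restarts, and the queryNode has restarted 54 times. A small script along these lines can pull those rows out of a listing in this format. It is a sketch that assumes the columns are separated by two or more spaces, as in the output pasted here, and `parse_pods` / `needs_attention` are hypothetical helper names.

```python
import re
from typing import Dict, List


def parse_pods(text: str) -> List[Dict[str, object]]:
    """Parse a pod listing in the wide format shown above into dicts."""
    pods = []
    for line in text.splitlines():
        # Columns are separated by runs of 2+ spaces; the restarts column
        # itself may contain single spaces, e.g. "25 (25h ago)".
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 5:
            continue
        name, ready, status, restarts, age = cols[:5]
        count = restarts.split()[0]
        if not count.isdigit():
            continue  # skips a header row if one is present
        pods.append({"name": name, "ready": ready, "status": status,
                     "restarts": int(count), "age": age})
    return pods


def needs_attention(pods: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return pods that are not Running or that have restarted at least once."""
    return [p for p in pods if p["status"] != "Running" or p["restarts"] > 0]
```

Feeding the listing from this issue through `needs_attention(parse_pods(listing))` would flag the three Pending pods, the Completed dataNode pod, and the restarting queryNode.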

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 28, 2024
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Jun 28, 2024
@ThreadDao ThreadDao added this to the 2.4.6 milestone Jun 28, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 29, 2024
@yanliang567 yanliang567 removed their assignment Jun 29, 2024