[Bug]: IndexNode OOM after upgrading from 2.3.12 to 2.4.4 #34273

Open

artinshahverdian opened this issue Jun 28, 2024 · 7 comments

Labels
help wanted Extra attention is needed

Comments
@artinshahverdian

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.4
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): kafka
- SDK version (e.g. pymilvus v2.0.0rc2): N/A
- OS (Ubuntu or CentOS):
- CPU/Memory: 4 vCPU / 8GB
- GPU: N/A
- Others: N/A

Current Behavior

I am running Milvus 2.4.4 in cluster mode on AWS EKS. I am seeing the indexnode crash while it is building an index. I have just upgraded from 2.3.12 to 2.4.4 and have a dedicated node group for the indexnode. The machine has 8GB of memory. Why would the indexnode work fine in 2.3.12 with the same memory but get OOM-killed after upgrading to 2.4.4? Is there anything I'm missing? Logs for the indexnode, at info level, are included from startup until the crash.
After upgrading to a 16GB node, the memory usage didn't go above 6GB; it dropped multiple times and grew again. I suspect Milvus is not monitoring its memory usage and doesn't trigger garbage collection before using more memory.

My segment size and max segment size are the defaults; I have not overridden anything.

indexnode.log

Expected Behavior

The indexnode should work on an 8GB machine as it did in 2.3.12, and should run garbage collection periodically.

Steps To Reproduce

No response

Milvus Log

indexnode.log

Anything else?

No response

@artinshahverdian artinshahverdian added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 28, 2024
@yanliang567
Contributor

Checking the logs, I did not see anything suspicious. The default max segment size changed from 512MB to 1024MB; that is the only suspect I can think of. @artinshahverdian quick question: how did you observe the indexnode's memory usage?
@xiaocai2333 please help to double check.
/assign @xiaocai2333
/unassign
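
For observing indexnode memory directly: each Milvus component exposes Prometheus metrics, by default on port 9091, so the indexnode's resident memory can be read from its /metrics endpoint. A minimal sketch, assuming the default metrics port and that the pod has been port-forwarded to localhost:

# Sketch: read the indexnode's resident memory from its Prometheus metrics.
# Assumes: "kubectl port-forward <indexnode-pod> 9091:9091" is already running.
import urllib.request

metrics = urllib.request.urlopen("http://localhost:9091/metrics").read().decode()
for line in metrics.splitlines():
    # process_resident_memory_bytes is the standard Go process-collector metric
    if line.startswith("process_resident_memory_bytes"):
        print(f"indexnode RSS: {float(line.split()[-1]) / 1024 / 1024:.0f} MiB")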

@yanliang567 yanliang567 added help wanted Extra attention is needed and removed kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 29, 2024
@xiaocai2333
Contributor

[2024/06/28 02:26:52.193 +00:00] [INFO] [indexnode/indexnode_service.go:56] ["IndexNode building index ..."] [traceID=54bc5ee90b17f46a0dbf953e1779ce67] [clusterID=by-dev] [indexBuildID=450766058328658389] [collectionID=447757774238601435] [indexID=0] [indexName=] [indexFilePrefix=index_files] [indexVersion=18] [dataPaths="[]"] [typeParams="[{\"key\":\"dim\",\"value\":\"1536\"}]"] [indexParams="[{\"key\":\"M\",\"value\":\"16\"},{\"key\":\"index_type\",\"value\":\"HNSW\"},{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"efConstruction\",\"value\":\"50\"}]"] [numRows=398898] [current_index_version=4] [storepath=] [storeversion=0] [indexstorepath=] [dim=0]
[2024/06/28 02:26:52.920 +00:00] [INFO] [indexnode/task.go:516] ["index params are ready"] [buildID=450766058328658389] ["index params"="{\"M\":\"16\",\"dim\":\"1536\",\"efConstruction\":\"50\",\"index_type\":\"HNSW\",\"metric_type\":\"L2\"}"]

According to the log, the size of the new segment to be indexed is 398898 × 1536 × 4 / 1024 / 1024 ≈ 2337.29 MB. An 8GB indexnode is not sufficient for such a large segment. Please check whether you changed the segment's maxSize configuration during the upgrade, which might have caused compaction to generate larger segments.
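
For reference, that figure is the raw float32 vector size and can be reproduced from the fields in the build-index log entry; a minimal sketch:

# Back-of-the-envelope check of the segment size quoted above.
# Assumes float32 vectors (4 bytes per dimension); the numbers come from the log fields.
num_rows = 398_898   # numRows
dim = 1_536          # "dim" typeParam
bytes_per_float = 4  # float32

raw_bytes = num_rows * dim * bytes_per_float
print(f"raw vector data: {raw_bytes / 1024 / 1024:.2f} MB")  # ~2337.29 MB

Building an HNSW index needs this raw data in memory plus space for the graph and working buffers, so peak usage during the build is noticeably higher than the raw size.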

@yanliang567
Contributor

/assign @artinshahverdian
/unassign @xiaocai2333

@artinshahverdian
Author

artinshahverdian commented Jul 2, 2024

I can confirm the default segment size values changed in 2.4.4:

segment:
    maxSize: 1024 # Maximum size of a segment in MB
    diskSegmentMaxSize: 2048 # Maximum size of a segment in MB for collection which has Disk index

These are my configs now. If I reduce them to:

segment:
    maxSize: 512 # Maximum size of a segment in MB
    diskSegmentMaxSize: 1024 # Maximum size of a segment in MB for collection which has Disk index

and trigger compaction, will I get smaller segments so that I can use an 8GB machine for the indexNode, or can the existing segments no longer change?
cc: @xiaocai2333

@xiaocai2333
Contributor

There is no way to reduce the segment size through compaction. The recommended approach is to scale the indexnode memory up to 10GB; for a 2.3GB segment, 10GB of memory should be sufficient for building the index.
But it is strange: your index type is HNSW, yet the segment size is about 2GB, closer to the diskSegmentMaxSize limit that should only apply to collections with a disk index.
@artinshahverdian please confirm whether you have changed segment.maxSize or whether you have ever built a DISKANN index.

@artinshahverdian
Author

@xiaocai2333 I have not changed the segment size or built a disk index. Is there any way I can find the big segment and verify its size?
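
One rough way to check: pymilvus's utility.get_query_segment_info lists the segments currently loaded on the query nodes along with their row counts, which can be combined with the vector dimension to estimate raw size. A minimal sketch, assuming pymilvus 2.x; the connection details and collection name are placeholders, and only loaded segments are reported (the birdwatcher debugging tool can inspect segment metadata more directly):

# Sketch: list loaded segments and estimate their raw vector size.
# "your_collection_name" and the connection details are placeholders.
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

DIM = 1536           # vector dimension, from the build-index log above
BYTES_PER_FLOAT = 4  # float32

for seg in utility.get_query_segment_info("your_collection_name"):
    est_mb = seg.num_rows * DIM * BYTES_PER_FLOAT / 1024 / 1024
    print(f"segment {seg.segmentID}: rows={seg.num_rows}, ~{est_mb:.0f} MB raw vectors")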

@artinshahverdian
Author

@xiaocai2333 do you see any downside of changing the segment size back to 512?
