
[Bug] CrashLoopBackOff starting ChatQnA on a single-node k8s cluster #1202

Open · arun-gupta opened this issue Nov 27, 2024 · 7 comments

@arun-gupta
Contributor

Priority

Undecided

OS type

Ubuntu

Hardware type

Xeon-SPR

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

1.1

Description

I am following the instructions outlined in opea-project/docs#179 to get ChatQnA running on a single-node k8s cluster using Helm charts.

Environment:

Ubuntu 24.04 on EC2 instance

ubuntu@ip-172-31-88-124:~$ minikube version
minikube version: v1.34.0
commit: 210b148df93a80eb872ecbeb7e35281b3c582c61

TGI server is failing to start:

{"timestamp":"2024-11-27T22:44:00.951312Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\n2024-11-27 22:42:25.931 | INFO     | text_generation_server.utils.import_utils:<module>:80 - Detected system ipex\n/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.\n  warnings.warn(\"Could not import SGMV kernel from Punica, falling back to loop.\")\nget_mempolicy: Operation not permitted\nset_mempolicy: Operation not permitted\nget_mempolicy: Operation not permitted\n/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.\n  return func(*args, **kwargs)"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-27T22:44:00.952080Z","level":"ERROR","fields":{"message":"Shard process was signaled to shutdown with signal 9"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-27T22:44:00.972241Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T22:44:00.972260Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart

The pod tries to restart itself and then gets into a CrashLoopBackOff:

ubuntu@ip-172-31-88-124:~$ kubectl get pods
NAME                                      READY   STATUS             RESTARTS      AGE
chatqna-79dccffcfd-44rh6                  1/1     Running            0             37m
chatqna-chatqna-ui-6cfdcfbd7d-lxqnk       1/1     Running            0             37m
chatqna-data-prep-7557c867b-mmjsh         1/1     Running            0             37m
chatqna-nginx-656bc748d4-5mkxt            1/1     Running            0             37m
chatqna-redis-vector-db-65cc8d87b-n7hv4   1/1     Running            0             37m
chatqna-retriever-usvc-65bfdf4f6b-d5rs9   1/1     Running            1 (37m ago)   37m
chatqna-tei-848578d78b-jmll2              1/1     Running            0             37m
chatqna-teirerank-86c76bb86b-86j7w        1/1     Running            0             37m
chatqna-tgi-f67c847b6-d42bq               0/1     CrashLoopBackOff   5 (31s ago)   37m
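A few commands that can help confirm why the container is being killed (a sketch; pod name taken from the listing above, output will vary per cluster):

# Last state and restart reason for the TGI pod (look for OOMKilled / exit code 137, i.e. SIGKILL)
kubectl describe pod chatqna-tgi-f67c847b6-d42bq

# Logs from the previous, crashed container instance
kubectl logs chatqna-tgi-f67c847b6-d42bq --previous

# Recent cluster events, newest last
kubectl get events --sort-by=.lastTimestamp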

Reproduce steps

opea-project/docs#179

Raw log

No response

@wangkl2
Collaborator

wangkl2 commented Nov 28, 2024

@arun-gupta According to huggingface/text-generation-inference#451, the "Shard process was signaled to shutdown with signal 9" error might be caused by not having enough RAM to do the weight conversion.
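If memory is indeed the culprit, the kill should also be visible from the node side. A quick check, sketched for a minikube-based node like the one in the report (dmesg may require sudo):

# Kernel OOM-killer messages on the minikube node
minikube ssh "dmesg | grep -i -E 'out of memory|killed process'"

# Total memory available on the host
free -g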

I've tried deploying ChatQnA on k8s using the Helm charts on both c7i.2xlarge and c7i.12xlarge:

  • On c7i.2xl with only 16GB host memory:
    It keeps logging {"timestamp":"2024-11-28T07:18:07.707003Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]} for the tgi pod, and I can reproduce the CrashLoopBackOff error you hit.
  • On c7i.12xl with 96GB host memory, the downloaded weights are sharded successfully and all services are ready on this node:
{"timestamp":"2024-11-28T05:17:54.921258Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","
span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T05:17:57.640911Z","level":"INFO","fields":{"message":"Using prefix caching = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.640945Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.882289Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:58.074754Z","level":"INFO","fields":{"message":"affinity=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], membind = [0]"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.674064Z","level":"INFO","fields":{"message":"Using experimental prefill chunking = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.815136Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.833034Z","level":"INFO","fields":{"message":"Shard ready in 4.906834084s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}

May I ask which AWS instance you are using and how much memory it has?

@eero-t
Contributor

eero-t commented Nov 28, 2024

This is due to pods not specifying their (CPU/memory) resource requests. The difficulty in specifying those is that resource usage depends on the model and data types configured for the service. The default Mistral model needs >30GB RAM with FP32, and half of that with FP16/BF16 (supported only by the very latest processors).

With correct resource requests, the pod would be in the Pending state when no node has enough free resources, instead of constantly burning a huge amount of CPU in a crash loop...

See discussion: opea-project/GenAIInfra#431
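For anyone hitting this in the meantime, requests can usually be supplied at install time. A sketch, assuming the chatqna chart exposes the TGI subchart's standard resources block (verify the exact keys against the chart's values.yaml; <chatqna-chart> stands for whatever chart reference the deployment guide uses):

helm upgrade --install chatqna <chatqna-chart> \
  --set tgi.resources.requests.cpu=8 \
  --set tgi.resources.requests.memory=32Gi \
  --set tgi.resources.limits.memory=40Gi

With requests like these, the scheduler leaves the TGI pod Pending on an under-sized node instead of letting it crash-loop.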

@arun-gupta
Contributor Author

  • On c7i.12xl with 96GB host memory, the downloaded weights are sharded successfully and all services are ready on this node. [...]
{"timestamp":"2024-11-28T05:17:54.921258Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","
span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T05:17:57.640911Z","level":"INFO","fields":{"message":"Using prefix caching = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.640945Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.882289Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:58.074754Z","level":"INFO","fields":{"message":"affinity=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], membind = [0]"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.674064Z","level":"INFO","fields":{"message":"Using experimental prefill chunking = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.815136Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.833034Z","level":"INFO","fields":{"message":"Shard ready in 4.906834084s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}

May I ask which AWS instance you are using and how much memory it has?

I was using m7i.4xl. Now I tried with the c7i.8xl instance type. Most of the pods are able to start:

ubuntu@ip-172-31-94-2:~/GenAIInfra/helm-charts$ kubectl get pods
NAME                                      READY   STATUS    RESTARTS      AGE
chatqna-79dccffcfd-gpplm                  1/1     Running   0             32m
chatqna-chatqna-ui-6cfdcfbd7d-4dtd5       1/1     Running   0             32m
chatqna-data-prep-7557c867b-khdkc         1/1     Running   0             32m
chatqna-nginx-656bc748d4-xmz65            1/1     Running   0             32m
chatqna-redis-vector-db-65cc8d87b-8pzch   1/1     Running   0             32m
chatqna-retriever-usvc-65bfdf4f6b-8zfq6   1/1     Running   5 (25m ago)   32m
chatqna-tei-848578d78b-kdj8r              1/1     Running   0             32m
chatqna-teirerank-86c76bb86b-c9vm6        1/1     Running   0             32m
chatqna-tgi-f67c847b6-fnckn               0/1     Running   3 (53s ago)   32m

But the TGI pod is still giving an error:

{"timestamp":"2024-11-28T22:19:32.533040Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\n2024-11-28 22:18:31.693 | INFO     | text_generation_server.utils.import_utils:<module>:80 - Detected system ipex\n/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.\n  warnings.warn(\"Could not import SGMV kernel from Punica, falling back to loop.\")\nget_mempolicy: Operation not permitted\nset_mempolicy: Operation not permitted\nget_mempolicy: Operation not permitted\n/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.\n  return func(*args, **kwargs)"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T22:19:32.538073Z","level":"ERROR","fields":{"message":"Shard process was signaled to shutdown with signal 9"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T22:19:32.589189Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T22:19:32.589219Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart

It is essential that we provide the minimum vCPU and memory requirements as part of the instructions. @devpramod

I've requested a service quota increase so that a larger instance type can be started, for example c7i.12xl.

@wangkl2
Collaborator

wangkl2 commented Nov 29, 2024

Yes. It seems that the m7i.4xlarge instance mentioned here is not suitable for the default settings of the K8s deployment with Helm charts, which uses neural-chat-7b and the fp32 dtype in the TGI pod by default.

We could recommend booking a larger AWS instance type, switching to an even smaller model, or using the bfloat16 dtype (available on c7i/m7i instance types and above) for lower memory consumption.
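A sketch of the bf16 route at install time. TGI's launcher accepts a --dtype bfloat16 flag; the value keys below (tgi.LLM_MODEL_ID, tgi.extraCmdArgs) and the placeholders are assumptions to verify against the chart's values.yaml:

# Assumed keys - check the chart before use:
#   tgi.LLM_MODEL_ID  - model served by the TGI pod (swap in a smaller model if desired)
#   tgi.extraCmdArgs  - extra flags forwarded to the TGI launcher
helm upgrade --install chatqna <chatqna-chart> \
  --set tgi.LLM_MODEL_ID=<smaller-model-id> \
  --set tgi.extraCmdArgs="{--dtype,bfloat16}"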

@arun-gupta
Contributor Author

I could successfully start the pods with the m7i.12xl instance type. There is a clear need to document the CPU/memory requirements.

@devpramod
Collaborator

devpramod commented Dec 6, 2024

Hi @wangkl2 Are the CPU, memory and disk requirements for ChatQnA documented for K8s?

@wangkl2
Collaborator

wangkl2 commented Dec 13, 2024

Hi @wangkl2 Are the CPU, memory and disk requirements for ChatQnA documented for K8s?

No. Apparently, the m7i.4xlarge instance with at least 100 GB disk size documented here (I think it was verified with the Docker Compose deployment) is not suitable for the default settings of the K8s deployment with Helm charts, which caused the model conversion issue and the CrashLoopBackOff. From Arun's testing and mine, either m7i.12xl or c7i.12xl should be fine for the k8s deployment with Helm charts. BTW, if we switch to bf16 by default, the memory constraint can be further relaxed.
