
[Bug] CrashLoopBackOff starting ChatQnA on a single-node k8s cluster #1202

Open · arun-gupta opened this issue Nov 27, 2024 · 7 comments

@arun-gupta
Contributor

Priority

Undecided

OS type

Ubuntu

Hardware type

Xeon-SPR

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

1.1

Description

I am following the instructions outlined in opea-project/docs#179 to get ChatQnA running on a single-node k8s cluster using Helm charts.

Environment:

Ubuntu 24.04 on EC2 instance

ubuntu@ip-172-31-88-124:~$ minikube version
minikube version: v1.34.0
commit: 210b148df93a80eb872ecbeb7e35281b3c582c61

TGI server is failing to start:

{"timestamp":"2024-11-27T22:44:00.951312Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\n2024-11-27 22:42:25.931 | INFO     | text_generation_server.utils.import_utils:<module>:80 - Detected system ipex\n/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.\n  warnings.warn(\"Could not import SGMV kernel from Punica, falling back to loop.\")\nget_mempolicy: Operation not permitted\nset_mempolicy: Operation not permitted\nget_mempolicy: Operation not permitted\n/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.\n  return func(*args, **kwargs)"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-27T22:44:00.952080Z","level":"ERROR","fields":{"message":"Shard process was signaled to shutdown with signal 9"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-27T22:44:00.972241Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T22:44:00.972260Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart

The pod tries to restart itself and then gets into a CrashLoopBackOff:

ubuntu@ip-172-31-88-124:~$ kubectl get pods
NAME                                      READY   STATUS             RESTARTS      AGE
chatqna-79dccffcfd-44rh6                  1/1     Running            0             37m
chatqna-chatqna-ui-6cfdcfbd7d-lxqnk       1/1     Running            0             37m
chatqna-data-prep-7557c867b-mmjsh         1/1     Running            0             37m
chatqna-nginx-656bc748d4-5mkxt            1/1     Running            0             37m
chatqna-redis-vector-db-65cc8d87b-n7hv4   1/1     Running            0             37m
chatqna-retriever-usvc-65bfdf4f6b-d5rs9   1/1     Running            1 (37m ago)   37m
chatqna-tei-848578d78b-jmll2              1/1     Running            0             37m
chatqna-teirerank-86c76bb86b-86j7w        1/1     Running            0             37m
chatqna-tgi-f67c847b6-d42bq               0/1     CrashLoopBackOff   5 (31s ago)   37m
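A few commands that can help confirm why the container is being killed (a sketch; pod name taken from the listing above, output will vary per cluster):

# Last state and restart reason for the TGI pod (look for OOMKilled / exit code 137, i.e. SIGKILL)
kubectl describe pod chatqna-tgi-f67c847b6-d42bq

# Logs from the previous, crashed container instance
kubectl logs chatqna-tgi-f67c847b6-d42bq --previous

# Recent cluster events, newest last
kubectl get events --sort-by=.lastTimestamp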

Reproduce steps

opea-project/docs#179

Raw log

No response

@wangkl2
Collaborator

wangkl2 commented Nov 28, 2024

@arun-gupta According to huggingface/text-generation-inference#451, the "Shard process was signaled to shutdown with signal 9" error might be caused by not having enough RAM to do the weight conversion.
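If memory is indeed the culprit, the kill should also be visible from the node side. A quick check, sketched for a minikube-based node like the one in the report (dmesg may require sudo):

# Kernel OOM-killer messages on the minikube node
minikube ssh "dmesg | grep -i -E 'out of memory|killed process'"

# Total memory available on the host
free -g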

I've tried deploying ChatQnA on k8s using the Helm charts on both c7i.2xlarge and c7i.12xlarge:

  • On c7i.2xl with only 16GB host memory:
    It keeps logging {"timestamp":"2024-11-28T07:18:07.707003Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]} for the tgi pod, and I can reproduce the CrashLoopBackOff error you hit.
  • On c7i.12xl with 96GB host memory, the downloaded weights are sharded successfully and all services are ready on this node:
{"timestamp":"2024-11-28T05:17:54.921258Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","
span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T05:17:57.640911Z","level":"INFO","fields":{"message":"Using prefix caching = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.640945Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.882289Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:58.074754Z","level":"INFO","fields":{"message":"affinity=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], membind = [0]"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.674064Z","level":"INFO","fields":{"message":"Using experimental prefill chunking = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.815136Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.833034Z","level":"INFO","fields":{"message":"Shard ready in 4.906834084s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}

May I ask which AWS instance you are using and how much memory it has?

@eero-t
Contributor

eero-t commented Nov 28, 2024

This is due to pods not specifying their (CPU/memory) resource requests. The difficulty in specifying those is that resource usage depends on the model and data types configured for the service. The default Mistral model needs >30GB RAM with FP32, and half of that with FP16/BF16 (supported only by the very latest processors).

With correct resource requests, the pod would be in the Pending state when no node has enough free resources, instead of constantly burning a huge amount of CPU in a crash loop...

See discussion: opea-project/GenAIInfra#431
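For anyone hitting this in the meantime, requests can usually be supplied at install time. A sketch, assuming the chatqna chart exposes the TGI subchart's standard resources block (verify the exact keys against the chart's values.yaml; <chatqna-chart> stands for whatever chart reference the deployment guide uses):

helm upgrade --install chatqna <chatqna-chart> \
  --set tgi.resources.requests.cpu=8 \
  --set tgi.resources.requests.memory=32Gi \
  --set tgi.resources.limits.memory=40Gi

With requests like these, the scheduler leaves the TGI pod Pending on an under-sized node instead of letting it crash-loop.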

@arun-gupta
Contributor Author

  • On c7i.12xl with 96GB host memory, the downloaded weights are sharded successfully and all services are ready on this node. [...]
{"timestamp":"2024-11-28T05:17:54.921258Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","
span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T05:17:57.640911Z","level":"INFO","fields":{"message":"Using prefix caching = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.640945Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:57.882289Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:58.074754Z","level":"INFO","fields":{"message":"affinity=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], membind = [0]"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.674064Z","level":"INFO","fields":{"message":"Using experimental prefill chunking = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.815136Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T05:17:59.833034Z","level":"INFO","fields":{"message":"Shard ready in 4.906834084s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}

May I ask which AWS instance you are using and how much memory it has?

I was using m7i.4xl. Now I tried with the c7i.8xl instance type. Most of the pods are able to start:

ubuntu@ip-172-31-94-2:~/GenAIInfra/helm-charts$ kubectl get pods
NAME                                      READY   STATUS    RESTARTS      AGE
chatqna-79dccffcfd-gpplm                  1/1     Running   0             32m
chatqna-chatqna-ui-6cfdcfbd7d-4dtd5       1/1     Running   0             32m
chatqna-data-prep-7557c867b-khdkc         1/1     Running   0             32m
chatqna-nginx-656bc748d4-xmz65            1/1     Running   0             32m
chatqna-redis-vector-db-65cc8d87b-8pzch   1/1     Running   0             32m
chatqna-retriever-usvc-65bfdf4f6b-8zfq6   1/1     Running   5 (25m ago)   32m
chatqna-tei-848578d78b-kdj8r              1/1     Running   0             32m
chatqna-teirerank-86c76bb86b-c9vm6        1/1     Running   0             32m
chatqna-tgi-f67c847b6-fnckn               0/1     Running   3 (53s ago)   32m

But the TGI pod is still giving an error:

{"timestamp":"2024-11-28T22:19:32.533040Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\n2024-11-28 22:18:31.693 | INFO     | text_generation_server.utils.import_utils:<module>:80 - Detected system ipex\n/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.\n  warnings.warn(\"Could not import SGMV kernel from Punica, falling back to loop.\")\nget_mempolicy: Operation not permitted\nset_mempolicy: Operation not permitted\nget_mempolicy: Operation not permitted\n/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.\n  return func(*args, **kwargs)"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T22:19:32.538073Z","level":"ERROR","fields":{"message":"Shard process was signaled to shutdown with signal 9"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-11-28T22:19:32.589189Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-28T22:19:32.589219Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart

It is essential that we provide the minimum vCPU and memory requirements as part of the instructions. @devpramod

I've requested a service quota increase so that a larger instance type can be started, for example c7i.12xl.

@wangkl2
Collaborator

wangkl2 commented Nov 29, 2024

Yes. It seems that the m7i.4xlarge instance mentioned here is not suitable for the default settings of the K8s deployment with Helm charts, which uses neural-chat-7b and the fp32 dtype in the TGI pod by default.

We could recommend booking a larger AWS instance type, switching to an even smaller model, or using the bfloat16 dtype (available on c7i/m7i instance types and above) for lower memory consumption.
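A sketch of the bf16 route at install time. TGI's launcher accepts a --dtype bfloat16 flag; the value keys below (tgi.LLM_MODEL_ID, tgi.extraCmdArgs) and the placeholders are assumptions to verify against the chart's values.yaml:

# Assumed keys - check the chart before use:
#   tgi.LLM_MODEL_ID  - model served by the TGI pod (swap in a smaller model if desired)
#   tgi.extraCmdArgs  - extra flags forwarded to the TGI launcher
helm upgrade --install chatqna <chatqna-chart> \
  --set tgi.LLM_MODEL_ID=<smaller-model-id> \
  --set tgi.extraCmdArgs="{--dtype,bfloat16}"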

@arun-gupta
Contributor Author

I could successfully start the pods with the m7i.12xl instance type. There is a clear need to document the CPU/memory requirements.

@devpramod
Collaborator

devpramod commented Dec 6, 2024

Hi @wangkl2 Are the CPU, memory and disk requirements for ChatQnA documented for K8s?

@wangkl2
Collaborator

wangkl2 commented Dec 13, 2024

Hi @wangkl2 Are the CPU, memory and disk requirements for ChatQnA documented for K8s?

No. Apparently, the m7i.4xlarge instance with at least 100 GB disk size documented here (I think it was verified with the Docker Compose deployment) is not suitable for the default settings of the K8s deployment with Helm charts, which caused the model conversion issue and the CrashLoopBackOff. From Arun's testing and mine, either m7i.12xl or c7i.12xl should be fine for the k8s deployment with Helm charts. BTW, if we switch to bf16 by default, the memory constraint can be further relaxed.
