[Bug]: Mistral 7B crashes on NVidia Tesla P100 with a CUDA Error #5219

oe3gwu · 2024-06-03T12:27:08Z

Your current environment

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
CPU family: 6
Model: 62
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
Stepping: 4
CPU(s) scaling MHz: 89%
CPU max MHz: 3100.0000
CPU min MHz: 1200.0000
BogoMIPS: 5199.89
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization: VT-x
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 3 MiB (12 instances)
L3 cache: 30 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      6-11,18-23      1               N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I wrote a simple docker-compose.yml that installs vLLM and downloads Mistral 7B. That worked. The --dtype=half flag is needed for the P100. However, after Mistral is downloaded, the container crashes with a CUDA error. As far as I understand, CUDA is deployed within the container, so there is nothing I can do about it.

name: vllm
services:
    vllm-app:
        container_name: vllm-app
        runtime: nvidia
        deploy:
            resources:
                reservations:
                    devices:
                        - driver: nvidia
                          count: all
                          capabilities:
                              - gpu
        volumes:
            - ./vllm/cache/huggingface:/root/.cache/huggingface
        environment:
            - HUGGING_FACE_HUB_TOKEN=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
        ports:
            - 8000:8000
        restart: unless-stopped
        pull_policy: always
        ipc: host
        image: vllm/vllm-openai:latest
        command: --model mistralai/Mistral-7B-v0.3 --dtype=half
        #command: --model facebook/opt-125m --enforce-eager
        #command: --model google/gemma-2b --dtype=float16
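
For anyone hitting the same crash, one quick check before starting the container is to compare the GPU's compute capability against what the prebuilt vLLM binaries target. The sketch below is a minimal illustration and not part of vLLM. It assumes a driver recent enough that `nvidia-smi` exposes the `compute_cap` query field, and it assumes a minimum of 7.0, which is what the vLLM installation docs listed around the time of this thread. The helper names (`parse_gpu_line`, `is_supported`, `query_gpus`) are made up for this example. A Tesla P100 reports 6.0.

```python
import subprocess

# Assumption: minimum compute capability targeted by the prebuilt vLLM
# wheels / Docker image at the time of this thread (check the current docs).
MIN_CAPABILITY = (7, 0)


def parse_gpu_line(line):
    """Parse one 'name, compute_cap' CSV line as printed by nvidia-smi."""
    name, cap = (part.strip() for part in line.rsplit(",", 1))
    major, minor = (int(x) for x in cap.split("."))
    return name, (major, minor)


def is_supported(capability, minimum=MIN_CAPABILITY):
    """Tuple comparison: (6, 0) < (7, 0), so a P100 falls below the minimum."""
    return capability >= minimum


def query_gpus():
    """Ask the driver for each GPU's name and compute capability."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

On the host, `for name, cap in query_gpus(): print(name, cap, is_supported(cap))` would list each card. If the result is below the minimum, the prebuilt image is unlikely to contain kernels for that GPU, and the crash is not something a compose-file change can fix.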

Error Log:

docker.log

Running Mistral 7B using Ollama works fine.

robertgshaw2-neuralmagic · 2024-06-03T15:16:32Z

We do not support P100

dirkson · 2024-08-12T17:20:56Z

> We do not support P100

I am confused at this reply and the closure of this ticket, as I believe there was (and is!) an open PR that adds support for this. #4409

Is there some non-obvious issue with the PR?

robertgshaw2-neuralmagic · 2024-08-12T17:24:30Z

> > We do not support P100
>
> I am confused at this reply and the closure of this ticket, as I believe there was (and is!) an open PR that adds support for this. #4409
>
> Is there some non-obvious issue with the PR?

The decision to not ship a P100 distribution was driven by 3 factors:

* shipping P100 increases our binary size, and we are already close to the PyPI limits as is
* we do not have access to P100 resources to run our CI and have no way to test
* it is relatively easy for users to build vLLM with support for P100, so there is a relatively painless workaround for motivated users

dirkson · 2024-08-12T19:04:17Z

> The decision to not ship a P100 distribution was driven by 3 factors:

Where did you have this discussion? It doesn't seem to be on the PR I linked, and I haven't been able to find it with a casual search.

> * shipping P100 increases our binary size and we are already close to the PyPI limits as is

The PR in question mentions pypi/support#3792, which appears to raise the limit from 100 MB to 400 MB. I think this is a solved issue?

> * we do not have access to P100 resources to run our CI and have no way to test

I am surprised that you don't have access to P100 hardware. As I understand it, it's literally the cheapest inference hardware at the moment. I'm sure that the community would respond if you requested access to someone else's P100s for testing.

> * it is relatively easy for users to build vLLM with support for P100 and so there is a relatively painless workaround for motivated users

Perhaps add this conclusion to your documentation? I had to hunt through numerous bugs and find a PR to figure out how to potentially run vLLM. If the only (un-? semi-?) supported way to run it on common hardware is via a third-party repo, it seems reasonable to mention that in some detailed install documentation.

My apologies if this reply comes off a little grumpy. I get that our goals aren't really aligned here, and that P100 support seems silly from a business perspective, despite the relative accessibility of the hardware.

oe3gwu · 2024-08-12T20:20:21Z

I must agree with @dirkson. Documentation on how to compile vLLM yourself with P100 support is virtually nonexistent. And if it is so easy, why don't you simply do it in your project and mark it as untested? If you don't have the hardware, there are people out there who can test it.

I wanted it for my private use, but this bug report actually made me switch to Ollama, because the argument that you don't have P100s is just a pseudo-argument. A 100 USD investment on eBay will get you one.

Also, I had entire schools (3 schools with approx. 40 cards each) that we have now set up with Ollama and open-webui, because vLLM (which would be the superior tech) simply can't support it. And I am just one person who would have needed that support.

Edit: if this post appears a second time, I am sorry. I tried via email, but it hadn't gone through until now.

oe3gwu added the "bug" (Something isn't working) label Jun 3, 2024

robertgshaw2-neuralmagic closed this as completed Jun 3, 2024
