
Why vllm does not support Chinese input #246

Closed
929359291 opened this issue Jun 26, 2023 · 4 comments · Fixed by #284

Comments

@929359291

There is a decode error with Chinese input; the failure happens in token.decode.

@shifan3

shifan3 commented Jun 26, 2023

If you use chinese-alpaca/llama, remember that their tokenizers are different from the original ones. However, vllm/engine/tokenizer_utils.py forces the use of the original LLaMA tokenizer hf-internal-testing/llama-tokenizer, and this produces the error. vLLM should allow you to pass use_fast=False to avoid this behavior, but currently that is not possible.
Until this is fixed, as a temporary workaround you can simply replace the tokenizer:

```python
from transformers import AutoTokenizer

llm.llm_engine.tokenizer = AutoTokenizer.from_pretrained(
    'YOUR CHINESE ALPACA/LLAMA TOKENIZER', use_fast=False)
```
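
A minimal end-to-end sketch of that workaround, assuming a local Chinese-Alpaca/LLaMA checkpoint at the hypothetical path `path/to/chinese-alpaca` (substitute your own model/tokenizer directory):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Hypothetical local path; point this at your Chinese-Alpaca/LLaMA model.
model_path = "path/to/chinese-alpaca"

llm = LLM(model=model_path)

# Replace the tokenizer vLLM loaded (hf-internal-testing/llama-tokenizer)
# with the slow tokenizer that ships with the Chinese model.
llm.llm_engine.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

outputs = llm.generate(["你好，请介绍一下你自己。"],
                       SamplingParams(temperature=0.8, max_tokens=128))
print(outputs[0].outputs[0].text)
```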

@929359291
Author

> ...as a temporary workaround, you can simply replace the tokenizer by `llm.llm_engine.tokenizer = AutoTokenizer.from_pretrained('YOUR CHINESE ALPACA/LLAMA TOKENIZER', use_fast=False)`

Thanks, I'll try again.

@luoyangen

> Thanks, I'll try again.

Hi, did you succeed by passing use_fast=False?
I tried but got "RuntimeError: CUDA error: device-side assert triggered".

@929359291
Author

> Hi, did you succeed by passing use_fast=False? I tried but got "RuntimeError: CUDA error: device-side assert triggered".

Hi, I have not tried it yet; I am waiting on the vLLM Development Roadmap #244.

@WoosukKwon linked a pull request Jun 28, 2023 that will close this issue
jikunshang pushed a commit to jikunshang/vllm that referenced this issue Oct 31, 2024
To repro:

start server:
`VLLM_SKIP_WARMUP=true python -m vllm.entrypoints.openai.api_server`

send a request (this works fine):
```
curl -v http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "The future of AI is ", "max_tokens": 100, "temperature": 0}'
```

If the request has a seed, it fails:
```
curl -v http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "The future of AI is ", "max_tokens": 100, "temperature": 0, "seed": 37}'
```

Failure happens here:

[vllm-fork/vllm/model_executor/sampling_metadata.py at habana_main · HabanaAI/vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/sampling_metadata.py#L220)

```python
if sampling_params.seed is not None:
    seq_group_metadata.state.generator = torch.Generator(
        device=device).manual_seed(sampling_params.seed)
```
 

`RuntimeError: Device type HPU is not supported for torch.Generator() api.`
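
For context, the error can be reproduced outside the server loop; a minimal sketch, assuming a Gaudi host where the Habana PyTorch bridge (`habana_frameworks.torch`) is installed:

```python
import torch
import habana_frameworks.torch  # assumed present on a Gaudi host; loads the HPU backend

# torch.Generator only accepts CPU/CUDA-style device types, so creating a
# per-request generator on HPU raises:
#   RuntimeError: Device type HPU is not supported for torch.Generator() api.
generator = torch.Generator(device="hpu").manual_seed(37)
```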

This PR fixes the above issue by using htrandom: [Intel Gaudi PyTorch Python API (habana_frameworks.torch) — Gaudi Documentation 1.17.1](https://docs.habana.ai/en/latest/PyTorch/Reference/Python_Packages.html?highlight=htrandom#random-number-generator-apis)
billishyahao pushed a commit to billishyahao/vllm that referenced this issue Dec 31, 2024
* Fix kernel cache miss and add RDNA configs

- added Navi configurations (Related PR: ROCm/triton#640)
- resolved cache miss issue during flash attention calls by fixing max_seqlen_q/k to 0

* Remove Navi autotune configs for triton FP8 support