[New Model]: dunzhang/stella_en_1.5B_v5 #10119
Comments
#9704 has already added support for this.
It fails, but ...
Please follow these instructions on how to install vLLM from source. You should not ...
@DarkLight1337 The installation problem is resolved.
Can you show your environment by running ...?
I'm sure it's not an installation problem. With embedding models, especially models processed by sentence-transformers, it seems the first round does not stop and keeps on generating.
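For reference, a minimal repro sketch using vLLM's embedding entrypoint (assuming the `LLM.encode` API; the prompt is arbitrary):

```python
from vllm import LLM

# Hedged sketch: load the model in embedding mode and encode one prompt.
llm = LLM(model="dunzhang/stella_en_1.5B_v5")
outputs = llm.encode(["What are some ways to reduce stress?"])
print(outputs[0].outputs.embedding[:8])  # first few dims of the vector
```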
When I use an embedding model, it goes through `compute_slot_mapping(is_profile_run: bool, slot_mapping: List[int], ...)` in `vllm.attention.backends.utils`, and the "Compute block table" code in `CommonMetadataBuilder._add_seq_group`.
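For readers unfamiliar with this code path, here is a minimal sketch of the idea behind slot mapping (illustrative only, not vLLM's actual implementation): each token position is mapped through the sequence's block table to a physical slot in the paged KV cache.

```python
from typing import List

def slot_mapping_sketch(block_table: List[int], seq_len: int,
                        block_size: int) -> List[int]:
    # Illustrative only -- not vLLM's actual compute_slot_mapping.
    # block_table[i] is the physical block backing logical block i.
    return [block_table[pos // block_size] * block_size + pos % block_size
            for pos in range(seq_len)]

print(slot_mapping_sketch([7, 2], 6, 4))  # [28, 29, 30, 31, 8, 9]
```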
Please note that embedding models are not supported for the CPU version of vLLM. That is why I asked.
I've switched to pip installation:
Do you get a similar behavior using the HuggingFace tokenizers directly? It may just be that they use different tokenizers.
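A quick way to compare the two tokenizers directly (standard transformers API; the prompt is arbitrary):

```python
from transformers import AutoTokenizer

tok_stella = AutoTokenizer.from_pretrained("dunzhang/stella_en_1.5B_v5")
tok_qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

text = "What are some ways to reduce stress?"
print(tok_stella(text)["input_ids"])
print(tok_qwen(text)["input_ids"])
```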
In vllm/attention/backends/utils.py, lines 165-171, the block_tables ====> ...
After I change the code to ..., I successfully get outputs.embedding when I use the native qwen2 embedding model.
Different from qwen2, it has an extra tokens file: added_tokens.json
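The extra tokens can be inspected with the standard transformers API:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dunzhang/stella_en_1.5B_v5")
print(tok.get_added_vocab())  # entries coming from added_tokens.json
```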
Oh, it looks like this repo uses ... Edit: Hmm, actually, I don't think we can properly load the linear layer right now...
@maxdebayser any thoughts on how to load ...?
HuggingFace supports models laid out like this:
$ tree stella_en_1.5B_v5
(tree output elided)
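Since the tree output did not survive above, one way to inspect the repo layout without cloning it (standard huggingface_hub API):

```python
from huggingface_hub import list_repo_files

files = list_repo_files("dunzhang/stella_en_1.5B_v5")
# Show the per-dimension Dense projection heads alongside the root files.
print([f for f in files if "Dense" in f or "/" not in f])
```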
vLLM loads weights in a similar way to HF transformers, so it only considers the root directory of the repo (in this case, the main ...)
Many models have additional linear layers, but these linear layers are added after the pooler.
There are only two config differences between the two models (stella_en_1.5B_v5 and Qwen2.5-0.5B-Instruct): max_position_embeddings (131072 for stella_en_1.5B_v5 vs. 32768 for Qwen2.5-0.5B-Instruct), and enable_chunked_prefill (True for stella_en_1.5B_v5 vs. False for Qwen2.5-0.5B-Instruct).
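The position-embedding difference is easy to confirm from the configs (standard transformers API; the values are the ones quoted above):

```python
from transformers import AutoConfig

cfg_stella = AutoConfig.from_pretrained("dunzhang/stella_en_1.5B_v5")
cfg_qwen = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
print(cfg_stella.max_position_embeddings)  # 131072
print(cfg_qwen.max_position_embeddings)    # 32768
```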
I think we could add another layer type, like Pooler, that we load after the model is loaded. After that, the embedding model runner could run this layer, if it exists, just like the pooler.

Just thinking out loud: would it make sense to have some kind of wrapper model for sentence-transformer models, where we can load these models in a generic way and handle these different kinds of configurations, or should the embedding model runner handle those? Since the model runner is somewhat tied to the hardware backend, perhaps this would be better to encapsulate hardware-independent behavior that is common to sentence-transformer models.

cc: @flaviabeo
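A minimal sketch of what such a layer could look like (hypothetical class name; the mean pooling and projection mirror the model-card code quoted below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseProjectionPooler(nn.Module):
    """Hypothetical wrapper: mean-pool the hidden states, then apply a
    sentence-transformers style Dense projection and L2-normalize."""

    def __init__(self, hidden_size: int, vector_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, vector_dim)

    def forward(self, last_hidden: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        masked = last_hidden.masked_fill(~attention_mask[..., None].bool(), 0.0)
        pooled = masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        return F.normalize(self.linear(pooled), p=2, dim=-1)
```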
The model to consider.
https://huggingface.co/dunzhang/stella_en_1.5B_v5
In init model:

```python
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
vector_linear_dict = {
    k.replace("linear.", ""): v for k, v in
    torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin")).items()
}
vector_linear.load_state_dict(vector_linear_dict)
```

This model is a qwen2 base model, but in the end a separate linear layer needs to be loaded.

```python
last_hidden_state = model(**input_data)[0]
last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
query_vectors = normalize(vector_linear(query_vectors))
```
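For completeness, a self-contained sketch of the full model-card flow (the local path and the 1024-dim head are assumptions; normalize is sklearn's, as in the upstream model card):

```python
import os
import torch
from sklearn.preprocessing import normalize
from transformers import AutoModel, AutoTokenizer

model_dir = "stella_en_1.5B_v5"  # assumed local clone of the HF repo
vector_dim = 1024
vector_linear_directory = f"2_Dense_{vector_dim}"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir).eval()

# Load the separate Dense projection head shipped next to the base weights.
vector_linear = torch.nn.Linear(model.config.hidden_size, vector_dim)
state = {
    k.replace("linear.", ""): v for k, v in
    torch.load(os.path.join(model_dir, vector_linear_directory,
                            "pytorch_model.bin")).items()
}
vector_linear.load_state_dict(state)

with torch.no_grad():
    input_data = tokenizer(["What are some ways to reduce stress?"],
                           padding="longest", return_tensors="pt")
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(
        ~attention_mask[..., None].bool(), 0.0)
    vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    vectors = normalize(vector_linear(vectors).cpu().numpy())

print(vectors.shape)  # (1, 1024)
```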
The closest model vllm already supports.
No response
What's your difficulty of supporting the model you want?
No response
Before submitting a new issue...