Describe the bug
I am attempting to run the LLaMA2 demo at https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md. When I run:
python client.py --url localhost:9000 --question "Write python function to sum 3 numbers." --seed 1332 --actor python-programmer
I get:
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Invalid number of inputs - Expected: 67; Actual: 66"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Invalid number of inputs - Expected: 67; Actual: 66", grpc_status:3, created_time:"2023-12-20T17:45:16.689999237+00:00"}"
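For what it's worth, counting the converted model's inputs: 32 decoder layers × 2 cache tensors (key and value) = 64, plus input_ids and attention_mask = 66, so the server apparently expects one more input (it turned out to be position_ids, see the update under Additional context). A quick, untested way to list what the server expects is the ovmsclient metadata call below; I am going from memory on the exact layout of the returned dict, so treat it as a sketch.

from ovmsclient import make_grpc_client

# Ask the server which inputs the 'llama' model expects; the model name matches
# the demo's config. The returned metadata should have an "inputs" section keyed by name.
client = make_grpc_client("localhost:9000")
metadata = client.get_model_metadata(model_name="llama")

for name in sorted(metadata["inputs"]):
    print(name)
print("expected input count:", len(metadata["inputs"]))  # should report 67 per the error above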
To Reproduce
Steps to reproduce the behavior:
Follow the demo steps.
Expected behavior
I expected results similar to the demo documentation.
Configuration
Additional context
I did install nncf for int8 compression. Is there a way to configure the example to use int4 compression? (A rough sketch of the kind of call I mean is below the update.)
** Update **
It seems the missing argument is position_ids.
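The nncf call I have in mind for int4 is roughly the following. This is an untested sketch: the model path is a placeholder for wherever the converted IR lives, and the group_size/ratio knobs may need tuning for this model.

import openvino as ov
import nncf

# Placeholder path for the converted demo model; adjust to the actual IR location.
core = ov.Core()
model = core.read_model("llama/1/openvino_model.xml")

# Weight-only compression to int4 instead of the demo's int8 (nncf >= 2.7 API).
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,  # fraction of weights compressed to int4; the rest stay int8
)
ov.save_model(compressed, "llama_int4/1/openvino_model.xml")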
cphoward changed the title from "LLaMA2 Model Serving Chat Demo Errors on Invalid number of arguments" to "LLaMA2 Model Serving Chat Demo Errors on Invalid number of inputs" on Dec 20, 2023
After playing around with the model, I've found that something like
def prepare_preprompt_kv_cache(preprompt):
    # Assumes tokenizer (transformers), client (ovmsclient), and np (numpy) are
    # defined earlier, as in the demo's client.py.
    inputs = tokenizer(preprompt, return_tensors="np", add_special_tokens=False)
    model_inputs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    }
    # Generate position ids based on the length of the input
    seq_length = inputs["input_ids"].shape[1]
    model_inputs["position_ids"] = np.arange(seq_length)[None, :]
    # Initialize past key values for each layer (empty cache: sequence dimension is 0)
    for i in range(32):
        model_inputs[f"past_key_values.{i}.key"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
        model_inputs[f"past_key_values.{i}.value"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
    return client.predict(inputs=model_inputs, model_name='llama')
won't crash if I also change the PREPROMPT to something relatively short. It still crashes when run with the default PREPROMPT:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "Internal inference error"
debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:9000 {created_time:"2023-12-21T22:42:05.540244646+00:00", grpc_status:13, grpc_message:"Internal inference error"}"
The server logs give:
[2023-12-21 22:25:15.297][62][serving][error][modelinstance.cpp:1168] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
Exception from src/inference/src/dev/converter_utils.cpp:707:
[ GENERAL_ERROR ] Shape inference of Multiply node with name __module.model.layers.0.self_attn/aten::mul/Multiply failed: Exception from src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:47:
I have confirmed with a custom script that the model can do inference, but the output is mostly gibberish and only a few characters long.
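The check I used is roughly the following. It reuses the prepare_preprompt_kv_cache helper above and assumes the converted model's logits output is literally named "logits", which may not match every conversion.

import numpy as np

# Single forward pass through OVMS and greedy-decode one token; tokenizer and client
# are the same objects the demo's client.py builds.
outputs = prepare_preprompt_kv_cache("Write a python function to sum 3 numbers.")
logits = outputs["logits"]             # assumed output name; shape (1, seq_len, vocab_size)
next_token = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_token]))  # in my runs this is usually a nonsense token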