In this tutorial, We will present a list of examples to introduce the usage of lmdeploy.pipeline
.
You can overview the detailed pipeline API in this guide.
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2_5-7b-chat')
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
In this example, the pipeline by default allocates a predetermined percentage of GPU memory for storing k/v cache. The ratio is dictated by the parameter TurbomindEngineConfig.cache_max_entry_count
.
There have been alterations to the strategy for setting the k/v cache ratio throughout the evolution of LMDeploy. The following are the change histories:
-
v0.2.0 <= lmdeploy <= v0.2.1
TurbomindEngineConfig.cache_max_entry_count
defaults to 0.5, indicating 50% GPU total memory allocated for k/v cache. Out Of Memory (OOM) errors may occur if a 7B model is deployed on a GPU with memory less than 40G. If you encounter an OOM error, please decrease the ratio of the k/v cache occupation as follows:from lmdeploy import pipeline, TurbomindEngineConfig # decrease the ratio of the k/v cache occupation to 20% backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2) pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config) response = pipe(['Hi, pls intro yourself', 'Shanghai is']) print(response)
-
lmdeploy > v0.2.1
The allocation strategy for k/v cache is changed to reserve space from the GPU free memory proportionally. The ratio
TurbomindEngineConfig.cache_max_entry_count
has been adjusted to 0.8 by default. If OOM error happens, similar to the method mentioned above, please consider reducing the ratio value to decrease the memory usage of the k/v cache.
from lmdeploy import pipeline, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat',
backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
top_k=40,
temperature=0.8,
max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
print(response)
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
top_k=40,
temperature=0.8,
max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
backend_config=backend_config)
prompts = [[{
'role': 'user',
'content': 'Hi, pls intro yourself'
}], [{
'role': 'user',
'content': 'Shanghai is'
}]]
response = pipe(prompts,
gen_config=gen_config)
print(response)
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
top_k=40,
temperature=0.8,
max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
backend_config=backend_config)
prompts = [[{
'role': 'user',
'content': 'Hi, pls intro yourself'
}], [{
'role': 'user',
'content': 'Shanghai is'
}]]
for item in pipe.stream_infer(prompts, gen_config=gen_config):
print(item)
from lmdeploy import pipeline, GenerationConfig
pipe = pipeline('internlm/internlm2_5-7b-chat')
gen_config=GenerationConfig(output_logits='generation'
max_new_tokens=10)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
logits = [x.logits for x in response]
Get last layer's hidden states for generated tokens
from lmdeploy import pipeline, GenerationConfig
pipe = pipeline('internlm/internlm2_5-7b-chat')
gen_config=GenerationConfig(output_last_hidden_state='generation',
max_new_tokens=10)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
hidden_states = [x.last_hidden_state for x in response]
from transformers import AutoTokenizer
from lmdeploy import pipeline
model_repoid_or_path = 'internlm/internlm2_5-7b-chat'
pipe = pipeline(model_repoid_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_repoid_or_path, trust_remote_code=True)
messages = [
{"role": "user", "content": "Hello, how are you?"},
]
input_ids = tokenizer.apply_chat_template(messages)
# ppl is a list of float numbers
ppl = pipe.get_ppl(input_ids)
print(ppl)
- When input_ids is too long, an OOM (Out Of Memory) error may occur. Please apply it with caution
- get_ppl returns the cross entropy loss without applying the exponential operation afterwards
pip install triton>=2.1.0
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig
backend_config = PytorchEngineConfig(session_len=2048)
gen_config = GenerationConfig(top_p=0.8,
top_k=40,
temperature=0.8,
max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
backend_config=backend_config)
prompts = [[{
'role': 'user',
'content': 'Hi, pls intro yourself'
}], [{
'role': 'user',
'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig
backend_config = PytorchEngineConfig(session_len=2048,
adapters=dict(lora_name_1='chenchi/lora-chatglm2-6b-guodegang'))
gen_config = GenerationConfig(top_p=0.8,
top_k=40,
temperature=0.8,
max_new_tokens=1024)
pipe = pipeline('THUDM/chatglm2-6b',
backend_config=backend_config)
prompts = [[{
'role': 'user',
'content': '您猜怎么着'
}]]
response = pipe(prompts, gen_config=gen_config, adapter_name='lora_name_1')
print(response)
-
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
If you got this for tp>1 in pytorch backend. Please make sure the python script has following
if __name__ == '__main__':
Generally, in the context of multi-threading or multi-processing, it might be necessary to ensure that initialization code is executed only once. In this case,
if __name__ == '__main__':
can help to ensure that these initialization codes are run only in the main program, and not repeated in each newly created process or thread. -
To customize a chat template, please refer to chat_template.md.
-
If the weight of lora has a corresponding chat template, you can first register the chat template to lmdeploy, and then use the chat template name as the adapter name.