This document provides instructions for deploying the inference server of Yuan2.0.
- [CKPT model Inference-Server](#ckpt-model-inference-server)
- [HuggingFace model Inference-Server](#huggingface-model-inference-server)
- [API Testing](#api-testing)
## CKPT model Inference-Server

- First step, modify the script file (e.g. `examples/run_inference_server_2.1B.sh`); a sketch of the variables follows this list.
  - `TOKENIZER_MODEL_PATH` indicates the storage path for tokenizer-related files.
  - `CHECKPOINT_PATH` indicates the storage path for model checkpoint files.
  - `GPUS_PER_NODE` indicates the number of GPUs used on this node; it should match the model's tensor-parallel size.
  - `CUDA_VISIBLE_DEVICES` indicates which GPU numbers are used; the count of listed devices should match `GPUS_PER_NODE`.
  - `PORT` indicates the port number used by the service; one service occupies one port, and users can change it as needed.
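For reference, a hedged sketch of how these variables might appear near the top of the script; the values below are placeholders for illustration, not the defaults shipped with the repository:

```bash
# Placeholder values: adjust to your environment.
TOKENIZER_MODEL_PATH=/path/to/tokenizer   # storage path for tokenizer files
CHECKPOINT_PATH=/path/to/checkpoint       # storage path for model checkpoint files
GPUS_PER_NODE=2                           # must match the model's tensor-parallel size
export CUDA_VISIBLE_DEVICES=0,1           # two devices listed, consistent with GPUS_PER_NODE=2
PORT=8000                                 # one service occupies one port
```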
- Second step, run the script in the repository for deployment:

```bash
# 2.1B deployment command
bash examples/run_inference_server_2.1B.sh
# 51B deployment command
bash examples/run_inference_server_51B.sh
# 102B deployment command
bash examples/run_inference_server_102B.sh
```
## HuggingFace model Inference-Server

- First step, modify the script file `examples/run_inference_server_hf.sh`; a sketch of the variables follows this list.
  - `HF_PATH` indicates the storage path for HuggingFace model files.
  - `CUDA_VISIBLE_DEVICES` indicates which GPU numbers are used.
  - `PORT` indicates the port number used by the service; one service occupies one port, and users can change it as needed.
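Similarly, a hedged sketch with placeholder values for the HuggingFace script:

```bash
# Placeholder values: adjust to your environment.
HF_PATH=/path/to/hf_model        # storage path for the HuggingFace model files
export CUDA_VISIBLE_DEVICES=0    # GPU number(s) to use
PORT=8000                        # one service occupies one port
```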
- Second step, run the script in the repository for deployment:

```bash
bash examples/run_inference_server_hf.sh
```
- Attention: when running on Windows/CPU, flash attention needs to be turned off manually, and the HuggingFace model files need to be modified as follows:
  - Set `"use_flash_attention"` in `config.json` to `false`.
  - Comment out lines 35 and 36 in `yuan_hf_model.py`.
  - Modify line 271 in `yuan_hf_model.py` to `inference_hidden_states_memory = torch.empty(bsz, 2, hidden_states.shape[2], dtype=hidden_states.dtype)`.
## API Testing

- Testing with Python

  We also provide sample code to test the performance of the API calls. Before running it, make sure to modify the `ip` and `port` in the code to match the API deployment:

```bash
python tools/start_inference_server_api.py
```
- Testing with Curl

```bash
# Return the Unicode-escaped encoding
curl http://127.0.0.1:8000/yuan -X PUT \
     --header 'Content-Type: application/json' \
     --data '{"ques_list":[{"id":"000","ques":"请帮忙作一首诗,主题是冬至"}], "tokens_to_generate":500, "top_k":5}'

# Return the original (decoded) form
echo -en "$(curl -s http://127.0.0.1:8000/yuan -X PUT --header 'Content-Type: application/json' --data '{"ques_list":[{"id":"000","ques":"作一首词,主题是冬至"}], "tokens_to_generate":500, "top_k":5}')"
```
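Because `ques_list` is a JSON array whose items each carry their own `id`, the API may accept several questions in a single call; the two-item payload below is an assumption based on that payload shape, not a documented guarantee:

```bash
# Assumed batch request: two questions in one ques_list (not confirmed by the docs)
curl http://127.0.0.1:8000/yuan -X PUT \
     --header 'Content-Type: application/json' \
     --data '{"ques_list":[{"id":"000","ques":"请帮忙作一首诗,主题是冬至"},{"id":"001","ques":"作一首词,主题是冬至"}], "tokens_to_generate":500, "top_k":5}'
```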