# Yuan2.0 Inference-Server

## Introduction

This document provides instructions for deploying the inference server of Yuan2.0.

- [CKPT model Inference-Server](#ckpt-model-inference-server)
- [HuggingFace model Inference-Server](#huggingface-model-inference-server)
- [API Testing](#api-testing)

## CKPT model Inference-Server

- First step: modify the script file (example settings are sketched after this list).

  `TOKENIZER_MODEL_PATH` indicates the storage path of the tokenizer files;
  `CHECKPOINT_PATH` indicates the storage path of the model checkpoint files;
  `GPUS_PER_NODE` indicates the number of GPUs used on this node, which must match the tensor-parallel size of the model;
  `CUDA_VISIBLE_DEVICES` indicates which GPUs are used, and the number of listed devices must match `GPUS_PER_NODE`;
  `PORT` indicates the port number used by the service; each service occupies one port, and users can modify it as needed.

- Second step: run the corresponding script in the repository to deploy the service.

  ```bash
  # 2.1B deployment command
  bash examples/run_inference_server_2.1B.sh

  # 51B deployment command
  bash examples/run_inference_server_51B.sh

  # 102B deployment command
  bash examples/run_inference_server_102B.sh
  ```
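
For reference, here is a minimal sketch of the variables one would edit at the top of a deployment script such as `examples/run_inference_server_2.1B.sh`. The variable names come from the description above; the values are placeholders to adapt to your environment:

```bash
# Hypothetical example values -- adjust paths, GPU list, and port to your setup.
TOKENIZER_MODEL_PATH=/path/to/tokenizer   # storage path of the tokenizer files
CHECKPOINT_PATH=/path/to/checkpoint       # storage path of the model checkpoint files
GPUS_PER_NODE=1                           # must match the tensor-parallel size of the model
export CUDA_VISIBLE_DEVICES=0             # number of listed GPUs must match GPUS_PER_NODE
PORT=8000                                 # one port per service instance
```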

## HuggingFace model Inference-Server

- First step: modify the script file `examples/run_inference_server_hf.sh` (example settings are sketched after this list).

  `HF_PATH` indicates the storage path of the HuggingFace model files;
  `CUDA_VISIBLE_DEVICES` indicates which GPUs are used;
  `PORT` indicates the port number used by the service; each service occupies one port, and users can modify it as needed.
    
- Second step: run the script in the repository to deploy the service.

  ```bash
  bash examples/run_inference_server_hf.sh
  ```
- Note: when running on Windows or on CPU, FlashAttention must be disabled manually, and the HuggingFace model files need to be modified as follows:

  Set `"use_flash_attention"` in `config.json` to `false`;
  comment out lines 35 and 36 in `yuan_hf_model.py`;
  change line 271 in `yuan_hf_model.py` to `inference_hidden_states_memory = torch.empty(bsz, 2, hidden_states.shape[2], dtype=hidden_states.dtype)`.
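
For reference, a minimal sketch of the variables one would edit at the top of `examples/run_inference_server_hf.sh`; the values below are placeholders:

```bash
# Hypothetical example values -- adjust to your setup.
HF_PATH=/path/to/hf_model       # storage path of the HuggingFace model files
export CUDA_VISIBLE_DEVICES=0   # GPU(s) to use
PORT=8000                       # one port per service instance
```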
    

## API Testing

- Testing with Python

  We also provide sample code to test API calls. Before running it, modify the `ip` and `port` in the code to match your deployment (a minimal sketch of such a call appears at the end of this section).

  ```bash
  python tools/start_inference_server_api.py
  ```
- Testing with Curl

  ```bash
  # returns the response with Unicode escapes
  # (the prompt asks for a poem on the theme of the winter solstice)
  curl http://127.0.0.1:8000/yuan -X PUT \
  --header 'Content-Type: application/json' \
  --data '{"ques_list":[{"id":"000","ques":"请帮忙作一首诗,主题是冬至"}], "tokens_to_generate":500, "top_k":5}'

  # decodes the escapes and returns the original text
  echo -en "$(curl -s http://127.0.0.1:8000/yuan -X PUT --header 'Content-Type: application/json' --data '{"ques_list":[{"id":"000","ques":"作一首词,主题是冬至"}], "tokens_to_generate":500, "top_k":5}')"
  ```
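
For reference, here is a minimal Python sketch of the same request, assuming the third-party `requests` library (the shipped test script may be implemented differently). The endpoint, HTTP method, and payload fields follow the curl examples above:

```python
import requests

# Adjust the ip and port to match your deployment (the PORT set in the run script).
URL = "http://127.0.0.1:8000/yuan"

payload = {
    # prompt: "Please help write a poem on the theme of the winter solstice"
    "ques_list": [{"id": "000", "ques": "请帮忙作一首诗,主题是冬至"}],
    "tokens_to_generate": 500,
    "top_k": 5,
}

# The service accepts PUT requests with a JSON body, as shown in the curl examples.
resp = requests.put(URL, json=payload)
resp.raise_for_status()
print(resp.text)
```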