Refactor torch inference engine (#871)
* WIP
* cache engine wip
* finish cache engine
* fix cache and scheduler
* add paged attention
* step and stop
* add infer
* add request process
* fix end
* request without schedulersession
* add logits processor
* better context
* update patch
* [Improve] Use 4d input in pytorch poc (#371)
* 4D input, model.eval and llama config
* use auto dtype
* tp wip
* almost
* update logger
* run_check=false
* little optimize, current best, redist w/o dtensor, host mem in que, less rewrite, less code, update model weight
* share attention forward
* fix end
* Support Baichuan (#382)
* add baichuan WIP
* support baichuan
* support baichuan-13b
* fix
* add chat template
* lint
* comments
* fix
* Move `q_seq_info` into `context` (#398)
* move q seq info into context
* remove debugs
* remove debugs
* alibi wip
* add alibi
* reduce logic block (#435)
* add docstring
* add baichuan lint (#445)
* add fill cache back
* support internlm
* fix path of weight index
* Support chatglm2 in pytorch_poc (#360)
* draft support for chatglm2
* debug llama
* gitignore
* update input_id
* better patching
* patch chatglm2 model
* fix after merge
* remove inits
* q_seq_info & remove some debug & orig_self
* remove old unsqueeze inputid
* update patch and model config
* remove debugs and clean codes
* clean codes
* add credit
* add update id / fix dependency
* rename modules (#504)
Co-authored-by: grimoire <[email protected]>
* optimize fill kv cache (#523)
* optimize fill kv cache
* update internlm
* faster embedding
* fix bias tp
* fix baichuan2
* fix fill kv cache
* fix lint
---------
* Make trust_remote_code as cli argument (#434)
* trust_remote_code_argument
* format
* update tokenizer
* optimize rotary
* wtf
* Support Falcon models (#406)
* move q seq info into context
* falcon aligned
* trust_remote_code_argument
* fix for falcon
* comment out debugs
* comment out debugs
* use position id in context
* remove codes in falcon model
* Revert "comment out debugs"
This reverts commit ee26a25.
* 7b correct
* 1b aligned
* remove debugs
* patch to ignore position ids
* remove debug in alibi, avoid empty inputs
* fix
* rename dir to replace to "models"
* use position_id and new fill kernel
* remove useless get_prompt func
* fix batch>2
* Refactor scheduler (#551)
* optimize block manager
* scheduler wip
* finish scheduler
* update engine
* profile pytorch poc (#455)
* profile pytorch poc
* update doc and import if need
* arg
* support profile_throughput.py
* reuse pytorch session
* end session
* Support Tensor parallel on Falcon models (#582)
* tp falcon 1b and 7b works
* remove debugs
* update copyright
* add some comments
* remove a debug
* support new hub models
* support 40b
* support 40b model config
* try
* recover
* fix remain len
* Apply rotary kernel (#572)
* apply rotary kernel
* format
* update rmsnorm
* update rms norm
* better unittest
* add docstring
---------
Co-authored-by: grimoire <[email protected]>
* fix(pytorch_poc): memory cal (#606)
* fix(pytorch_poc): memory cal
* Optimize attention (#597)
* add unittest
* add split k
* add docstring
* fast split k
* optimize load
* manually setup device and stream
* lint
---------
Co-authored-by: grimoire <[email protected]>
* feat(pytorch_poc): implement ReRoPE (#625)
* fix(pytorch_poc): memory cal
* style(pytorch_poc): lint
* style(.pre-commit-config.yaml): update
* style(pytorch_poc): remove useless
* feat(pytorch_poc): llama2 support rerope
* feat(pytorch_poc): fix long input generate
* feat(lmdeploy): add kernel
* feat(lmdeploy): update
* feat(lmdeploy): add rerope implementation
* fix(lmdeploy/pytorch_poc): apply rotary_emb
* fix(lmdeploy): update
* style(pytorch_poc): format
* style(lmdeploy): fix lint
* style(lmdeploy): typo
* style(pytorch_poc): format
* style(pytorch_poc): format
* fix(pytorch_poc): rms_norm add mask
* style(pytorch_poc/kernels): format rerope
* style(pytorch_poc): format rerope attn function description
* style(lmdeploy/pytorch_poc): format
* style(pytorch_poc): add code ref
* style(pytorch_poc): format rerope attn
* Refactor engine (#623)
* add agent
* optimize postprocess
* optimize decoding fill cache
* add docstring
* logit to cuda
* blocksize 128
* optimize pre/post process
* fix postprocess
* cpu pre/post process
* manually setup stream and device
* remove context
* update model agent
* update max session len
* remove tqdm
* update pre/post process
* inplace kernel
* avoid kv_len computation
* flash decoding with one cache
* remove comment
* add warning when no enough resources
* step if has unfinish
* add request manager
* better fill kv cache
* fix fill kv cache
* optimize prefill attention
* refactor
* refactoring...
* add custom output
* use cache
---------
Co-authored-by: grimoire <[email protected]>
* [Feature] w8a8 based on pytorch poc (#595)
* refactor smoothquant and support load w8a8 model by from_pretrained
* add w8a8 docs
* add w8a8 en docs
* add convert_to_qmodules function
---------
Co-authored-by: grimoire <[email protected]>
* feat(lmdeploy): add rerope quantization (#718)
* feat(lmdeploy): add rerope quantization
* feat(lmdeploy): fix review
* [Refactor & Doc] Improve w8a8 and add docstring (#768)
* WIP
* improve w8a8 and add doc string
* add docstring
* add docstring
* fix lint
* rename pytorch poc (#764)
* rename pytorch poc
* fix lint
* add docstring
* add docstring
* refactor patch
* add recompute eviction support
* recovery modeling
* add docstring
* Unified paging (#860)
* change 'model_format' to 'qwen' when 'model_name' starts with 'qwen' (#575)
* avoid split chinese characters during decoding (#566)
* add solar chat template (#576)
* robust incremental decode for leading space (#581)
* robust incremental decode for leading space
* speed up lookup as prefix_space_tokens is shorter than no_prefix_space_tokens
* add UT and fix qwen stuff
* update solar chat template (#587)
* Revert "[Docs] Simplify `build.md` (#370)" (#586)
This reverts commit 4b5c2bd.
* Fix crash and remove `sys_instruct` from `chat.py` and `client.py` (#591)
* fix crash
* update profile_generation.py
* format
* use self.bos_id
* remove sys_instruct
* bump version to v0.0.12 (#604)
* Add "build from docker" section (#602)
* add build from docker section
* update
* install python package
* update
* update
* update
* Add more user-friendly CLI (#541)
* add
* import fire in main
* wrap to speed up fire cli
* update
* update docs
* update docs
* fix
* resolve comments
* resolve conflict and add test for cli
* support inference a batch of prompts (#467)
* support inference a batch of prompts
* docstring and assert
* bump version to v0.0.13 (#620)
* Improve api_server and webui usage (#544)
* make IPv6 compatible, safe run for coroutine interrupting
* instance_id -> session_id and fix api_client.py
* update doc
* remove useless faq
* safe ip mapping
* update app.py
* WIP completion
* completion
* update doc
* disable interactive mode for /v1/chat/completions
* docstring
* docstring
* refactor gradio
* update gradio
* update
* update doc
* rename
* session_id default -1
* missed two files
* add a APIClient
* add chat func for APIClient
* refine
* add concurrent function
* sequence_start, sequence_end --> interactive_mode
* update doc
* comments
* doc
* better text completion
* remove /v1/embeddings
* comments
* deprecate generate and use /v1/interactive/completions
* /v1/interactive/completion -> /v1/chat/interactive
* embeddings
* rename
* remove wrong arg description
* docstring
* fix
* update cli
* update doc
* strict session_len limit condition
* pass model args to api_server
* fix: gradio gr.Button.update deprecated after 4.0.0 (#637)
* add cli to list the supported model names (#639)
* update
* resolve comment
* Refactor model conversion (#296)
* split deploy.py
* fix get_cuda_tensor
* deploy qwen_awq
* fix lint
* add docstring
* fix
* support baichuan/baichuan-awq
* parameterizing size_per_head
* remove try/except
* limit input model_format
* add quant_path param
* remove old deploy.py
* fix path
* fix transformer layer range when load bins
* fix qwen init
* split & save log
* relative import
* update get_config
* WeightFileMgr -> Reader
* rename
* update
* fix init_layer_id
* rename llama.py -> meta_llama.py, hf.py -> llama.py
* reduce code
* update arg description
* fix meta llama
* manually cleanup meta model params
* [Enhance] internlm message to prompt (#499)
* update turbomind session_len with model.session_len (#634)
* [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized (#605)
* fix awq
* adapt new qwen code
* adapt qwen 14b and baichuan2 7b
* add docstring
* add runtime error for qwen
* FIX: fix stop_session func bug (#578)
* FIX: fix stop_session func bug
* keep sequence_end = False
---------
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
* Manage session id using random int for gradio local mode (#553)
* Use session id from gradio state
* use a new session id after reset
* rename session id like a state
* update comments
* reformat files
* init session id on block loaded
* use auto increased session id
* remove session id textbox
* apply to api_server and tritonserver
* update docstring
* add lock for safety
---------
Co-authored-by: AllentDan <[email protected]>
* fix benchmark serving computation mistake (#630)
* fix benchmark serving computation mistake
* fix timestamps computations
* remove speed up
* no mp
* mp seems faster?
* remove
* update
* remove
* fix
* update
* update print log
* typo
* print first token latency only stream==True
* remove renew_session
* update AsyncEngine
* fix tokenizer_info when convert the model (#661)
* Add check env sub command (#654)
* add check env
* update issue template
* remove some reqs from check env
* resolve comment
* fix Tokenizer load error when the path of the being-converted model is not writable (#669)
* Add UltraCM and WizardLM chat templates (#599)
* add ultracm eval chat template
* add WizardLM chat template
* use ultrachat template instead of ultracm usecase
* bump version to v0.0.14 (#663)
* Add extra_requires to reduce dependencies (#580)
* update reqs
* update docs
* resolve comments
* upgrade pydantic
* fix rebase
* update doc
* update
* update
* update readme
* update
* add flash-attn
* TurboMind 2 (#590)
* refresh decoder attention kernel
* block-level kv cache
* `BlockManager` & `SequenceManager`
* update
* update
* update
* update
* rename
* GQA support
* fix context length
* GQA dispatch
* kv8
* tune
* async stream cb
* nvtx
* config parsing
* debug
* optimize output cost
* split-k decoding
* minor
* truncate `session_len` by available blocks
* minor
* license
* fix
* dispatch `cp.async`
* fix linking
* fix
* fix deadlock
* guard input length
* correct start offset
* fix prefill chunking
* fix `cache_block_seq_len` param passing
* fix `block_size` fmtstr
* fix output tokens
* fix batch resizing
* fix masking of finished sequences
* add debug util
* free unused block early
* add ntk scaling and logn scaling
* cmake flags
* fix typo
* w4a16 for sm75
* fix msvc build
* fix msvc build
* fix block verification
* fix msvc build
* use `std::shuffle`
* fix lint
* fix lint
* fix lint
* clear incoming buffer
* clear finished requests
* fix batch initialization
* fix typo
* fix typo
* fix comparison
* [Docs] Update Supported Matrix (#679)
* update supported matrix
* change the default shard size when saving quantized weights
* baichuan2 kv8
* update kv8 docs (#681)
* Fix init of batch state (#682)
* fix init of finished buf
* fix `finished_count`
* fix turbomind stream canceling (#686)
* fix
* instance for each forward
* [Fix] Fix load_checkpoint_in_model bug (#690)
* fix load_checkpoint_in_model bug
* fix comments
* fix comments
* fix bugs
* [Doc] Update restful api doc (#662)
* update restful_api.md
* add a hint
* repeat 3 time
* Fix Tokenizer encode (#645)
* same encode with HF
* sequence_start -> add_bos
* complement
* Fix wrong eos_id and bos_id obtained through grpc api (#644)
* Fix wrong eos_id and bos_id obtained through grpc api
* fix according to review comments
* update
* Optimize for throughput (#701)
* tmp
* update
* update
* optimize for throughput
* update
* fix eos
* clean up
* fix serving
* fix indexed copy
* minor
* minor
---------
Co-authored-by: lvhan028 <[email protected]>
* Check-in user guide about turbomind config (#680)
* update
* update config guide
* update guide
* update user guide according to review comments
* Replace mmengine with mmengine-lite (#715)
* Support loading hf model directly (#685)
* turbomind support export model params
* fix overflow
* support turbomind.from_pretrained
* fix tp
* support AutoModel
* support load kv qparams
* update auto_awq
* update docstring
* export lmdeploy version
* update doc
* remove download_hf_repo
* LmdeployForCausalLM -> LmdeployForCausalLM
* refactor turbomind.py
* update comment
* add bfloat16 convert back
* support gradio run_local load hf
* support restful api server load hf
* add docs
* support loading previous quantized model
* adapt pr 690
* update docs
* not export turbomind config when quantize a model
* check model_name when can not get it from config.json
* update readme
* remove model_name in auto_awq
* update
* update
* update
* fix build
* absolute import
* Fix cache/output length calculation (#738)
* bump version to v0.1.0a0 (#709)
* [Fix] Skip empty batch (#747)
* [Fix] build docker image failed since `packaging` is missing (#753)
* [Fix] Rollback the data type of input_ids to TYPE_UINT32 in preprocessor's proto (#758)
* Set the default value of `max_context_token_num` 1 (#761)
* rename pytorch poc
* fix lint
* add docstring
* add docstring
* refactor patch
* add recompute eviction support
* fix typo (#769)
* add triton server test and workflow yml (#760)
* add triton server test and workflow yml
* update
* revert changes in dockerfile
* update prompts
* recovery modeling
* fix turbomind build on sm<80 (#754)
* fix
* fix lint
* improvement(build): enable ninja and gold linker (#767)
* feat(build): enable ninja and lld
* fix(.github): add ninja installation
* fix(CI): remove dimsize=256
* fix(CI): add option for generate.sh
* fix(docs): update
* Report first-token-latency and token-latency percentiles (#736)
* update profile scripts
* add top_p, top_k and temperature as input arguments
* fix input_ids
* update profile_throughput
* update profile_restful_api
* update profile_serving
* update
* update
* add progress bar
* remove TODO comments
* update
* remove useless profile_* argument
* remove log level
* change concurrency default value to 64
* update restful_api.md
* update according to review comments
* fix docstring
* convert model with hf repo_id (#774)
* bump version to 0.1.0a1 (#776)
* Update benchmark user guide (#763)
* user guide of benchmark generation
* update benchmark generation guide
* update profiling throughput guide
* update profiling api_server guide
* rename file names
* update profile tis user guide
* update
* fix according to review comments
* update
* update according to review comments
* update
* add an example
* update
* add docstring
* add unified paging attention support
* refactor block manager
* do not alloc zero
* Fix early exit condition in attention kernel (#788)
* add chat template for Yi (#779)
* Fix missed arguments when benchmark static inference performance (#787)
* minor fix in the profile scripts and docs
* miss arguments
* typo
* fix lint
* update
* Unify prefill & decode passes (#775)
* Unify prefill and decode passes
* dynamic split-fuse
* refactor
* correct input count calculation
* remove unused
* lint
* lint
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* add cuda12.1 build check ci (#782)
* update cuda12.1 build check ci
* use matrix
* auto upload cuda12.1 python pkg to release when create new tag (#784)
* add cuda12-whl-release ci
* enable environment
* test py310-311 windows wheel
* fix py310, py311 setup.py error on windows
* fix lint
* fix extra colon in InternLMChat7B (#796)
* fix local kv head num (#806)
* Report the inference benchmark of models with different size (#794)
* update test scripts for models with different sizes
* update
* only test after tuning gemm
* chmod +x
* fix typo
* benchmark on a100
* fix typo
* fix typo
* per-token latency percentile in profile_throughput
* fix
* fix
* rename
* make the script accept parameters
* minor fix
* indent
* reformat table
* change to 3000
* minor fix
* bump version to v0.1.0a2 (#807)
* fix out of bounds access (#809)
* update scheduler
* optimize request
* Simplify block manager (#812)
* simplify block manager
* fix lint
* set smem size for repetition penalty kernel (#818)
* add mbgemm&mbgemv
* fix recompute, fix mbgmm
---------
Co-authored-by: Lyu Han <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>
* [Fix] Adapt to the pyTorch poc branch (#863)
* Adapt to the pyTorch poc branch
* Adapt to the pyTorch poc branch
* fix comments
* update model
* update benchmark
* [Fix] Fix conflicts in `lite` (#878)
* cherry-pick Fix meta tensor error commits
* fix smooth quant
---------
Co-authored-by: pppppM <[email protected]>
* [Feature] Support w8a8 tp (#888)
* fix smooth quant save_pretrained
* support w8a8 tp
* change weight and bias in QLinear back to buffer
* remove debug codes and add comments
* fix message step update
* update docs
---------
Co-authored-by: grimoire <[email protected]>
Co-authored-by: WRH <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>