Refactor torch inference engine (#871)
* WIP

* cache engine wip

* finish cache engine

* fix cache and scheduler

* add paged attention
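
Paged attention keeps the KV cache in fixed-size blocks and resolves a sequence's logical token positions through a per-sequence block table. A minimal gather sketch in plain PyTorch; the names, shapes, and block size are illustrative assumptions, not the engine's actual API:

```python
import torch

def gather_kv(k_cache, v_cache, block_table, seq_len, block_size=16):
    """Collect one sequence's keys/values from a paged cache.

    k_cache, v_cache: [num_blocks, block_size, num_heads, head_dim]
    block_table: logical -> physical block ids for this sequence
    """
    num_blocks = (seq_len + block_size - 1) // block_size
    phys = block_table[:num_blocks]                # physical block ids
    k = k_cache[phys].flatten(0, 1)[:seq_len]      # [seq_len, heads, dim]
    v = v_cache[phys].flatten(0, 1)[:seq_len]
    return k, v
```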

* step and stop

* add infer

* add request process

* fix end

* request without schedulersession

* add logits processor
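
A logits processor rewrites the raw logits before sampling (temperature, penalties, top-k masking and so on). A rough sketch of the pattern, not the engine's actual class:

```python
import torch

def process_logits(logits, temperature=0.8, top_k=40):
    """Temperature scaling followed by top-k masking."""
    logits = logits / max(temperature, 1e-6)
    if top_k > 0:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float('-inf'))
    return logits

# sampling then draws from the filtered distribution
probs = torch.softmax(process_logits(torch.randn(1, 32000)), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```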

* better context

* update patch

* [Improve] Use 4d input in pytorch poc (#371)

* 4D input, model.eval and llama config

* use auto dtype

* tp wip

* almost

* update logger

* run_check=false

* little optimize

current best

redist w/o dtensor

host mem in queue

less rewrite

less code

update model weight

* share attention forward

* fix end

* Support Baichuan (#382)

* add baichuan WIP

* support baichuan

* support baichuan-13b

* fix

* add chat template

* lint

* comments

* fix

* Move `q_seq_info` into `context` (#398)

* move q seq info into context

* remove debugs

* remove debugs

* alibi wip

* add alibi
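
ALiBi drops positional embeddings and instead adds a per-head linear bias to the attention scores; each head gets a slope from a geometric sequence. A sketch of the usual formulation, assuming a power-of-two head count:

```python
import torch

def alibi_slopes(num_heads):
    """Per-head slopes, assuming num_heads is a power of two."""
    base = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([base ** (i + 1) for i in range(num_heads)])

def alibi_bias(slopes, seq_len):
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).clamp(max=0)  # k_pos - q_pos <= 0
    return slopes[:, None, None] * dist                # [heads, q, k]
```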

* reduce logic block (#435)

* add docstring

* add baichuan lint (#445)

* add fill cache back

* support internlm

* fix path of weight index

* Support chatglm2 in pytorch_poc (#360)

* draft support for chatglm2

* debug llama

* gitignore

* update input_id

* better patching

* patch chatglm2 model

* fix after merge

* remove inits

* q_seq_info & remove some debug & orig_self

* remove old unsqueeze input_id

* update patch and model config

* remove debugs and clean codes

* clean codes

* add credit

* add update id / fix dependency

* rename modules (#504)

Co-authored-by: grimoire <[email protected]>

* optimize fill kv cache (#523)

* optimize fill kv cache

* update internlm

* faster embedding

* fix bias tp

* fix baichuan2

* fix fill kv cache

* fix lint

---------
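
The fill-kv-cache path scatters the keys/values of freshly computed tokens into their block-mapped cache slots; the commits above fuse this into a kernel, but the effect is an indexed copy. A loop-based sketch with assumed shapes and names:

```python
def fill_kv_cache(k_cache, v_cache, k_new, v_new,
                  block_table, start_pos, block_size=16):
    """Write k_new/v_new ([num_new, heads, dim]) for one sequence,
    starting at logical position start_pos."""
    for i in range(k_new.size(0)):
        pos = start_pos + i
        block = block_table[pos // block_size]
        slot = pos % block_size
        k_cache[block, slot] = k_new[i]
        v_cache[block, slot] = v_new[i]
```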

* Make trust_remote_code as cli argument (#434)

* trust_remote_code_argument

* format

* update tokenizer

* optimize rotary

* wtf

* Support Falcon models (#406)

* move q seq info into context

* falcon aligned

* trust_remote_code_argument

* fix for falcon

* comment out debugs

* comment out debugs

* use position id in context

* remove codes in falcon model

* Revert "comment out debugs"

This reverts commit ee26a25.

* 7b correct

* 1b aligned

* remove debugs

* patch to ignore position ids

* remove debug in alibi, avoid empty inputs

* fix

* rename dir to replace to "models"

* use position_id and new fill kernel

* remove useless get_prompt func

* fix batch>2

* Refactor scheduler (#551)

* optimize block manager

* scheduler wip

* finish scheduler

* update engine
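
The refactored scheduler sits on a block manager that owns the pool of cache blocks; a sequence runs only if its next step fits, otherwise it waits or is preempted. A toy sketch of the allocation side (assumed interface, not the actual classes):

```python
class BlockManager:
    """Fixed pool of KV-cache blocks handed out to sequences."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def can_allocate(self, n):
        return len(self.free_blocks) >= n

    def allocate(self, n):
        blocks, self.free_blocks = self.free_blocks[:n], self.free_blocks[n:]
        return blocks

    def free(self, blocks):
        self.free_blocks.extend(blocks)
```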

* profile pytorch poc (#455)

* profile pytorch poc

* update doc and import if need

* arg

* support profile_throughput.py

* reuse pytorch session

* end session

* Support Tensor parallel on Falcon models (#582)

* tp falcon 1b and 7b works

* remove debugs

* update copyright

* add some comments

* remove a debug

* support new hub models

* support 40b

* support 40b model config

* try

* recover

* fix remaining len

* Apply rotary kernel (#572)

* apply rotary kernel

* format

* update rmsnorm

* update rms norm

* better unittest

* add docstring

---------

Co-authored-by: grimoire <[email protected]>
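
The rotary kernel fuses the position-dependent rotation of q/k channel pairs; mathematically it matches this eager version (interleaved-pair convention; shapes and names are assumptions):

```python
import torch

def apply_rotary(x, pos, base=10000.0):
    """x: [seq, heads, dim]; pos: [seq] absolute positions."""
    dim = x.size(-1)
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    ang = pos[:, None].float() * inv_freq            # [seq, dim/2]
    cos, sin = ang.cos()[:, None], ang.sin()[:, None]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```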

* fix(pytorch_poc): memory cal (#606)

* fix(pytorch_poc): memory cal

* Optimize attention (#597)

* add unittest

* add split k

* add docstring

* fast split k

* optimize load

* manually setup device and stream

* lint

---------

Co-authored-by: grimoire <[email protected]>
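
Split-k attention lets several blocks each reduce a slice of the KV sequence, then merges the partial softmax results with a running-max correction. The merge rule the kernel relies on, written eagerly (tensor layout assumed):

```python
import torch

def merge_splits(outs, maxes, sums):
    """outs:  [k, heads, dim], unnormalized partial weighted sums
    maxes: [k, heads], per-split max attention score
    sums:  [k, heads], per-split sum of exp(score - max)"""
    gmax = maxes.max(dim=0).values                   # [heads]
    scale = torch.exp(maxes - gmax)                  # [k, heads]
    total = (sums * scale).sum(dim=0)                # [heads]
    return (outs * scale[..., None]).sum(dim=0) / total[..., None]
```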

* feat(pytorch_poc): implement ReRoPE (#625)

* fix(pytorch_poc): memory cal

* style(pytorch_poc): lint

* style(.pre-commit-config.yaml): update

* style(pytorch_poc): remove useless

* feat(pytorch_poc): llama2 support rerope

* feat(pytorch_poc): fix long input generate

* feat(lmdeploy): add kernel

* feat(lmdeploy): update

* feat(lmdeploy): add rerope implementation

* fix(lmdeploy/pytorch_poc): apply rotary_emb

* fix(lmdeploy): update

* style(pytorch_poc): format

* style(lmdeploy): fix lint

* style(lmdeploy): typo

* style(pytorch_poc): format

* style(pytorch_poc): format

* fix(pytorch_poc): rms_norm add mask

* style(pytorch_poc/kernels): format rerope

* style(pytorch_poc): format rerope attn function description

* style(lmdeploy/pytorch_poc): format

* style(pytorch_poc): add code ref

* style(pytorch_poc): format rerope attn
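
ReRoPE keeps exact rotary positions inside a window w and rectifies (clamps) the relative position to w beyond it, which is why the implementation computes two sets of scores and selects by distance. The position rule alone, with an assumed window size:

```python
import torch

def rerope_rel_pos(q_pos, k_pos, window=512):
    """Relative positions after ReRoPE rectification."""
    rel = q_pos[:, None] - k_pos[None, :]   # [q_len, k_len]
    return rel.clamp(max=window)            # everything past w acts as w

# scores within the window use true RoPE positions; scores past it use
# the fixed position w, and a (rel > window) mask picks between the two
```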

* Refactor engine (#623)

* add agent

* optimize postprocess

* optimize decoding fill cache

* add docstring

* logit to cuda

* blocksize 128

* optimize pre/post process

* fix postprocess

* cpu pre/post process

* manually setup stream and device

* remove context

* update model agent

* update max session len

* remove tqdm

* update pre/post process

* inplace kernel

* avoid kv_len computation

* flash decoding with one cache

* remove comment

* add warning when no enough resources

* step if there are unfinished requests

* add request manager

* better fill kv cache

* fix fill kv cache

* optimize prefill attention

* refactor

* refactoring...

* add custom output

* use cache

---------

Co-authored-by: grimoire <[email protected]>

* [Feature] w8a8 based on pytorch poc (#595)

* refactor smoothquant and support load w8a8 model by from_pretrained

* add w8a8 docs

* add w8a8 en docs

* add convert_to_qmodules function

---------

Co-authored-by: grimoire <[email protected]>
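
w8a8 here follows the SmoothQuant-style recipe: weights quantized offline per output channel to int8, activations quantized per token at runtime, with the matmul accumulated in int32. A minimal sketch (hypothetical helpers, not the shipped QLinear):

```python
import torch

def quant_weight(w):                         # offline, w: [out, in]
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    return torch.round(w / scale).to(torch.int8), scale

def w8a8_linear(x, w_q, w_scale):            # runtime, x: [tokens, in]
    x_scale = (x.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    x_q = torch.round(x / x_scale).to(torch.int8)
    y = (x_q.to(torch.int32) @ w_q.to(torch.int32).T).float()
    return y * x_scale * w_scale.T           # dequantize back to float
```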

* feat(lmdeploy): add rerope quantization (#718)

* feat(lmdeploy): add rerope quantization

* feat(lmdeploy): fix review

* [Refactor & Doc] Improve w8a8 and add docstring (#768)

* WIP

* improve w8a8 and add doc string

* add docstring

* add docstring

* fix lint

* rename pytorch poc (#764)

* rename pytorch poc

* fix lint

* add docstring

* add docstring

* refactor patch

* add recompute eviction support

* recovery modeling

* add docstring

* Unified paging (#860)

* change 'model_format' to 'qwen' when 'model_name' starts with 'qwen' (#575)

* avoid split chinese characters during decoding (#566)

* add solar chat template (#576)

* robust incremental decode for leading space (#581)

* robust incremental decode for leading space

* speed up lookup as prefix_space_tokens is shorter than no_prefix_space_tokens

* add UT and fix qwen stuff
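
The leading-space fix matters because SentencePiece-style tokenizers drop the leading '▁' when a token is decoded in isolation; decoding a short window of trailing tokens and emitting only the suffix keeps spaces (and partial UTF-8 bytes) intact. A sketch with a hypothetical tokenizer object:

```python
def incremental_decode(tokenizer, all_ids, window=10):
    """Return only the newly generated text, preserving leading spaces."""
    tail = all_ids[-window:]
    text = tokenizer.decode(tail)
    prev = tokenizer.decode(tail[:-1])
    if text.startswith(prev):       # normal case: emit the new suffix
        return text[len(prev):]
    return ''                       # token still incomplete, emit later
```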

* update solar chat template (#587)

* Revert "[Docs] Simplify `build.md` (#370)" (#586)

This reverts commit 4b5c2bd.

* Fix crash and remove `sys_instruct` from `chat.py` and `client.py` (#591)

* fix crash

* update profile_generation.py

* format

* use self.bos_id

* remove sys_instruct

* bump version to v0.0.12 (#604)

* Add "build from docker" section (#602)

* add build from docker section

* update

* install python package

* update

* update

* update

* Add more user-friendly CLI (#541)

* add

* import fire in main

* wrap to speed up fire cli

* update

* update docs

* update docs

* fix

* resolve comments

* resolve conflict and add test for cli

* support inference a batch of prompts (#467)

* support inference a batch of prompts

* docstring and assert

* bump version to v0.0.13 (#620)

* Improve api_server and webui usage (#544)

* make IPv6 compatible, safe run for coroutine interrupting

* instance_id -> session_id and fix api_client.py

* update doc

* remove useless faq

* safe ip mapping

* update app.py

* WIP completion

* completion

* update doc

* disable interactive mode for /v1/chat/completions

* docstring

* docstring

* refactor gradio

* update gradio

* update

* update doc

* rename

* session_id default -1

* missed two files

* add a APIClient

* add chat func for APIClient

* refine

* add concurrent function

* sequence_start, sequence_end --> interactive_mode

* update doc

* comments

* doc

* better text completion

* remove /v1/embeddings

* comments

* deprecate generate and use /v1/interactive/completions

* /v1/interactive/completion -> /v1/chat/interactive

* embeddings

* rename

* remove wrong arg description

* docstring

* fix

* update cli

* update doc

* strict session_len limit condition

* pass model args to api_server

* fix: gradio gr.Button.update deprecated after 4.0.0 (#637)

* add cli to list the supported model names (#639)

* update

* resolve comment

* Refactor model conversion (#296)

* split deploy.py

* fix get_cuda_tensor

* deploy qwen_awq

* fix lint

* add docstring

* fix

* support baichuan/baichuan-awq

* parameterizing size_per_head

* remove try/except

* limit input model_format

* add quant_path param

* remove old deploy.py

* fix path

* fix transformer layer range when load bins

* fix qwen init

* split & save log

* relative import

* update get_config

* WeightFileMgr -> Reader

* rename

* update

* fix init_layer_id

* rename llama.py -> meta_llama.py, hf.py -> llama.py

* reduce code

* update arg description

* fix meta llama

* manually cleanup meta model params

* [Enhance] internlm message to prompt (#499)

* update turbomind session_len with model.session_len (#634)

* [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized (#605)

* fix awq

* adapt new qwen code

* adapt qwen 14b and baichuan2 7b

* add docstring

* add runtime error for qwen

* FIX: fix stop_session func bug (#578)

* FIX: fix stop_session func bug

* keep sequence_end = False

---------

Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>

* Manage session id using random int for gradio local mode (#553)

* Use session id from gradio state

* use a new session id after reset

* rename session id like a state

* update comments

* reformat files

* init session id on block loaded

* use auto increased session id

* remove session id textbox

* apply to api_server and tritonserver

* update docstring

* add lock for safety

---------

Co-authored-by: AllentDan <[email protected]>

* fix benchmark serving computation mistake (#630)

* fix benchmark serving computation mistake

* fix timestamps computations

* remove speed up

* no mp

* mp seems faster?

* remove

* update

* remove

* fix

* update

* update print log

* typo

* print first token latency only when stream==True

* remove renew_session

* update AsyncEngine

* fix tokenizer_info when convert the model (#661)

* Add check env sub command (#654)

* add check env

* update issue template

* remove some reqs from check env

* resolve comment

* fix Tokenizer load error when the path of the model being converted is not writable (#669)

* Add UltraCM and WizardLM chat templates (#599)

* add ultracm eval chat template

* add WizardLM chat template

* use ultrachat template instead of ultracm usecase

* bump version to v0.0.14 (#663)

* Add extra_requires to reduce dependencies (#580)

* update reqs

* update docs

* resolve comments

* upgrade pydantic

* fix rebase

* update doc

* update

* update

* update readme

* update

* add flash-attn

* TurboMind 2 (#590)

* refresh decoder attention kernel

* block-level kv cache

* `BlockManager` & `SequenceManager`

* update

* update

* update

* update

* rename

* GQA support

* fix context length

* GQA dispatch

* kv8

* tune

* async stream cb

* nvtx

* config parsing

* debug

* optimize output cost

* split-k decoding

* minor

* truncate `session_len` by available blocks

* minor

* license

* fix

* dispatch `cp.async`

* fix linking

* fix

* fix deadlock

* guard input length

* correct start offset

* fix prefill chunking

* fix `cache_block_seq_len` param passing

* fix `block_size` fmtstr

* fix output tokens

* fix batch resizing

* fix masking of finished sequences

* add debug util

* free unused block early

* add ntk scaling and logn scaling
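
NTK scaling stretches the RoPE base so long wavelengths span a longer context without retraining, and logn scaling grows query magnitude once the sequence exceeds the trained length. Both adjustments, sketched with assumed defaults:

```python
import math

def ntk_scaled_base(base=10000.0, dim=128, alpha=2.0):
    # NTK-aware adjustment: effective context grows roughly alpha times
    return base * alpha ** (dim / (dim - 2))

def logn_scale(seq_len, trained_len=2048):
    # applied to queries only when seq_len exceeds the training length
    return max(1.0, math.log(seq_len) / math.log(trained_len))
```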

* cmake flags

* fix typo

* w4a16 for sm75

* fix msvc build

* fix msvc build

* fix block verification

* fix msvc build

* use `std::shuffle`

* fix lint

* fix lint

* fix lint

* clear incoming buffer

* clear finished requests

* fix batch initialization

* fix typo

* fix typo

* fix comparison

* [Docs] Update Supported Matrix (#679)

* update supported matrix

* change the default shard size when saving quantized weights

* baichuan2 kv8

* update kv8 docs (#681)

* Fix init of batch state (#682)

* fix init of finished buf

* fix `finished_count`

* fix turbomind stream canceling (#686)

* fix

* instance for each forward

* [Fix] Fix load_checkpoint_in_model bug (#690)

* fix load_checkpoint_in_model bug

* fix comments

* fix comments

* fix bugs

* [Doc] Update restful api doc (#662)

* update restful_api.md

* add a hint

* repeat 3 time

* Fix Tokenizer encode (#645)

* same encode with HF

* sequence_start -> add_bos

* complement

* Fix wrong eos_id and bos_id obtained through grpc api (#644)

* Fix wrong eos_id and bos_id obtained through grpc api

* fix according to review comments

* update

* Optimize for throughput (#701)

* tmp

* update

* update

* optimize for throughput

* update

* fix eos

* clean up

* fix serving

* fix indexed copy

* minor

* minor

---------

Co-authored-by: lvhan028 <[email protected]>

* Check-in user guide about turbomind config (#680)

* update

* update config guide

* update guide

* update user guide according to review comments

* Replace mmengine with mmengine-lite (#715)

* Support loading hf model directly (#685)

* turbomind support export model params

* fix overflow

* support turbomind.from_pretrained

* fix tp

* support AutoModel

* support load kv qparams

* update auto_awq

* update docstring

* export lmdeploy version

* update doc

* remove download_hf_repo

* LmdeployForCausalLM -> LmdeployForCausalLM

* refactor turbomind.py

* update comment

* add bfloat16 convert back

* support gradio run_local load hf

* support restful api server load hf

* add docs

* support loading previous quantized model

* adapt pr 690

* update docs

* not export turbomind config when quantize a model

* check model_name when can not get it from config.json

* update readme

* remove model_name in auto_awq

* update

* update

* update

* fix build

* absolute import

* Fix cache/output length calculation (#738)

* bump version to v0.1.0a0 (#709)

* [Fix] Skip empty batch (#747)

* [Fix] build docker image failed since `packaging` is missing (#753)

* [Fix] Roll back the data type of input_ids to TYPE_UINT32 in preprocessor's proto (#758)

* Set the default value of `max_context_token_num` 1 (#761)

* rename pytorch poc

* fix lint

* add docstring

* add docstring

* refactor patch

* add recompute eviction support

* fix typo (#769)

* add triton server test and workflow yml (#760)

* add triton server test and workflow yml

* update

* revert changes in dockerfile

* update prompts

* recovery modeling

* fix turbomind build on sm<80 (#754)

* fix

* fix lint

* improvement(build): enable ninja and gold linker (#767)

* feat(build): enable ninja and lld

* fix(.github): add ninja installation

* fix(CI): remove dimsize=256

* fix(CI): add option for generate.sh

* fix(docs): update

* Report first-token-latency and token-latency percentiles (#736)

* update profile scripts

* add top_p, top_k and temperature as input arguments

* fix input_ids

* update profile_throughput

* update profile_restful_api

* update profile_serving

* update

* update

* add progress bar

* remove TODO comments

* update

* remove useless profile_* argument

* remove log level

* change concurrency default value to 64

* update restful_api.md

* update according to review comments

* fix docstring

* convert model with hf repo_id (#774)

* bump version to 0.1.0a1 (#776)

* Update benchmark user guide (#763)

* user guide of benchmark generation

* update benchmark generation guide

* update profiling throughput guide

* update profiling api_server guide

* rename file names

* update profile tis user guide

* update

* fix according to review comments

* update

* update according to review comments

* update

* add an example

* update

* add docstring

* add unified paging attention support

* refactor block manager

* do not alloc zero

* Fix early exit condition in attention kernel (#788)

* add chat template for Yi (#779)

* Fix missed arguments when benchmark static inference performance (#787)

* minor fix in the profile scripts and docs

* miss arguments

* typo

* fix lint

* update

* Unify prefill & decode passes (#775)

* Unify prefill and decode passes

* dynamic split-fuse

* refactor

* correct input count calculation

* remove unused

* lint

* lint

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* add cuda12.1 build check ci (#782)

* update cuda12.1 build check ci

* use matrix

* auto upload cuda12.1 python pkg to release when create new tag (#784)

* add cuda12-whl-release ci

* enable environment

* test py310-311 windows wheel

* fix py310, py311 setup.py error on windows

* fix lint

* fix extra colon in InternLMChat7B (#796)

* fix local kv head num (#806)

* Report the inference benchmark of models with different size (#794)

* update test scripts for models with different sizes

* update

* only test after tuning gemm

* chmod +x

* fix typo

* benchmark on a100

* fix typo

* fix typo

* per-token latency percentile in profile_throughput

* fix

* fix

* rename

* make the script accept parameters

* minor fix

* indent

* reformat table

* change to 3000

* minor fix

* bump version to v0.1.0a2 (#807)

* fix out of bounds access (#809)

* update scheduler

* optimize request

* Simplify block manager (#812)

* simplify block manager

* fix lint

* set smem size for repetition penalty kernel (#818)
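
The repetition penalty kernel implements the CTRL-style rule: logits of tokens already generated are divided by the penalty if positive, multiplied if negative. Its effect, written eagerly:

```python
import torch

def repetition_penalty(logits, seen_ids, penalty=1.1):
    """logits: [vocab]; seen_ids: 1-D tensor of generated token ids."""
    scores = logits[seen_ids]
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    logits[seen_ids] = scores
    return logits
```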

* add mbgemm & mbgemv

* fix recompute, fix mbgmm

---------

Co-authored-by: Lyu Han <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>

* [Fix] Adapt to the pyTorch poc branch (#863)

* Adapt to the pyTorch poc branch

* Adapt to the pyTorch poc branch

* fix comments

* update model

* update benchmark

* [Fix] Fix conflicts in `lite` (#878)

* cherry-pick 'Fix meta tensor error' commits

* fix smooth quant

---------

Co-authored-by: pppppM <[email protected]>

* [Feature] Support w8a8 tp (#888)

* fix smooth quant save_pretrained

* support w8a8 tp

* change weight and bias in QLinear back to buffer

* remove debug codes and add comments

* fix message step update

* update docs

---------
Co-authored-by: grimoire <[email protected]>
Co-authored-by: WRH <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>
18 people authored Dec 28, 2023
1 parent ddfa8c4 commit 344e126
Showing 91 changed files with 15,562 additions and 303 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -39,7 +39,7 @@ jobs:
- name: Check docstring coverage
run: |
python -m pip install interrogate
- interrogate -v --ignore-init-method --ignore-magic --ignore-module --ignore-private --ignore-nested-functions --ignore-nested-classes --fail-under 80 lmdeploy
+ interrogate -v --exclude ./lmdeploy/pytorch_poc/modeling/ --ignore-init-method --ignore-magic --ignore-module --ignore-private --ignore-nested-functions --ignore-nested-classes --fail-under 80 lmdeploy
- name: Check pylint score
run: |
python -m pip install pylint
2 changes: 2 additions & 0 deletions .gitignore
@@ -9,6 +9,7 @@ __pycache__/

# Distribution / packaging
.Python
+ triton-rerope/
develop-eggs/
dist/
downloads/
@@ -61,6 +62,7 @@ work_dir*/
!lmdeploy/turbomind/hf_repo/config.json

# Pytorch
+ *.pt
*.pth
*.py~
*.sh~
8 changes: 7 additions & 1 deletion .pre-commit-config.yaml
@@ -3,7 +3,7 @@ repos:
rev: 4.0.1
hooks:
- id: flake8
args: ["--exclude=lmdeploy/turbomind/triton_models/*"]
args: ["--exclude=lmdeploy/turbomind/triton_models/*", "--max-line-length=79"]
- repo: https://github.com/PyCQA/isort
rev: 5.11.5
hooks:
@@ -12,6 +12,12 @@ repos:
rev: v0.32.0
hooks:
- id: yapf
+ name: yapf
+ description: 'Formatter for Python code'
+ entry: yapf
+ language: python
+ args: ['-i', '--style={based_on_style: pep8, column_limit: 79}']

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.2.0
hooks: