Manage session id using random int for gradio local mode #553

Merged
12 commits merged on Nov 6, 2023

Conversation

aisensiy
Contributor

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

There is a related issue, #503. The current solution, which uses the host IP, makes the session break randomly if the gradio service is behind a stateless load balancer.

Modification

Add a session_id State in gradio. The State is bound to the current page, so it is not influenced by the host IP from gradio.Request.
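
To illustrate the motivation, here is a minimal sketch in plain Python with no gradio involved. `SessionTable`, the helper functions, and the IP strings are all made up for illustration; the point is only that a key derived from the observed IP can change between turns, while an id carried by the page does not.

```python
class SessionTable:
    """Toy stand-in for a server that keeps chat history per session key."""

    def __init__(self):
        self.histories = {}

    def append(self, key, message):
        # Create the history on first use, then append to it.
        self.histories.setdefault(key, []).append(message)
        return list(self.histories[key])


def chat_keyed_by_ip(table, observed_ips, messages):
    """Key each turn by whatever client IP the server happens to observe."""
    return [table.append(ip, msg) for ip, msg in zip(observed_ips, messages)]


def chat_keyed_by_page(table, page_session_id, messages):
    """Key each turn by an id that the page itself carries."""
    return [table.append(page_session_id, msg) for msg in messages]


messages = ["hello", "and then?"]
# Assumption: a stateless load balancer makes the same user appear
# under two different addresses on consecutive turns.
ip_turns = chat_keyed_by_ip(SessionTable(), ["10.0.0.1", "10.0.0.2"], messages)
page_turns = chat_keyed_by_page(SessionTable(), 7, messages)
```

In this toy run the second IP-keyed turn starts from an empty history, whereas the page-keyed turn still sees the first message.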

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@aisensiy
Contributor Author

This PR conflicts heavily with #544 ...

@AllentDan
Collaborator

This PR conflicts heavily with #544 ...

Yes, could you also add support for api_server mode?

@aisensiy
Contributor Author

This PR conflicts heavily with #544 ...

Yes, could you also add support for api_server mode?

So do you mean I should continue this PR and ignore the modifications in #544?

@AllentDan
Collaborator

AllentDan commented Oct 13, 2023

Either option works for me. We can merge this PR first and I will resolve the conflicts. Alternatively, you can make your changes on top of #544 and wait until #544 gets merged.

@lvhan028 lvhan028 requested a review from AllentDan October 13, 2023 06:35
@lvhan028
Collaborator

@aisensiy community is first :)
@AllentDan Let's leave the conflicts to ourselves 😺

@aisensiy
Contributor Author

PR #544 is already a big one that covers more than its stated topic. Splitting it into multiple PRs might make it easier to manage. 🤔

@lvhan028 lvhan028 requested review from lvhan028 and irexyc October 13, 2023 08:31
@aisensiy
Contributor Author

I will wait for the merge of #544 and check if this PR is still necessary.

@irexyc
Collaborator

irexyc commented Oct 16, 2023

state_session_id seems to be the same for every session. I think using a global variable is a better idea.

# global session id
global_session_id = [0]

async def chat_stream_local(
    instruction: str,
    state_chatbot: Sequence,
    cancel_btn: gr.Button,
    reset_btn: gr.Button,
    session_id: int,
):
    if session_id == -1:
        session_id = global_session_id[-1] + 1
        global_session_id[-1] += 1

with gr.Blocks(css=CSS, theme=THEME) as demo:
    # state_session_id = gr.State(random.randint(0, 100000))
    state_session_id = gr.State(-1)
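
The sentinel idea in the snippet above can be sketched on its own in plain Python. This is only an illustration: `ensure_session_id` and `_next_id` are hypothetical names, and in a real gradio handler the newly assigned value would also have to be written back to the State, which is not shown here.

```python
import itertools

# Hypothetical module-level counter; ids start at 1.
_next_id = itertools.count(1)


def ensure_session_id(session_id: int) -> int:
    """Return session_id unchanged, or allocate a fresh one if it is -1."""
    if session_id == -1:
        return next(_next_id)
    return session_id


first_page = ensure_session_id(-1)       # page that has no id yet
second_page = ensure_session_id(-1)      # another page without an id
same_page_again = ensure_session_id(first_page)  # id is kept once assigned
```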

@aisensiy
Contributor Author

aisensiy commented Oct 16, 2023

I think you are right: the state does not change across pages. This scenario needs a different session id for every page, not for every user.
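
A toy model of why the ids came out equal, written in plain Python rather than gradio. `PageState` is a hypothetical class, not gradio's implementation; it only mimics the fact that `random.randint(...)` passed as a default is evaluated once when the layout is built, whereas a function run on each page load can hand out a new value every time.

```python
import random


class PageState:
    """Toy model of a per-page state slot (not gradio's implementation)."""

    def __init__(self, default):
        # The default is computed once, when the layout is built.
        self.default = default

    def value_for_new_page(self, on_load=None):
        # Without a load hook every page starts from the same default.
        return on_load() if on_load is not None else self.default


state = PageState(random.randint(0, 100000))
shared = [state.value_for_new_page() for _ in range(3)]

counter = iter(range(1, 4))
distinct = [
    state.value_for_new_page(on_load=lambda: next(counter)) for _ in range(3)
]
```

Here `shared` holds three copies of one value, while `distinct` holds a different id per simulated page load.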

@aisensiy
Contributor Author

Area.mp4

This screencast shows the current session id management in two web pages.

I did not use a global variable. Instead I use a .load function on the block, and all of the changes follow the gradio docs. I also added a number box that shows the current session id on the web page. It is for debugging and can be removed if this PR is going to be merged.

  1. For session management: https://www.gradio.app/guides/state-in-blocks
  2. For load event: https://www.gradio.app/docs/blocks#blocks-load

@aisensiy
Contributor Author

Any comment? 🤔

@irexyc
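Collaborator

irexyc commented Oct 19, 2023

Any comment? 🤔

Wouldn't it be better to increment session_id sequentially? Random values seem unnecessary, and if the same id happens to be drawn twice, the sessions would conflict.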

@aisensiy
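Contributor Author

aisensiy commented Oct 19, 2023

Any comment? 🤔

Wouldn't it be better to increment session_id sequentially? Random values seem unnecessary, and if the same id happens to be drawn twice, the sessions would conflict.

Thanks for the quick reply. With a sufficiently large range, the probability of a random collision is extremely low, so this should not be a concern. A global auto-increment can run into problems in some multithreaded situations, but for the current use case either approach should work. I just don't much like reading and modifying a global variable inside a function...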

import gradio as gr

global_session_id = 0


def update():
    global global_session_id  # not a fan of this approach
    global_session_id += 1
    return global_session_id

with gr.Blocks() as demo:
    with gr.Row():
        out = gr.Number()
    btn = gr.Button("Run")
    btn.click(fn=update, inputs=None, outputs=out)
    demo.load(fn=update, inputs=None, outputs=out)

demo.launch()
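
The collision question raised in this exchange can be checked with a short birthday-problem calculation. This is a sketch under stated assumptions: ids are drawn uniformly and independently, and the figure of 100 concurrent sessions is an arbitrary illustration, not a number from this PR. The range of 100001 values corresponds to `random.randint(0, 100000)` from the earlier snippet.

```python
def collision_probability(num_sessions: int, id_space: int) -> float:
    """Probability that at least two of num_sessions uniform ids coincide."""
    p_all_distinct = 1.0
    for i in range(num_sessions):
        # The i-th draw must avoid the i ids that are already taken.
        p_all_distinct *= (id_space - i) / id_space
    return 1.0 - p_all_distinct


# 100001 is the number of values random.randint(0, 100000) can return.
p_small_range = collision_probability(100, 100001)
# A 63-bit id space, for comparison.
p_large_range = collision_probability(100, 2**63)
```

Under these assumptions the smaller range gives a collision chance on the order of a few percent, while the 63-bit range is negligible, so how safe random ids are depends heavily on the size of the range.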

@AllentDan
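Collaborator

AllentDan commented Oct 20, 2023

Actually, app.py already manages similar global variables this way. If that doesn't suit you, you could also put a session_id variable in InterFace and manage it there.

class InterFace:

It still feels somewhat safer than a random value.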
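
One way to keep an auto-incrementing id in a single place and still be safe under threads is to guard it with a lock. The sketch below is an assumption for illustration only: `SessionIdAllocator` is a hypothetical class, not the project's `InterFace`, and it shows the general pattern rather than the code that was merged.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SessionIdAllocator:
    """Hypothetical holder for an auto-incrementing session id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_id = 0

    def next_id(self) -> int:
        # The lock makes increment-and-read a single atomic step.
        with self._lock:
            self._last_id += 1
            return self._last_id


allocator = SessionIdAllocator()
with ThreadPoolExecutor(max_workers=8) as pool:
    # Request 1000 ids from 8 worker threads at once.
    ids = list(pool.map(lambda _: allocator.next_id(), range(1000)))
```

Every id handed out is unique even though the requests come from several threads, which addresses both the collision concern and the multithreading concern raised above.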

@lvhan028
Collaborator

lvhan028 commented Nov 1, 2023

Hi, @aisensiy, sorry for keeping you waiting for such a long time.
We finally finished the review of PR #544.
Let's work on this PR.
Please resolve the conflicts before we move on.

@aisensiy
Contributor Author

aisensiy commented Nov 3, 2023

@lvhan028 ok

@lvhan028
Collaborator

lvhan028 commented Nov 3, 2023

@aisensiy Could you fix it this week? We plan to release a new version next Monday, and we hope to merge this PR before the release.
If you are not available, @AllentDan can take it over.
Don't worry about your credit. Your commit history will not be overwritten.

@aisensiy aisensiy force-pushed the gradio-session-issue branch from 619375f to b800da9 Compare November 3, 2023 15:23
@aisensiy aisensiy force-pushed the gradio-session-issue branch from b800da9 to 9f301c7 Compare November 3, 2023 15:30
@aisensiy aisensiy force-pushed the gradio-session-issue branch from 9f301c7 to 0a41b77 Compare November 3, 2023 16:21
@AllentDan
Collaborator

I made some changes to this PR. Please review them.

@irexyc
Collaborator

irexyc commented Nov 6, 2023

The docstrings in api_server_backend.py and triton_server_backend.py could be updated, for example:
request (gr.Request): the request from a user -> session_id (int): the session id

@aisensiy
Contributor Author

aisensiy commented Nov 6, 2023

I made some changes to this PR. Please review them.

Great. If the session info is neither displayed nor modified, there is no need to return it from the event listener at all. Thanks also for applying the same logic to the other two parts.

@lvhan028 lvhan028 merged commit 11d1093 into InternLM:main Nov 6, 2023
1 check passed
grimoire added a commit that referenced this pull request Dec 18, 2023
* change 'model_format' to 'qwen' when 'model_name' starts with 'qwen' (#575)

* avoid split chinese characters during decoding (#566)

* add solar chat template (#576)

* robust incremental decode for leading space (#581)

* robust incremental decode for leading space

* speed up lookup as prefix_space_tokens is shorter than no_prefix_space_tokens

* add UT and fix qwen stuff

* update solar chat template (#587)

* Revert "[Docs] Simplify `build.md` (#370)" (#586)

This reverts commit 4b5c2bd.

* Fix crash and remove `sys_instruct` from `chat.py` and `client.py`(#591)

* fix crash

* update profile_generation.py

* format

* use self.bos_id

* remove sys_instruct

* bump version to v0.0.12 (#604)

* Add "build from docker" section (#602)

* add build from docker section

* update

* install python package

* update

* update

* update

* Add more user-friendly CLI  (#541)

* add

* import fire in main

* wrap to speed up fire cli

* update

* update docs

* update docs

* fix

* resolve commennts

* resolve confict and add test for cli

* support inference a batch of prompts (#467)

* support inference a batch of prompts

* docstring and assert

* bump version to v0.0.13 (#620)

* Improve api_server and webui usage (#544)

* make IPv6 compatible, safe run for coroutine interrupting

* instance_id -> session_id and fix api_client.py

* update doc

* remove useless faq

* safe ip mapping

* update app.py

* WIP completion

* completion

* update doc

* disable interactive mode for /v1/chat/completions

* docstring

* docstring

* refactor gradio

* update gradio

* udpate

* update doc

* rename

* session_id default -1

* missed two files

* add a APIClient

* add chat func for APIClient

* refine

* add concurrent function

* sequence_start, sequence_end --> interactive_mode

* update doc

* comments

* doc

* better text completion

* remove /v1/embeddings

* comments

* deprecate generate and use /v1/interactive/completions

* /v1/interactive/completion -> /v1/chat/interactive

* embeddings

* rename

* remove wrong arg description

* docstring

* fix

* update cli

* update doc

* strict session_len limit condition

* pass model args to api_server

* fix: gradio gr.Button.update deprecated after 4.0.0 (#637)

* add cli to list the supported model names (#639)

* update

* resolve comment

* Refactor model conversion (#296)

* split deploy.py

* fix get_cuda_tensor

* deploy qwen_awq

* fix lint

* add docstring

* fix

* support baichuan/baichuan-awq

* parameterizing size_per_head

* remove try/except

* limit input model_format

* add quant_path param

* remove old deploy.py

* fix path

* fix transformer layer range when load bins

* fix qwen init

* split & save log

* relative import

* update get_config

* WeightFileMgr -> Reader

* rename

* update

* fix init_layer_id

* rename llama.py -> meta_llama.py, hf.py -> llama.py

* reduce code

* update arg description

* fix meta llama

* manually cleanup meta model params

* [Enchance] internlm message to prompt (#499)

* update turbomind session_len with model.session_len (#634)

* [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized (#605)

* fix awq

* adapt new qwen code

* adapt qwen 14b and baichuan2 7b

* add docstring

* add runtime error for qwen

* FIX: fix stop_session func bug (#578)

* FIX: fix stop_session func bug

* keep sequence_end = False

---------

Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>

* Manage session id using random int for gradio local mode (#553)

* Use session id from gradio state

* use a new session id after reset

* rename session id like a state

* update comments

* reformat files

* init session id on block loaded

* use auto increased session id

* remove session id textbox

* apply to api_server and tritonserver

* update docstring

* add lock for safety

---------

Co-authored-by: AllentDan <[email protected]>

* fix benchmark serving computation mistake (#630)

* fix benchmark serving computation mistake

* fix timestamps computations

* remove speed up

* no mp

* mp seems faster?

* remove

* update

* remove

* fix

* update

* update print log

* typo

* print fist token latency only stream==True

* remove renew_session

* update AsyncEngine

* fix tokenizer_info when convert the model (#661)

* Add check env sub command (#654)

* add check env

* update issue template'

* remove some reqs from check env

* resolve comment

* fix Tokenizer load error when the path of the being-converted  model is not writable (#669)

* Add UltraCM and WizardLM chat templates (#599)

* add ultracm eval chat template

* add WizardLM chat template

* use ultrachat template instead of ultracm usecase

* bump version to v0.0.14 (#663)

* Add extra_requires to reduce dependencies (#580)

* update reqs

* update docs

* resolve comments

* upgrade pydantic

* fix rebase

* update doc

* update

* update

* update readme

* update

* add flash-attn

* TurboMind 2 (#590)

* refresh decoder attention kernel

* block-level kv cache

* `BlockManager` & `SequenceManager`

* update

* update

* update

* update

* rename

* GQA support

* fix context length

* GQA dispatch

* kv8

* tune

* async stream cb

* nvtx

* config parsing

* debug

* optimize output cost

* split-k decoding

* minor

* truncate `session_len` by available blocks

* minor

* license

* fix

* dispatch `cp.async`

* fix linking

* fix

* fix deadlock

* guard input length

* correct start offset

* fix prefill chunking

* fix `cache_block_seq_len` param passing

* fix `block_size` fmtstr

* fix output tokens

* fix batch resizing

* fix masking of finished sequences

* add debug util

* free unused block early

* add ntk scaling and logn scaling

* cmake flags

* fix typo

* w4a16 for sm75

* fix msvc build

* fix msvc build

* fix block verification

* fix msvc build

* use `std::shuffle`

* fix lint

* fix lint

* fix lint

* clear incoming buffer

* clear finished requests

* fix batch initialization

* fix typo

* fix typo

* fix comparison

* [Docs] Update Supported Matrix (#679)

* update supported matrix

* change the default shard size when saving quantized weights

* baichuan2 kv8

* update kv8 docs (#681)

* Fix init of batch state (#682)

* fix init of finished buf

* fix `finished_count`

* fix turbomind stream canceling (#686)

* fix

* instance for each forward

* [Fix] Fix load_checkpoint_in_model bug (#690)

* fix load_checkpoint_in_model bug

* fix comments

* fix comments

* fix bugs

* [Doc] Update restful api doc (#662)

* update restful_api.md

* add a hint

* repeat 3 time

* Fix Tokenizer encode (#645)

* same encode with HF

* sequence_start -> add_bos

* complement

* Fix wrong eos_id and bos_id obtained through grpc api (#644)

* Fix wrong eos_id and bos_id obtained through grpc api

* fix according to review comments

* update

* Optimize for throughput (#701)

* tmp

* update

* update

* optimize for throughput

* update

* fix eos

* clean up

* fix serving

* fix indexed copy

* minor

* minor

---------

Co-authored-by: lvhan028 <[email protected]>

* Check-in user guide about turbomind config (#680)

* update

* update config guide

* update guide

* upate user guide according to review comments

* Replace mmengine with mmengine-lite (#715)

* Support loading hf model directly (#685)

* turbomind support export model params

* fix overflow

* support turbomind.from_pretrained

* fix tp

* support AutoModel

* support load kv qparams

* update auto_awq

* udpate docstring

* export lmdeploy version

* update doc

* remove download_hf_repo

* LmdeployForCausalLM -> LmdeployForCausalLM

* refactor turbomind.py

* update comment

* add bfloat16 convert back

* support gradio run_locl load hf

* support resuful api server load hf

* add docs

* support loading previous quantized model

* adapt pr 690

* udpate docs

* not export turbomind config when quantize a model

* check model_name when can not get it from config.json

* update readme

* remove model_name in auto_awq

* update

* update

* udpate

* fix build

* absolute import

* Fix cache/output length calculation (#738)

* bump version to v0.1.0a0 (#709)

* [Fix] Skip empty batch (#747)

* [Fix] build docker image failed since `packaging` is missing (#753)

* [Fix] Rollback the data type of input_ids to TYPE_UINT32 in preprocessor's proto (#758)

* Set the default value of `max_context_token_num` 1 (#761)

* rename pytorch poc

* fix lint

* add docstring

* add docstring

* refactor patch

* add recompute eviction support

* fix typo (#769)

* add triton server test and workflow yml (#760)

* add triton server test and workflow yml

* update

* revert changes in dockerfile

* update prompts

* recovery modeling

* fix turbomind build on sm<80 (#754)

* fix

* fix lint

* improvement(build): enable ninja and gold linker (#767)

* feat(build): enable ninja and lld

* fix(.github): add ninja installation

* fix(CI): remove dimsize=256

* fix(CI): add option for generate.sh

* fix(docs): update

* Report first-token-latency and token-latency percentiles (#736)

* update profile scripts

* add top_p, top_k and temperature as input arguments

* fix input_ids

* update profile_throughput

* update profile_restful_api

* update profile_serving

* update

* update

* add progress bar

* remove TODO comments

* update

* remove useless profile_* argument

* remove log level

* change concurrency default value to 64

* update restful_api.md

* update according to review comments

* fix docstring

* convert model with hf repo_id (#774)

* bump version to 0.1.0a1 (#776)

* Update benchmark user guide (#763)

* user guide of benchmark generation

* update benchmark generation guide

* update profiling throughput guide

* update profiling api_server guide

* rename file names

* update profile tis user guide

* update

* fix according to review comments

* update

* update according to review comments

* updaste

* add an example

* update

* add docstring

* add unified paging attention support

* refactor block manager

* do not alloc zero

* Fix early exit condition in attention kernel (#788)

* add chat template for Yi (#779)

* Fix missed arguments when benchmark static inference performance (#787)

* minor fix in the profile scripts and docs

* miss arguments

* typo

* fix lint

* update

* Unify prefill & decode passes (#775)

* Unify prefill and decode passes

* dynamic split-fuse

* refactor

* correct input count calculation

* remove unused

* lint

* lint

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* add cuda12.1 build check ci (#782)

* update cuda12.1 build check ci

* use matrix

* auto upload cuda12.1 python pkg to release when create new tag (#784)

* add cuda12-whl-release ci

* enable environment

* test py310-311 windows wheel

* fix py310, py311 setup.py error on windows

* fix lint

* fix extra colon in InternLMChat7B (#796)

* fix local kv head num (#806)

* Report the inference benchmark of models with different size (#794)

* update test scripts for models with different sizes

* update

* only test after tunning gemm

* chmod +x

* fix typo

* benchmark on a100

* fix typo

* fix typo

* per-token latency percentile in profile_throughput

* fix

* fix

* rename

* make the script accept parameters

* minor fix

* indent

* reformat table

* change to 3000

* minor fix

* bump version to v0.1.0a2 (#807)

* fix out of bounds access (#809)

* update scheduler

* optimize request

* Simplify block manager (#812)

* simplify block manager

* fix lint

* set smem size for repetition penalty kernel (#818)

* add mbgemm&mbgemv

* fix recompute, fix mbgmm

---------

Co-authored-by: Lyu Han <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>
lvhan028 added a commit that referenced this pull request Dec 28, 2023
* WIP

* cache engine wip

* finish cache engine

* fix cache and scheduler

* add paged attention

* step and stop

* add infer

* add request process

* fix end

* request without schedulersession

* add logits processor

* better context

* update patch

* [Improve] Use 4d input in pytorch poc (#371)

* 4D input, model.eval and llama config

* use auto dtype

* tp wip

* almost

* update logger

* run_check=false

* little optimize

current best

redist w/o dtensor

host mem in que

less rewrite

less code

update model weight

* share attention forward

* fix end

* Support Baichuan (#382)

* add baichuan WIP

* support baichuan

* support baichuan-13b

* fix

* add chat template

* lint

* comments

* fix

* Move `q_seq_info` into `context` (#398)

* move q seq info into context

* remove debugs

* remove debugs

* alibi wip

* add alibi

* reduce logic block (#435)

* add docstring

* add baichuan lint (#445)

* add fill cache back

* support internlm

* fix path of weight index

* Support chatglm2 in pytorch_poc (#360)

* draft support for chatglm2

* debug llama

* gitignore

* update input_id

* better patching

* patch chatglm2 model

* fix after merge

* remove inits

* q_seq_info & remove some debug & orig_self

* remove old unqeuzze inputid

* update patch and model config

* remove debugs and clean codes

* clean codes

* add credit

* add update id / fix dependency

* rename modules (#504)

Co-authored-by: grimoire <[email protected]>

* optimize fill kv cache (#523)

* optimize fill kv cache

* update internlm

* faster embedding

* fix bias tp

* fix baichuan2

* fix fill kv cache

* fix lint

---------

* Make trust_remote_code as cli argument (#434)

* trust_remote_code_argument

* format

* update tokenizer

* optimize rotary

* wtf

* Support Falcon models (#406)

* move q seq info into context

* falcon aligned

* trust_remote_code_argument

* fix for falcon

* comment out debugs

* comment out debugs

* use position id in context

* remove codes in falcon model

* Revert "comment out debugs"

This reverts commit ee26a25.

* 7b correct

* 1b aligned

* remove debugs

* patch to ignore position ids

* remove debug in alibi, avoid empty inputs

* fix

* rename dir to replace to "models"

* use position_id and new fill kernel

* remove useless get_prompt func

* fix batch>2

* Refactor scheduler (#551)

* optimize block manager

* scheduler wip

* finish scheduler

* update engine

* profile pytorch poc (#455)

* profile pytorch poc

* update doc and import if need

* arg

* support profile_throughput.py

* reuse pytorch session

* end session

* Support Tensor parallel on Falcon models (#582)

* tp falcon 1b and 7b works

* remove debugs

* update copyright

* add some comments

* remove a debug

* support new hub models

* support 40b

* support 40b model config

* try

* recover

* fix remain len

* Apply rotary kernel (#572)

* apply rotary kernel

* format

* update rmsnorm

* update rms norm

* better unittest

* add docstring

---------

Co-authored-by: grimoire <[email protected]>

* fix(pytorch_poc): memory cal (#606)

* fix(pytorch_poc): memory cal

* Optimize attention (#597)

* add unittest

* add split k

* add docstring

* fast split k

* optimize load

* manually setup device and stream

* lint

---------

Co-authored-by: grimoire <[email protected]>

* feat(pytorch_poc): implement ReRoPE (#625)

* fix(pytorch_poc): memory cal

* style(pytorch_poc): lint

* style(.pre-commit-config.yaml): update

* style(pytorch_poc): remove useless

* feat(pytorch_poc): llama2 support rerope

* feat(pytorch_poc): fix long input generate

* feat(lmdeploy): add kernel

* feat(lmdeploy): update

* feat(lmdeploy): add rerope implementation

* fix(lmdeploy/pytorch_poc): apply rotary_emb

* fix(lmdeploy): update

* style(pytorch_poc): format

* style(lmdeploy): fix lint

* style(lmdeploy): typo

* style(pytorch_poc): format

* style(pytorch_poc): format

* fix(pytorch_poc): rms_norm add mask

* style(pytorch_poc/kernels): format rerope

* style(pytorch_poc): format rerope attn function description

* style(lmdeploy/pytorch_poc): format

* style(pytorch_poc): add code ref

* style(pytorch_poc): format rerope attn

* Refactor engine (#623)

* add agent

* optimize postprocess

* optimize decoding fill cache

* add docstring

* logit to cuda

* blocksize 128

* optimize pre/post process

* fix postprocess

* cpu pre/post process

* manually setup stream and device

* remove context

* update model agent

* update max session len

* remove tqdm

* update pre/post process

* inplace kernel

* avoid kv_len computation

* flash decoding with one cache

* remove comment

* add warning when no enough resources

* step if has unfinish

* add request manager

* better fill kv cache

* fix fill kv cache

* optimize prefill attention

* refractor

* refactoring...

* add custom output

* use cache

---------

Co-authored-by: grimoire <[email protected]>

* [Feature] w8a8 based on pytorch poc (#595)

* refactor smoothquant and support load w8a8 model by from_pretrained

* add w8a8 docs

* add w8a8 en docs

* add convert_to_qmodules function

---------

Co-authored-by: grimoire <[email protected]>

* feat(lmdeploy): add rerope quantization (#718)

* feat(lmdeploy): add rerope quantization

* feat(lmdeploy): fix review

* [Refactor & Doc] Improve w8a8 and add docstring (#768)

* WIP

* improve w8a8 and add doc string

* add docstring

* add docstring

* fix lint

* rename pytorch poc (#764)

* rename pytorch poc

* fix lint

* add docstring

* add docstring

* refactor patch

* add recompute eviction support

* recovery modeling

* add docstring

* Unified paging (#860)

* change 'model_format' to 'qwen' when 'model_name' starts with 'qwen' (#575)

* avoid split chinese characters during decoding (#566)

* add solar chat template (#576)

* robust incremental decode for leading space (#581)

* robust incremental decode for leading space

* speed up lookup as prefix_space_tokens is shorter than no_prefix_space_tokens

* add UT and fix qwen stuff

* update solar chat template (#587)

* Revert "[Docs] Simplify `build.md` (#370)" (#586)

This reverts commit 4b5c2bd.

* Fix crash and remove `sys_instruct` from `chat.py` and `client.py`(#591)

* fix crash

* update profile_generation.py

* format

* use self.bos_id

* remove sys_instruct

* bump version to v0.0.12 (#604)

* Add "build from docker" section (#602)

* add build from docker section

* update

* install python package

* update

* update

* update

* Add more user-friendly CLI  (#541)

* add

* import fire in main

* wrap to speed up fire cli

* update

* update docs

* update docs

* fix

* resolve commennts

* resolve confict and add test for cli

* support inference a batch of prompts (#467)

* support inference a batch of prompts

* docstring and assert

* bump version to v0.0.13 (#620)

* Improve api_server and webui usage (#544)

* make IPv6 compatible, safe run for coroutine interrupting

* instance_id -> session_id and fix api_client.py

* update doc

* remove useless faq

* safe ip mapping

* update app.py

* WIP completion

* completion

* update doc

* disable interactive mode for /v1/chat/completions

* docstring

* docstring

* refactor gradio

* update gradio

* udpate

* update doc

* rename

* session_id default -1

* missed two files

* add a APIClient

* add chat func for APIClient

* refine

* add concurrent function

* sequence_start, sequence_end --> interactive_mode

* update doc

* comments

* doc

* better text completion

* remove /v1/embeddings

* comments

* deprecate generate and use /v1/interactive/completions

* /v1/interactive/completion -> /v1/chat/interactive

* embeddings

* rename

* remove wrong arg description

* docstring

* fix

* update cli

* update doc

* strict session_len limit condition

* pass model args to api_server

* fix: gradio gr.Button.update deprecated after 4.0.0 (#637)

* add cli to list the supported model names (#639)

* update

* resolve comment

* Refactor model conversion (#296)

* split deploy.py

* fix get_cuda_tensor

* deploy qwen_awq

* fix lint

* add docstring

* fix

* support baichuan/baichuan-awq

* parameterizing size_per_head

* remove try/except

* limit input model_format

* add quant_path param

* remove old deploy.py

* fix path

* fix transformer layer range when load bins

* fix qwen init

* split & save log

* relative import

* update get_config

* WeightFileMgr -> Reader

* rename

* update

* fix init_layer_id

* rename llama.py -> meta_llama.py, hf.py -> llama.py

* reduce code

* update arg description

* fix meta llama

* manually cleanup meta model params

* [Enchance] internlm message to prompt (#499)

* update turbomind session_len with model.session_len (#634)

* [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized (#605)

* fix awq

* adapt new qwen code

* adapt qwen 14b and baichuan2 7b

* add docstring

* add runtime error for qwen

* FIX: fix stop_session func bug (#578)

* FIX: fix stop_session func bug

* keep sequence_end = False

---------

Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>

* Manage session id using random int for gradio local mode (#553)

* Use session id from gradio state

* use a new session id after reset

* rename session id like a state

* update comments

* reformat files

* init session id on block loaded

* use auto increased session id

* remove session id textbox

* apply to api_server and tritonserver

* update docstring

* add lock for safety

---------

Co-authored-by: AllentDan <[email protected]>

* fix benchmark serving computation mistake (#630)

* fix benchmark serving computation mistake

* fix timestamps computations

* remove speed up

* no mp

* mp seems faster?

* remove

* update

* remove

* fix

* update

* update print log

* typo

* print fist token latency only stream==True

* remove renew_session

* update AsyncEngine

* fix tokenizer_info when convert the model (#661)

* Add check env sub command (#654)

* add check env

* update issue template'

* remove some reqs from check env

* resolve comment

* fix Tokenizer load error when the path of the being-converted  model is not writable (#669)

* Add UltraCM and WizardLM chat templates (#599)

* add ultracm eval chat template

* add WizardLM chat template

* use ultrachat template instead of ultracm usecase

* bump version to v0.0.14 (#663)

* Add extra_requires to reduce dependencies (#580)

* update reqs

* update docs

* resolve comments

* upgrade pydantic

* fix rebase

* update doc

* update

* update

* update readme

* update

* add flash-attn

* TurboMind 2 (#590)

* refresh decoder attention kernel

* block-level kv cache

* `BlockManager` & `SequenceManager`

* update

* update

* update

* update

* rename

* GQA support

* fix context length

* GQA dispatch

* kv8

* tune

* async stream cb

* nvtx

* config parsing

* debug

* optimize output cost

* split-k decoding

* minor

* truncate `session_len` by available blocks

* minor

* license

* fix

* dispatch `cp.async`

* fix linking

* fix

* fix deadlock

* guard input length

* correct start offset

* fix prefill chunking

* fix `cache_block_seq_len` param passing

* fix `block_size` fmtstr

* fix output tokens

* fix batch resizing

* fix masking of finished sequences

* add debug util

* free unused block early

* add ntk scaling and logn scaling

* cmake flags

* fix typo

* w4a16 for sm75

* fix msvc build

* fix msvc build

* fix block verification

* fix msvc build

* use `std::shuffle`

* fix lint

* fix lint

* fix lint

* clear incoming buffer

* clear finished requests

* fix batch initialization

* fix typo

* fix typo

* fix comparison

* [Docs] Update Supported Matrix (#679)

* update supported matrix

* change the default shard size when saving quantized weights

* baichuan2 kv8

* update kv8 docs (#681)

* Fix init of batch state (#682)

* fix init of finished buf

* fix `finished_count`

* fix turbomind stream canceling (#686)

* fix

* instance for each forward

* [Fix] Fix load_checkpoint_in_model bug (#690)

* fix load_checkpoint_in_model bug

* fix comments

* fix comments

* fix bugs

* [Doc] Update restful api doc (#662)

* update restful_api.md

* add a hint

* repeat 3 times

* Fix Tokenizer encode (#645)

* same encode with HF

* sequence_start -> add_bos

* complement

* Fix wrong eos_id and bos_id obtained through grpc api (#644)

* Fix wrong eos_id and bos_id obtained through grpc api

* fix according to review comments

* update

* Optimize for throughput (#701)

* tmp

* update

* update

* optimize for throughput

* update

* fix eos

* clean up

* fix serving

* fix indexed copy

* minor

* minor

---------

Co-authored-by: lvhan028 <[email protected]>

* Check-in user guide about turbomind config (#680)

* update

* update config guide

* update guide

* update user guide according to review comments

* Replace mmengine with mmengine-lite (#715)

* Support loading hf model directly (#685)

* turbomind support export model params

* fix overflow

* support turbomind.from_pretrained

* fix tp

* support AutoModel

* support load kv qparams

* update auto_awq

* update docstring

* export lmdeploy version

* update doc

* remove download_hf_repo

* LmdeployForCausalLM -> LmdeployForCausalLM

* refactor turbomind.py

* update comment

* add bfloat16 convert back

* support gradio run_local load hf

* support restful api server load hf

* add docs

* support loading previous quantized model

* adapt pr 690

* update docs

* not export turbomind config when quantize a model

* check model_name when can not get it from config.json

* update readme

* remove model_name in auto_awq

* update

* update

* update

* fix build

* absolute import

* Fix cache/output length calculation (#738)

* bump version to v0.1.0a0 (#709)

* [Fix] Skip empty batch (#747)

* [Fix] build docker image failed since `packaging` is missing (#753)

* [Fix] Rollback the data type of input_ids to TYPE_UINT32 in preprocessor's proto (#758)

* Set the default value of `max_context_token_num` 1 (#761)

* rename pytorch poc

* fix lint

* add docstring

* add docstring

* refactor patch

* add recompute eviction support

* fix typo (#769)

* add triton server test and workflow yml (#760)

* add triton server test and workflow yml

* update

* revert changes in dockerfile

* update prompts

* recovery modeling

* fix turbomind build on sm<80 (#754)

* fix

* fix lint

* improvement(build): enable ninja and gold linker (#767)

* feat(build): enable ninja and lld

* fix(.github): add ninja installation

* fix(CI): remove dimsize=256

* fix(CI): add option for generate.sh

* fix(docs): update

* Report first-token-latency and token-latency percentiles (#736)

* update profile scripts

* add top_p, top_k and temperature as input arguments

* fix input_ids

* update profile_throughput

* update profile_restful_api

* update profile_serving

* update

* update

* add progress bar

* remove TODO comments

* update

* remove useless profile_* argument

* remove log level

* change concurrency default value to 64

* update restful_api.md

* update according to review comments

* fix docstring

* convert model with hf repo_id (#774)

* bump version to 0.1.0a1 (#776)

* Update benchmark user guide (#763)

* user guide of benchmark generation

* update benchmark generation guide

* update profiling throughput guide

* update profiling api_server guide

* rename file names

* update profile tis user guide

* update

* fix according to review comments

* update

* update according to review comments

* update

* add an example

* update

* add docstring

* add unified paging attention support

* refactor block manager

* do not alloc zero

* Fix early exit condition in attention kernel (#788)

* add chat template for Yi (#779)

* Fix missed arguments when benchmark static inference performance (#787)

* minor fix in the profile scripts and docs

* miss arguments

* typo

* fix lint

* update

* Unify prefill & decode passes (#775)

* Unify prefill and decode passes

* dynamic split-fuse

* refactor

* correct input count calculation

* remove unused

* lint

* lint

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* add cuda12.1 build check ci (#782)

* update cuda12.1 build check ci

* use matrix

* auto upload cuda12.1 python pkg to release when create new tag (#784)

* add cuda12-whl-release ci

* enable environment

* test py310-311 windows wheel

* fix py310, py311 setup.py error on windows

* fix lint

* fix extra colon in InternLMChat7B (#796)

* fix local kv head num (#806)

* Report the inference benchmark of models with different size (#794)

* update test scripts for models with different sizes

* update

* only test after tuning gemm

* chmod +x

* fix typo

* benchmark on a100

* fix typo

* fix typo

* per-token latency percentile in profile_throughput

* fix

* fix

* rename

* make the script accept parameters

* minor fix

* indent

* reformat table

* change to 3000

* minor fix

* bump version to v0.1.0a2 (#807)

* fix out of bounds access (#809)

* update scheduler

* optimize request

* Simplify block manager (#812)

* simplify block manager

* fix lint

* set smem size for repetition penalty kernel (#818)

* add mbgemm&mbgemv

* fix recompute, fix mbgmm

---------

Co-authored-by: Lyu Han <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>

* [Fix] Adapt to the pyTorch poc branch (#863)

* Adapt to the pyTorch poc branch

* Adapt to the pyTorch poc branch

* fix comments

* update model

* update benchmark

* [Fix] Fix conflicts in `lite` (#878)

* cherry-pick Fix meta tensor error commits

* fix smooth quant

---------

Co-authored-by: pppppM <[email protected]>

* [Feature] Support w8a8 tp (#888)

* fix smooth quant save_pretrained

* support w8a8 tp

* change weight and bias in QLinear back to buffer

* remove debug codes and add comments

* fix message step update

* update docs

---------
Co-authored-by: grimoire <[email protected]>
Co-authored-by: WRH <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>
lvhan028 added a commit that referenced this pull request Jan 9, 2024
* WIP

* cache engine wip

* finish cache engine

* fix cache and scheduler

* add paged attention

* step and stop

* add infer

* add request process

* fix end

* request without schedulersession

* add logits processor

* better context

* update patch

* [Improve] Use 4d input in pytorch poc (#371)

* 4D input, model.eval and llama config

* use auto dtype

* tp wip

* almost

* update logger

* run_check=false

* little optimize

current best

redist w/o dtensor

host mem in que

less rewrite

less code

update model weight

* share attention forward

* fix end

* Support Baichuan (#382)

* add baichuan WIP

* support baichuan

* support baichuan-13b

* fix

* add chat template

* lint

* comments

* fix

* Move `q_seq_info` into `context` (#398)

* move q seq info into context

* remove debugs

* remove debugs

* alibi wip

* add alibi

* reduce logic block (#435)

* add docstring

* add baichuan lint (#445)

* add fill cache back

* support internlm

* fix path of weight index

* Support chatglm2 in pytorch_poc (#360)

* draft support for chatglm2

* debug llama

* gitignore

* update input_id

* better patching

* patch chatglm2 model

* fix after merge

* remove inits

* q_seq_info & remove some debug & orig_self

* remove old unsqueeze inputid

* update patch and model config

* remove debugs and clean codes

* clean codes

* add credit

* add update id / fix dependency

* rename modules (#504)

Co-authored-by: grimoire <[email protected]>

* optimize fill kv cache (#523)

* optimize fill kv cache

* update internlm

* faster embedding

* fix bias tp

* fix baichuan2

* fix fill kv cache

* fix lint

---------

* Make trust_remote_code as cli argument (#434)

* trust_remote_code_argument

* format

* update tokenizer

* optimize rotary

* wtf

* Support Falcon models (#406)

* move q seq info into context

* falcon aligned

* trust_remote_code_argument

* fix for falcon

* comment out debugs

* comment out debugs

* use position id in context

* remove codes in falcon model

* Revert "comment out debugs"

This reverts commit ee26a25.

* 7b correct

* 1b aligned

* remove debugs

* patch to ignore position ids

* remove debug in alibi, avoid empty inputs

* fix

* rename dir to replace to "models"

* use position_id and new fill kernel

* remove useless get_prompt func

* fix batch>2

* Refactor scheduler (#551)

* optimize block manager

* scheduler wip

* finish scheduler

* update engine

* profile pytorch poc (#455)

* profile pytorch poc

* update doc and import if need

* arg

* support profile_throughput.py

* reuse pytorch session

* end session

* Support Tensor parallel on Falcon models (#582)

* tp falcon 1b and 7b works

* remove debugs

* update copyright

* add some comments

* remove a debug

* support new hub models

* support 40b

* support 40b model config

* try

* recover

* fix remain len

* Apply rotary kernel (#572)

* apply rotary kernel

* format

* update rmsnorm

* update rms norm

* better unittest

* add docstring

---------

Co-authored-by: grimoire <[email protected]>

* fix(pytorch_poc): memory cal (#606)

* fix(pytorch_poc): memory cal

* Optimize attention (#597)

* add unittest

* add split k

* add docstring

* fast split k

* optimize load

* manually setup device and stream

* lint

---------

Co-authored-by: grimoire <[email protected]>

* feat(pytorch_poc): implement ReRoPE (#625)

* fix(pytorch_poc): memory cal

* style(pytorch_poc): lint

* style(.pre-commit-config.yaml): update

* style(pytorch_poc): remove useless

* feat(pytorch_poc): llama2 support rerope

* feat(pytorch_poc): fix long input generate

* feat(lmdeploy): add kernel

* feat(lmdeploy): update

* feat(lmdeploy): add rerope implementation

* fix(lmdeploy/pytorch_poc): apply rotary_emb

* fix(lmdeploy): update

* style(pytorch_poc): format

* style(lmdeploy): fix lint

* style(lmdeploy): typo

* style(pytorch_poc): format

* style(pytorch_poc): format

* fix(pytorch_poc): rms_norm add mask

* style(pytorch_poc/kernels): format rerope

* style(pytorch_poc): format rerope attn function description

* style(lmdeploy/pytorch_poc): format

* style(pytorch_poc): add code ref

* style(pytorch_poc): format rerope attn

* Refactor engine (#623)

* add agent

* optimize postprocess

* optimize decoding fill cache

* add docstring

* logit to cuda

* blocksize 128

* optimize pre/post process

* fix postprocess

* cpu pre/post process

* manually setup stream and device

* remove context

* update model agent

* update max session len

* remove tqdm

* update pre/post process

* inplace kernel

* avoid kv_len computation

* flash decoding with one cache

* remove comment

* add warning when no enough resources

* step if has unfinish

* add request manager

* better fill kv cache

* fix fill kv cache

* optimize prefill attention

* refactor

* refactoring...

* add custom output

* use cache

---------

Co-authored-by: grimoire <[email protected]>

* [Feature] w8a8 based on pytorch poc (#595)

* refactor smoothquant and support load w8a8 model by from_pretrained

* add w8a8 docs

* add w8a8 en docs

* add convert_to_qmodules function

---------

Co-authored-by: grimoire <[email protected]>

* feat(lmdeploy): add rerope quantization (#718)

* feat(lmdeploy): add rerope quantization

* feat(lmdeploy): fix review

* [Refactor & Doc] Improve w8a8 and add docstring (#768)

* WIP

* improve w8a8 and add doc string

* add docstring

* add docstring

* fix lint

* rename pytorch poc (#764)

* rename pytorch poc

* fix lint

* add docstring

* add docstring

* refactor patch

* add recompute eviction support

* recovery modeling

* add docstring

* Unified paging (#860)

* change 'model_format' to 'qwen' when 'model_name' starts with 'qwen' (#575)

* avoid split chinese characters during decoding (#566)

* add solar chat template (#576)

* robust incremental decode for leading space (#581)

* robust incremental decode for leading space

* speed up lookup as prefix_space_tokens is shorter than no_prefix_space_tokens

* add UT and fix qwen stuff

* update solar chat template (#587)

* Revert "[Docs] Simplify `build.md` (#370)" (#586)

This reverts commit 4b5c2bd.

* Fix crash and remove `sys_instruct` from `chat.py` and `client.py` (#591)

* fix crash

* update profile_generation.py

* format

* use self.bos_id

* remove sys_instruct

* bump version to v0.0.12 (#604)

* Add "build from docker" section (#602)

* add build from docker section

* update

* install python package

* update

* update

* update

* Add more user-friendly CLI (#541)

* add

* import fire in main

* wrap to speed up fire cli

* update

* update docs

* update docs

* fix

* resolve comments

* resolve conflict and add test for cli

* support inference a batch of prompts (#467)

* support inference a batch of prompts

* docstring and assert

* bump version to v0.0.13 (#620)

* Improve api_server and webui usage (#544)

* make IPv6 compatible, safe run for coroutine interrupting

* instance_id -> session_id and fix api_client.py

* update doc

* remove useless faq

* safe ip mapping

* update app.py

* WIP completion

* completion

* update doc

* disable interactive mode for /v1/chat/completions

* docstring

* docstring

* refactor gradio

* update gradio

* update

* update doc

* rename

* session_id default -1

* missed two files

* add a APIClient

* add chat func for APIClient

* refine

* add concurrent function

* sequence_start, sequence_end --> interactive_mode

* update doc

* comments

* doc

* better text completion

* remove /v1/embeddings

* comments

* deprecate generate and use /v1/interactive/completions

* /v1/interactive/completion -> /v1/chat/interactive

* embeddings

* rename

* remove wrong arg description

* docstring

* fix

* update cli

* update doc

* strict session_len limit condition

* pass model args to api_server

* fix: gradio gr.Button.update deprecated after 4.0.0 (#637)

* add cli to list the supported model names (#639)

* update

* resolve comment

* Refactor model conversion (#296)

* split deploy.py

* fix get_cuda_tensor

* deploy qwen_awq

* fix lint

* add docstring

* fix

* support baichuan/baichuan-awq

* parameterizing size_per_head

* remove try/except

* limit input model_format

* add quant_path param

* remove old deploy.py

* fix path

* fix transformer layer range when load bins

* fix qwen init

* split & save log

* relative import

* update get_config

* WeightFileMgr -> Reader

* rename

* update

* fix init_layer_id

* rename llama.py -> meta_llama.py, hf.py -> llama.py

* reduce code

* update arg description

* fix meta llama

* manually cleanup meta model params

* [Enhance] internlm message to prompt (#499)

* update turbomind session_len with model.session_len (#634)

* [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized (#605)

* fix awq

* adapt new qwen code

* adapt qwen 14b and baichuan2 7b

* add docstring

* add runtime error for qwen

* FIX: fix stop_session func bug (#578)

* FIX: fix stop_session func bug

* keep sequence_end = False

---------

Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>

* Manage session id using random int for gradio local mode (#553)

* Use session id from gradio state

* use a new session id after reset

* rename session id like a state

* update comments

* reformat files

* init session id on block loaded

* use auto increased session id

* remove session id textbox

* apply to api_server and tritonserver

* update docstring

* add lock for safety

---------

Co-authored-by: AllentDan <[email protected]>

* fix benchmark serving computation mistake (#630)

* fix benchmark serving computation mistake

* fix timestamps computations

* remove speed up

* no mp

* mp seems faster?

* remove

* update

* remove

* fix

* update

* update print log

* typo

* print first token latency only when stream==True

* remove renew_session

* update AsyncEngine

* fix tokenizer_info when convert the model (#661)

* Add check env sub command (#654)

* add check env

* update issue template

* remove some reqs from check env

* resolve comment

* fix Tokenizer load error when the path of the being-converted model is not writable (#669)

* Add UltraCM and WizardLM chat templates (#599)

* add ultracm eval chat template

* add WizardLM chat template

* use ultrachat template instead of ultracm usecase

* bump version to v0.0.14 (#663)

* Add extra_requires to reduce dependencies (#580)

* update reqs

* update docs

* resolve comments

* upgrade pydantic

* fix rebase

* update doc

* update

* update

* update readme

* update

* add flash-attn

* TurboMind 2 (#590)

* refresh decoder attention kernel

* block-level kv cache

* `BlockManager` & `SequenceManager`

* update

* update

* update

* update

* rename

* GQA support

* fix context length

* GQA dispatch

* kv8

* tune

* async stream cb

* nvtx

* config parsing

* debug

* optimize output cost

* split-k decoding

* minor

* truncate `session_len` by available blocks

* minor

* license

* fix

* dispatch `cp.async`

* fix linking

* fix

* fix deadlock

* guard input length

* correct start offset

* fix prefill chunking

* fix `cache_block_seq_len` param passing

* fix `block_size` fmtstr

* fix output tokens

* fix batch resizing

* fix masking of finished sequences

* add debug util

* free unused block early

* add ntk scaling and logn scaling

* cmake flags

* fix typo

* w4a16 for sm75

* fix msvc build

* fix msvc build

* fix block verification

* fix msvc build

* use `std::shuffle`

* fix lint

* fix lint

* fix lint

* clear incoming buffer

* clear finished requests

* fix batch initialization

* fix typo

* fix typo

* fix comparison

* [Docs] Update Supported Matrix (#679)

* update supported matrix

* change the default shard size when saving quantized weights

* baichuan2 kv8

* update kv8 docs (#681)

* Fix init of batch state (#682)

* fix init of finished buf

* fix `finished_count`

* fix turbomind stream canceling (#686)

* fix

* instance for each forward

* [Fix] Fix load_checkpoint_in_model bug (#690)

* fix load_checkpoint_in_model bug

* fix comments

* fix comments

* fix bugs

* [Doc] Update restful api doc (#662)

* update restful_api.md

* add a hint

* repeat 3 times

* Fix Tokenizer encode (#645)

* same encode with HF

* sequence_start -> add_bos

* complement

* Fix wrong eos_id and bos_id obtained through grpc api (#644)

* Fix wrong eos_id and bos_id obtained through grpc api

* fix according to review comments

* update

* Optimize for throughput (#701)

* tmp

* update

* update

* optimize for throughput

* update

* fix eos

* clean up

* fix serving

* fix indexed copy

* minor

* minor

---------

Co-authored-by: lvhan028 <[email protected]>

* Check-in user guide about turbomind config (#680)

* update

* update config guide

* update guide

* update user guide according to review comments

* Replace mmengine with mmengine-lite (#715)

* Support loading hf model directly (#685)

* turbomind support export model params

* fix overflow

* support turbomind.from_pretrained

* fix tp

* support AutoModel

* support load kv qparams

* update auto_awq

* update docstring

* export lmdeploy version

* update doc

* remove download_hf_repo

* LmdeployForCausalLM -> LmdeployForCausalLM

* refactor turbomind.py

* update comment

* add bfloat16 convert back

* support gradio run_local load hf

* support restful api server load hf

* add docs

* support loading previous quantized model

* adapt pr 690

* update docs

* not export turbomind config when quantize a model

* check model_name when can not get it from config.json

* update readme

* remove model_name in auto_awq

* update

* update

* update

* fix build

* absolute import

* Fix cache/output length calculation (#738)

* bump version to v0.1.0a0 (#709)

* [Fix] Skip empty batch (#747)

* [Fix] build docker image failed since `packaging` is missing (#753)

* [Fix] Rollback the data type of input_ids to TYPE_UINT32 in preprocessor's proto (#758)

* Set the default value of `max_context_token_num` 1 (#761)

* rename pytorch poc

* fix lint

* add docstring

* add docstring

* refactor patch

* add recompute eviction support

* fix typo (#769)

* add triton server test and workflow yml (#760)

* add triton server test and workflow yml

* update

* revert changes in dockerfile

* update prompts

* recovery modeling

* fix turbomind build on sm<80 (#754)

* fix

* fix lint

* improvement(build): enable ninja and gold linker (#767)

* feat(build): enable ninja and lld

* fix(.github): add ninja installation

* fix(CI): remove dimsize=256

* fix(CI): add option for generate.sh

* fix(docs): update

* Report first-token-latency and token-latency percentiles (#736)

* update profile scripts

* add top_p, top_k and temperature as input arguments

* fix input_ids

* update profile_throughput

* update profile_restful_api

* update profile_serving

* update

* update

* add progress bar

* remove TODO comments

* update

* remove useless profile_* argument

* remove log level

* change concurrency default value to 64

* update restful_api.md

* update according to review comments

* fix docstring

* convert model with hf repo_id (#774)

* bump version to 0.1.0a1 (#776)

* Update benchmark user guide (#763)

* user guide of benchmark generation

* update benchmark generation guide

* update profiling throughput guide

* update profiling api_server guide

* rename file names

* update profile tis user guide

* update

* fix according to review comments

* update

* update according to review comments

* update

* add an example

* update

* add docstring

* add unified paging attention support

* refactor block manager

* do not alloc zero

* Fix early exit condition in attention kernel (#788)

* add chat template for Yi (#779)

* Fix missed arguments when benchmark static inference performance (#787)

* minor fix in the profile scripts and docs

* miss arguments

* typo

* fix lint

* update

* Unify prefill & decode passes (#775)

* Unify prefill and decode passes

* dynamic split-fuse

* refactor

* correct input count calculation

* remove unused

* lint

* lint

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* add cuda12.1 build check ci (#782)

* update cuda12.1 build check ci

* use matrix

* auto upload cuda12.1 python pkg to release when create new tag (#784)

* add cuda12-whl-release ci

* enable environment

* test py310-311 windows wheel

* fix py310, py311 setup.py error on windows

* fix lint

* fix extra colon in InternLMChat7B (#796)

* fix local kv head num (#806)

* Report the inference benchmark of models with different size (#794)

* update test scripts for models with different sizes

* update

* only test after tuning gemm

* chmod +x

* fix typo

* benchmark on a100

* fix typo

* fix typo

* per-token latency percentile in profile_throughput

* fix

* fix

* rename

* make the script accept parameters

* minor fix

* indent

* reformat table

* change to 3000

* minor fix

* bump version to v0.1.0a2 (#807)

* fix out of bounds access (#809)

* update scheduler

* optimize request

* Simplify block manager (#812)

* simplify block manager

* fix lint

* set smem size for repetition penalty kernel (#818)

* add mbgemm&mbgemv

* fix recompute, fix mbgmm

---------

Co-authored-by: Lyu Han <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>

* [Fix] Adapt to the pyTorch poc branch (#863)

* Adapt to the pyTorch poc branch

* Adapt to the pyTorch poc branch

* fix comments

* update model

* wip

* wrong implementation

* s-lora single gpu

* refactor tp patch

* add tp support

* add tp gather

* recover profile generation

* daemon process

* inplace gather

* hf style

* add assert when input nothing

* find available port

---------

Co-authored-by: grimoire <[email protected]>
Co-authored-by: WRH <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: tpoisonooo <[email protected]>
Co-authored-by: whcao <[email protected]>
Co-authored-by: Lyu Han <[email protected]>
Co-authored-by: pppppM <[email protected]>
Co-authored-by: Chen Xin <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: Yam(长琴) <[email protected]>
Co-authored-by: liukuikun <[email protected]>
Co-authored-by: yunzhongyan0 <[email protected]>
Co-authored-by: honglei.yan <[email protected]>
Co-authored-by: aisensiy <[email protected]>
Co-authored-by: Li Zhang <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: Qian Zhao <[email protected]>