[Bug] No enough blocks. Assertion fail: /root/lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:358 #720

Closed

dawnranger opened this issue Nov 21, 2023 · 6 comments

dawnranger commented Nov 21, 2023

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

This core dump happens when:

  • max_batch_size/cache_max_entry_count = 1 in config.ini
  • concurrency = 1 in the benchmark script
  • max_batch_size/cache_max_entry_count is large and concurrency is small

When concurrency=1, this config fails:

max_batch_size=4
cache_max_entry_count=4

but this config runs successfully:

max_batch_size=4
cache_max_entry_count=8
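
For reference, both keys live in the [llama] section of the turbomind config.ini. A minimal sketch of the failing setup, assuming the usual layout of that file (all other keys omitted):

[llama]
...
max_batch_size = 4
cache_max_entry_count = 4

Changing cache_max_entry_count from 4 to 8 here is the only difference between the failing and the working run.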

Reproduction

# import multiprocessing as mp
import os.path as osp
import time
from queue import Queue
from threading import Thread
from typing import List
import json

import fire
import numpy as np
import logging
from lmdeploy import turbomind
from lmdeploy.tokenizer import Tokenizer
import os

logger = logging.getLogger(__file__)
logger.setLevel(logging.WARNING)
os.environ['TM_LOG_LEVEL'] = 'ERROR'


def infer(model, tokenizer, session_id: int, task_queue: Queue, output_seqlen: int, info: Queue, result_path: str):

    f = open(f"{result_path}/{session_id}.txt", "w")
    response = ""
    chatbot = model.create_instance()
    stats = []

    while not task_queue.empty():
        start = time.perf_counter()
        text, input_ids = task_queue.get()
        # print('session_id {}, text: {}, input_ids: {}'.format(session_id, text, input_ids))
        cur = ""
        token_cnt = 0  # guard in case stream_infer yields nothing
        for outputs in chatbot.stream_infer(session_id,
                                            input_ids,
                                            request_output_len=output_seqlen,
                                            sequence_start=True,
                                            sequence_end=True,
                                            step=0,
                                            ignore_eos=False,
                                            temperature=0.1,
                                            top_p=0.75,
                                            top_k=40):
            res, token_cnt = outputs[0]
            if cur != "":
                cur += "\n"
            cur += tokenizer.decode(res)
        end = time.perf_counter()
        stats.append([end - start, token_cnt])

        # pair each generated candidate line with its source text
        tmp = ""
        for candidate in cur.split('\n'):
            tmp += candidate + '\t' + text + '\n'
        response += tmp

    info.put(stats)
    f.write(response)
    f.close()
    print('session {} done'.format(session_id))


def main(result_path: str,
         data_path: str,
         model_path: str = 'baichuan_7b_fp16',
         concurrency: int = 4,
         output_seqlen: int = 64,
         tp: int = 1):
    tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
    tokenizer = Tokenizer(tokenizer_model_path)
    tm_model = turbomind.TurboMind(model_path=model_path, tp=tp)

    task_queue = Queue()
    infos = Queue()
    cnt = 0

    data = []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            result = json.loads(line)
            question = "<reserved_102>" + result['question'] + "<reserved_103>"
            data.append((question, result['text']))
    repeat_times = 4
    data = data * repeat_times

    total_input_len = 0
    mx_input_len = 0
    for question, text in data:
        input_ids = tokenizer.encode(question)
        input_len = len(input_ids)
        total_input_len += input_len
        mx_input_len = max(mx_input_len, input_len)
        task_queue.put((text, input_ids))
        cnt += 1
    ave_input_len = total_input_len / max(cnt, 1)  # mean prompt length in tokens
    print(f"ave_input_len {ave_input_len:.1f} mx_input_len {mx_input_len}")
    print(cnt)
    procs = []
    _start = time.perf_counter()

    for i in range(concurrency):
        proc = Thread(target=infer, args=(tm_model, tokenizer, i + 1, task_queue, output_seqlen, infos, result_path))
        procs.append(proc)
        proc.start()

    try:
        for proc in procs:
            proc.join()
    except Exception:
        # threading.Thread has no stop(); just abort the whole process
        exit(1)
    _end = time.perf_counter()
    elapsed_time = _end - _start

    stats = []
    while not infos.empty():
        _stats = infos.get()
        stats.append(_stats)
    # flatten per-thread stats into global latency / token-count lists
    latencies = []
    token_cnts = []
    for stat in stats:
        for latency, token_cnt in stat:
            latencies.append(latency)
            token_cnts.append(token_cnt)
    latency_min = np.min(latencies)
    latency_max = np.max(latencies)
    latency_ave = np.mean(latencies)
    latency_p50 = np.percentile(latencies, 50)
    latency_p90 = np.percentile(latencies, 90)
    latency_p99 = np.percentile(latencies, 99)

    token_cnt_min = np.min(token_cnts)
    token_cnt_max = np.max(token_cnts)
    token_cnt_ave = np.mean(token_cnts)
    token_cnt_p50 = np.percentile(token_cnts, 50)
    token_cnt_p90 = np.percentile(token_cnts, 90)
    token_cnt_p99 = np.percentile(token_cnts, 99)
    total_token_cnt = np.sum(token_cnts)

    print(f"concurrency\t {concurrency}\n"
          f"requests_cnt\t {len(latencies)}\n"
          f"elapsed_time\t{elapsed_time : .2f}\n"
          f"latency_min\t{latency_min : .3f}\n"
          f"latency_ave\t{latency_ave : .3f}\n"
          f"latency_p50\t{latency_p50 : .3f}\n"
          f"latency_p90\t{latency_p90 : .3f}\n"
          f"latency_p99\t{latency_p99 : .3f}\n"
          f"latency_max\t{latency_max : .3f}\n"
          f"token_cnt_min\t{token_cnt_min : .3f}\n"
          f"token_cnt_ave\t{token_cnt_ave : .3f}\n"
          f"token_cnt_p50\t{token_cnt_p50 : .3f}\n"
          f"token_cnt_p90\t{token_cnt_p90 : .3f}\n"
          f"token_cnt_p99\t{token_cnt_p99 : .3f}\n"
          f"token_cnt_max\t{token_cnt_max : .3f}\n"
          f"tok/gpu/s\t{total_token_cnt / elapsed_time / tp : .2f}\n"
          f"req/gpu/s\t{len(latencies) / elapsed_time / tp : .2f}\n")

if __name__ == '__main__':
    fire.Fire(main)
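
Since the entry point is wrapped with fire.Fire(main), the script is driven from the command line; a hypothetical invocation (the script name and paths are placeholders) for the concurrency=1 case looks like:

python repro_benchmark.py --model_path ./baichuan_7b_fp16 --result_path ./results --data_path ./questions.jsonl --concurrency 1 --output_seqlen 64 --tp 1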

Environment

lmdeploy Version: 0.0.14
cuda version: 11.8
os: centos 7
git log:  commit `73386e217cc092e02a5c256e8cf6ce1475b8c6c8`

Error traceback

WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
[TM][WARNING] [LlamaTritonModel] `cache_block_seq_len` is not set, default to 128.
[TM][INFO] Barrier(1)
ave_input_len 71 mx_input_len 518
1600
[TM][INFO] Barrier(1)
[TM][INFO] Barrier(1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 64 MB
[TM][INFO] [BlockManager] max_block_count = 32
[TM][INFO] [BlockManager] chunk_size = 1
[TM][ERROR] LlamaBatch<T>::Start()
[TM][INFO] [InternalThreadEntry] 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] first = 0, last = 1
[TM][INFO] session_len = 512, input_length = 51
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 51, max_input_len = 51, max_context_len = 51
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 29.04 ms
[TM][INFO] [initGen] batch_size = 1
[TM][INFO] [initGen] max_context_len = 52
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            1           52            116         0
[TM][WARNING] [ProcessInferRequests] Request for 2 received.
[TM][INFO] [decodeContext] base = 1, count = 1
[TM][INFO] first = 1, last = 2
[TM][INFO] session_len = 512, input_length = 60
[TM][INFO] [decodeContext] offset = 1, batch_size = 1, token_num = 60, max_input_len = 60, max_context_len = 60
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 28.37 ms
[TM][INFO] [initGen] batch_size = 2
[TM][INFO] [initGen] max_context_len = 61
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            1           53            124         0
[TM][INFO] [initGen]    1            2           61            125         0
[TM][INFO] ------------------------- step = 60 -------------------------
[TM][INFO] [CompleteRequest] slot = 0, id = 1
[TM][INFO] [CompleteRequest] slot = 1, id = 2
[TM][INFO] [forward] Request complete for 2, ec = 0
[TM][INFO] [forward] Request complete for 1, ec = 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 2 received.
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] first = 0, last = 1
[TM][INFO] session_len = 512, input_length = 61
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 61, max_input_len = 61, max_context_len = 61
[TM][INFO] context decoding start
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 28.85 ms
[TM][INFO] [initGen] batch_size = 1
[TM][INFO] [initGen] max_context_len = 62
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2           62            126         0
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [decodeContext] base = 1, count = 1
[TM][INFO] first = 1, last = 2
[TM][INFO] session_len = 512, input_length = 80
[TM][INFO] [decodeContext] offset = 1, batch_size = 1, token_num = 80, max_input_len = 80, max_context_len = 80
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 32.78 ms
[TM][INFO] [initGen] batch_size = 2
[TM][INFO] [initGen] max_context_len = 81
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2           63            144         0
[TM][INFO] [initGen]    1            1           81            145         0
[TM][INFO] ------------------------- step = 80 -------------------------
[TM][INFO] [CompleteRequest] slot = 0, id = 2
[TM][INFO] [CompleteRequest] slot = 1, id = 1
[TM][INFO] [forward] Request complete for 2, ec = 0
[TM][INFO] [forward] Request complete for 1, ec = 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 2 received.
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] first = 0, last = 1
[TM][INFO] session_len = 512, input_length = 170
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 170, max_input_len = 170, max_context_len = 170
[TM][INFO] context decoding start
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 35.66 ms
[TM][INFO] [initGen] batch_size = 1
[TM][INFO] [initGen] max_context_len = 171
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2          171            235         0
[TM][INFO] ------------------------- step = 170 -------------------------
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [decodeContext] base = 1, count = 1
[TM][INFO] first = 1, last = 2
[TM][INFO] session_len = 512, input_length = 77
[TM][INFO] [decodeContext] offset = 1, batch_size = 1, token_num = 77, max_input_len = 77, max_context_len = 77
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 33.58 ms
[TM][INFO] [initGen] batch_size = 2
[TM][INFO] [initGen] max_context_len = 172
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2          172            235         0
[TM][INFO] [initGen]    1            1           78            236         0
[TM][INFO] [CompleteRequest] slot = 0, id = 2
[TM][INFO] [CompleteRequest] slot = 1, id = 1
[TM][INFO] [forward] Request complete for 2, ec = 0
[TM][INFO] [forward] Request complete for 1, ec = 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [RejectInvalidRequests] Skipping invalid infer request for id 2, code = 6
[TM][INFO] [forward] Request complete for 2, ec = 6
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] No enough blocks. Assertion fail: /root/lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:358 

[TM][INFO] [forward] Enqueue requests
Aborted (core dumped)

lvhan028 self-assigned this Nov 21, 2023
@lvhan028 (Collaborator)

@dawnranger (Author)

Thank you for your reply. I've read the guide, but I'm still confused.

  1. Are there any suggested parameters for the kv cache memory setting?
    In the TurboMind 2.0 k/v cache size section, it mentions that the k/v cache memory is determined by cache_block_seq_len, cache_max_entry_count, and cache_chunk_size. However, I'm unsure about the appropriate values for these parameters. I would greatly appreciate any guidance on how to set these parameters.

  2. Does this error indicate that the k/v cache memory is insufficient?
    Following the guide, I adjusted the cache_block_seq_len, cache_max_entry_count, and cache_chunk_size parameters. However, I found that when concurrency=1, regardless of how I modify these parameters, the program inevitably fails with a No enough blocks error. Why would the k/v cache memory be affected by client concurrency?

@lvhan028 (Collaborator)

cache_block_seq_len: 128
cache_max_entry_count: 0.5 (meaning the k/v cache will occupy at most 50% of a GPU card's memory)
cache_chunk_size: 1
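
For concreteness, a sketch of how these recommended values might look in the [llama] section of the generated config.ini, assuming the key names used earlier in this issue (other keys omitted):

[llama]
...
cache_block_seq_len = 128
cache_max_entry_count = 0.5
cache_chunk_size = 1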

@lvhan028 (Collaborator)

I will try to reproduce this issue tomorrow

@lzhangzz (Collaborator)

This seems to be an issue introduced in #590; you may try the latest main branch to see if it helps.
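
For anyone hitting the same assertion, a minimal sketch of moving the source checkout (the path is taken from the assertion message) to the latest main; the exact rebuild/reinstall steps depend on how lmdeploy was installed:

cd /root/lmdeploy
git checkout main
git pull
# rebuild / reinstall lmdeploy from this source tree, then rerun the benchmark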

@dawnranger (Author)

Problem resolved after I updated the code to the latest main branch.
