[Bug] No enough blocks. Assertion fail: /root/lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:358 #720

Closed

dawnranger opened this issue Nov 21, 2023 · 6 comments

dawnranger commented Nov 21, 2023

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

This core dump happens when:

  • max_batch_size/cache_max_entry_count = 1 in config.ini
  • concurrency = 1 in the benchmark script
  • max_batch_size/cache_max_entry_count is large and concurrency is small

When concurrency=1, this config fails:

max_batch_size=4
cache_max_entry_count=4

but this config runs successfully:

max_batch_size=4
cache_max_entry_count=8
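
For reference, both keys live in the [llama] section of the turbomind config.ini. A minimal sketch of the failing setup, assuming the usual layout of that file (all other keys omitted):

[llama]
...
max_batch_size = 4
cache_max_entry_count = 4

Changing cache_max_entry_count from 4 to 8 here is the only difference between the failing and the working run.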

Reproduction

# import multiprocessing as mp
import os.path as osp
import time
from queue import Queue
from threading import Thread
from typing import List
import json

import fire
import numpy as np
import logging
from lmdeploy import turbomind
from lmdeploy.tokenizer import Tokenizer
import os

logger = logging.getLogger(__file__)
logger.setLevel(logging.WARNING)
os.environ['TM_LOG_LEVEL'] = 'ERROR'


def infer(model, tokenizer, session_id: int, task_queue: Queue, output_seqlen: int, info: Queue, result_path: str):

    f = open(f"{result_path}/{session_id}.txt", "w")
    response = ""
    chatbot = model.create_instance()
    stats = []

    while not task_queue.empty():
        start = time.perf_counter()
        text, input_ids = task_queue.get()
        # print('session_id {}, text: {}, input_ids: {}'.format(session_id, text, input_ids))
        cur = ""
        token_cnt = 0  # guard in case stream_infer yields nothing
        for outputs in chatbot.stream_infer(session_id,
                                            input_ids,
                                            request_output_len=output_seqlen,
                                            sequence_start=True,
                                            sequence_end=True,
                                            step=0,
                                            ignore_eos=False,
                                            temperature=0.1,
                                            top_p=0.75,
                                            top_k=40):
            res, token_cnt = outputs[0]
            if cur != "":
                cur += "\n"
            cur += tokenizer.decode(res)
        end = time.perf_counter()
        stats.append([end - start, token_cnt])

        # pair each generated candidate line with its source text
        tmp = ""
        for candidate in cur.split('\n'):
            tmp += candidate + '\t' + text + '\n'
        response += tmp

    info.put(stats)
    f.write(response)
    f.close()
    print('session {} done'.format(session_id))


def main(result_path: str,
         data_path: str,
         model_path: str = 'baichuan_7b_fp16',
         concurrency: int = 4,
         output_seqlen: int = 64,
         tp: int = 1):
    tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
    tokenizer = Tokenizer(tokenizer_model_path)
    tm_model = turbomind.TurboMind(model_path=model_path, tp=tp)

    task_queue = Queue()
    infos = Queue()
    cnt = 0

    data = []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            result = json.loads(line)
            question = "<reserved_102>" + result['question'] + "<reserved_103>"
            data.append((question, result['text']))
    repeat_times = 4
    data = data * repeat_times

    total_input_len = 0
    mx_input_len = 0
    for question, text in data:
        input_ids = tokenizer.encode(question)
        input_len = len(input_ids)
        total_input_len += input_len
        mx_input_len = max(mx_input_len, input_len)
        task_queue.put((text, input_ids))
        cnt += 1
    ave_input_len = total_input_len / max(cnt, 1)  # mean prompt length in tokens
    print(f"ave_input_len {ave_input_len:.1f} mx_input_len {mx_input_len}")
    print(cnt)
    procs = []
    _start = time.perf_counter()

    for i in range(concurrency):
        proc = Thread(target=infer, args=(tm_model, tokenizer, i + 1, task_queue, output_seqlen, infos, result_path))
        procs.append(proc)
        proc.start()

    try:
        for proc in procs:
            proc.join()
    except Exception:
        # threading.Thread has no stop(); just abort the whole process
        exit(1)
    _end = time.perf_counter()
    elapsed_time = _end - _start

    stats = []
    while not infos.empty():
        _stats = infos.get()
        stats.append(_stats)
    # flatten per-thread stats into global latency / token-count lists
    latencies = []
    token_cnts = []
    for stat in stats:
        for latency, token_cnt in stat:
            latencies.append(latency)
            token_cnts.append(token_cnt)
    latency_min = np.min(latencies)
    latency_max = np.max(latencies)
    latency_ave = np.mean(latencies)
    latency_p50 = np.percentile(latencies, 50)
    latency_p90 = np.percentile(latencies, 90)
    latency_p99 = np.percentile(latencies, 99)

    token_cnt_min = np.min(token_cnts)
    token_cnt_max = np.max(token_cnts)
    token_cnt_ave = np.mean(token_cnts)
    token_cnt_p50 = np.percentile(token_cnts, 50)
    token_cnt_p90 = np.percentile(token_cnts, 90)
    token_cnt_p99 = np.percentile(token_cnts, 99)
    total_token_cnt = np.sum(token_cnts)

    print(f"concurrency\t {concurrency}\n"
          f"requests_cnt\t {len(latencies)}\n"
          f"elapsed_time\t{elapsed_time : .2f}\n"
          f"latency_min\t{latency_min : .3f}\n"
          f"latency_ave\t{latency_ave : .3f}\n"
          f"latency_p50\t{latency_p50 : .3f}\n"
          f"latency_p90\t{latency_p90 : .3f}\n"
          f"latency_p99\t{latency_p99 : .3f}\n"
          f"latency_max\t{latency_max : .3f}\n"
          f"token_cnt_min\t{token_cnt_min : .3f}\n"
          f"token_cnt_ave\t{token_cnt_ave : .3f}\n"
          f"token_cnt_p50\t{token_cnt_p50 : .3f}\n"
          f"token_cnt_p90\t{token_cnt_p90 : .3f}\n"
          f"token_cnt_p99\t{token_cnt_p99 : .3f}\n"
          f"token_cnt_max\t{token_cnt_max : .3f}\n"
          f"tok/gpu/s\t{total_token_cnt / elapsed_time / tp : .2f}\n"
          f"req/gpu/s\t{len(latencies) / elapsed_time / tp : .2f}\n")

if __name__ == '__main__':
    fire.Fire(main)
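
Since the entry point is wrapped with fire.Fire(main), the script is driven from the command line; a hypothetical invocation (the script name and paths are placeholders) for the concurrency=1 case looks like:

python repro_benchmark.py --model_path ./baichuan_7b_fp16 --result_path ./results --data_path ./questions.jsonl --concurrency 1 --output_seqlen 64 --tp 1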

Environment

lmdeploy Version: 0.0.14
cuda version: 11.8
os: centos 7
git log:  commit `73386e217cc092e02a5c256e8cf6ce1475b8c6c8`

Error traceback

WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
[TM][WARNING] [LlamaTritonModel] `cache_block_seq_len` is not set, default to 128.
[TM][INFO] Barrier(1)
ave_input_len 71 mx_input_len 518
1600
[TM][INFO] Barrier(1)
[TM][INFO] Barrier(1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 64 MB
[TM][INFO] [BlockManager] max_block_count = 32
[TM][INFO] [BlockManager] chunk_size = 1
[TM][ERROR] LlamaBatch<T>::Start()
[TM][INFO] [InternalThreadEntry] 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] first = 0, last = 1
[TM][INFO] session_len = 512, input_length = 51
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 51, max_input_len = 51, max_context_len = 51
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 29.04 ms
[TM][INFO] [initGen] batch_size = 1
[TM][INFO] [initGen] max_context_len = 52
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            1           52            116         0
[TM][WARNING] [ProcessInferRequests] Request for 2 received.
[TM][INFO] [decodeContext] base = 1, count = 1
[TM][INFO] first = 1, last = 2
[TM][INFO] session_len = 512, input_length = 60
[TM][INFO] [decodeContext] offset = 1, batch_size = 1, token_num = 60, max_input_len = 60, max_context_len = 60
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 28.37 ms
[TM][INFO] [initGen] batch_size = 2
[TM][INFO] [initGen] max_context_len = 61
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            1           53            124         0
[TM][INFO] [initGen]    1            2           61            125         0
[TM][INFO] ------------------------- step = 60 -------------------------
[TM][INFO] [CompleteRequest] slot = 0, id = 1
[TM][INFO] [CompleteRequest] slot = 1, id = 2
[TM][INFO] [forward] Request complete for 2, ec = 0
[TM][INFO] [forward] Request complete for 1, ec = 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 2 received.
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] first = 0, last = 1
[TM][INFO] session_len = 512, input_length = 61
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 61, max_input_len = 61, max_context_len = 61
[TM][INFO] context decoding start
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 28.85 ms
[TM][INFO] [initGen] batch_size = 1
[TM][INFO] [initGen] max_context_len = 62
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2           62            126         0
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [decodeContext] base = 1, count = 1
[TM][INFO] first = 1, last = 2
[TM][INFO] session_len = 512, input_length = 80
[TM][INFO] [decodeContext] offset = 1, batch_size = 1, token_num = 80, max_input_len = 80, max_context_len = 80
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 32.78 ms
[TM][INFO] [initGen] batch_size = 2
[TM][INFO] [initGen] max_context_len = 81
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2           63            144         0
[TM][INFO] [initGen]    1            1           81            145         0
[TM][INFO] ------------------------- step = 80 -------------------------
[TM][INFO] [CompleteRequest] slot = 0, id = 2
[TM][INFO] [CompleteRequest] slot = 1, id = 1
[TM][INFO] [forward] Request complete for 2, ec = 0
[TM][INFO] [forward] Request complete for 1, ec = 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 2 received.
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] first = 0, last = 1
[TM][INFO] session_len = 512, input_length = 170
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 170, max_input_len = 170, max_context_len = 170
[TM][INFO] context decoding start
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 35.66 ms
[TM][INFO] [initGen] batch_size = 1
[TM][INFO] [initGen] max_context_len = 171
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2          171            235         0
[TM][INFO] ------------------------- step = 170 -------------------------
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [decodeContext] base = 1, count = 1
[TM][INFO] first = 1, last = 2
[TM][INFO] session_len = 512, input_length = 77
[TM][INFO] [decodeContext] offset = 1, batch_size = 1, token_num = 77, max_input_len = 77, max_context_len = 77
[TM][INFO] context decoding start
[TM][INFO] context decoding end
[TM][INFO] [decodeContext] 33.58 ms
[TM][INFO] [initGen] batch_size = 2
[TM][INFO] [initGen] max_context_len = 172
[TM][INFO] [initGen] slot  sequence_id  context_len  seq_limit_len  finished
[TM][INFO] [initGen]    0            2          172            235         0
[TM][INFO] [initGen]    1            1           78            236         0
[TM][INFO] [CompleteRequest] slot = 0, id = 2
[TM][INFO] [CompleteRequest] slot = 1, id = 1
[TM][INFO] [forward] Request complete for 2, ec = 0
[TM][INFO] [forward] Request complete for 1, ec = 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [RejectInvalidRequests] Skipping invalid infer request for id 2, code = 6
[TM][INFO] [forward] Request complete for 2, ec = 6
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] No enough blocks. Assertion fail: /root/lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:358 

[TM][INFO] [forward] Enqueue requests
Aborted (core dumped)

lvhan028 self-assigned this Nov 21, 2023
@lvhan028 (Collaborator)

@dawnranger (Author)

Thank you for your reply. I've read the guide, but I'm still confused.

  1. Are there any suggested parameters for the kv cache memory setting?
    In the TurboMind 2.0 k/v cache size section, it mentions that the k/v cache memory is determined by cache_block_seq_len, cache_max_entry_count, and cache_chunk_size. However, I'm unsure about the appropriate values for these parameters. I would greatly appreciate any guidance on how to set these parameters.

  2. Does this error indicate that the k/v cache memory is insufficient?
    Following the guide, I adjusted the cache_block_seq_len, cache_max_entry_count, and cache_chunk_size parameters. However, I found that when concurrency=1, regardless of how I modify these parameters, the program inevitably fails with a No enough blocks error. Why would the k/v cache memory be affected by client concurrency?

@lvhan028 (Collaborator)

cache_block_seq_len: 128
cache_max_entry_count: 0.5 (meaning the k/v cache will occupy at most 50% of a GPU card's memory)
cache_chunk_size: 1
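
For concreteness, a sketch of how these recommended values might look in the [llama] section of the generated config.ini, assuming the key names used earlier in this issue (other keys omitted):

[llama]
...
cache_block_seq_len = 128
cache_max_entry_count = 0.5
cache_chunk_size = 1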

@lvhan028 (Collaborator)

I will try to reproduce this issue tomorrow

@lzhangzz (Collaborator)

This seems to be an issue introduced in #590; you may try the latest main branch to see if it helps.
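
For anyone hitting the same assertion, a minimal sketch of moving the source checkout (the path is taken from the assertion message) to the latest main; the exact rebuild/reinstall steps depend on how lmdeploy was installed:

cd /root/lmdeploy
git checkout main
git pull
# rebuild / reinstall lmdeploy from this source tree, then rerun the benchmark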

@dawnranger (Author)

Problem resolved after I updated the code to the latest main branch.
