Parallelize the per-page serial processing in doc_analyze to improve speed #1566

LCorleone · 2025-01-17T08:57:05Z

for index in range(len(dataset)):
    page_data = dataset.get_page(index)
    img_dict = page_data.get_image()
    img = img_dict['img']
    page_width = img_dict['width']
    page_height = img_dict['height']
    if start_page_id <= index <= end_page_id:
        page_start = time.time()
        result = custom_model(img)
        logger.info(f'-----page_id : {index}, page total time: {round(time.time() - page_start, 2)}-----')
    else:
        result = []

    page_info = {'page_no': index, 'height': page_height, 'width': page_width}
    page_dict = {'layout_dets': result, 'page_info': page_info}
    model_json.append(page_dict)

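I looked at the code: each page is processed independently, so a file with many pages takes a long time. When resources allow, processing the pages in parallel and then merging the results should speed things up considerably. I'm not sure whether this is feasible.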
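A minimal sketch of what is being proposed, assuming the per-page work can be pulled out of the loop into a function with no shared state. `fake_model`, `analyze_page`, `analyze_parallel`, and the list-of-dicts `pages` are hypothetical stand-ins for illustration, not part of the project's API. Threads are used only to keep the example short; whether they actually help depends on whether the real model releases the GIL and is safe to call concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_model(img):
    # Hypothetical stand-in for custom_model: one fake detection per image.
    return [{'category_id': 0, 'size': len(img)}]


def analyze_page(index, img_dict, start_page_id, end_page_id, model):
    # Same per-page logic as the serial loop, without touching shared state.
    if start_page_id <= index <= end_page_id:
        result = model(img_dict['img'])
    else:
        result = []
    page_info = {'page_no': index,
                 'height': img_dict['height'],
                 'width': img_dict['width']}
    return {'layout_dets': result, 'page_info': page_info}


def analyze_parallel(pages, start_page_id, end_page_id, model, max_workers=4):
    # Executor.map yields results in input order, so the merged list
    # stays sorted by page number even if pages finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda item: analyze_page(item[0], item[1],
                                      start_page_id, end_page_id, model),
            enumerate(pages),
        ))


# Hypothetical input: five "pages" whose image payload is just some bytes.
pages = [{'img': b'x' * (i + 1), 'width': 100, 'height': 200} for i in range(5)]
model_json = analyze_parallel(pages, start_page_id=1, end_page_id=3,
                              model=fake_model)
```

Because the results are merged in page order, `model_json` has the same shape as the list built by the serial loop.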


myhloli · 2025-01-17T08:59:31Z

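Performance optimization is already planned. According to our investigation, parallel processing consumes a lot of I/O resources, so the current direction is to saturate a single GPU as far as possible and use larger batches for the speedup.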
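The batching direction described here could look roughly like the sketch below. It assumes a model that accepts a list of images and returns one result list per image; `fake_batch_model`, `analyze_batched`, and `batch_size` are hypothetical names chosen for illustration and do not reflect the project's actual implementation.

```python
def fake_batch_model(imgs):
    # Hypothetical batched model: returns one result list per input image.
    return [[{'category_id': 0, 'size': len(img)}] for img in imgs]


def analyze_batched(pages, start_page_id, end_page_id, batch_model, batch_size=8):
    # Pages outside the requested range keep an empty result,
    # matching the else branch of the serial loop.
    results = [[] for _ in pages]
    selected = [i for i in range(len(pages))
                if start_page_id <= i <= end_page_id]

    # One model call per chunk of pages instead of one call per page.
    for pos in range(0, len(selected), batch_size):
        chunk = selected[pos:pos + batch_size]
        outputs = batch_model([pages[i]['img'] for i in chunk])
        for i, out in zip(chunk, outputs):
            results[i] = out

    return [
        {'layout_dets': results[i],
         'page_info': {'page_no': i, 'height': p['height'], 'width': p['width']}}
        for i, p in enumerate(pages)
    ]


# Hypothetical input: six "pages" whose image payload is just some bytes.
batch_pages = [{'img': b'x' * (i + 1), 'width': 100, 'height': 200}
               for i in range(6)]
batched_json = analyze_batched(batch_pages, start_page_id=1, end_page_id=4,
                               batch_model=fake_batch_model, batch_size=3)
```

With `batch_size=3` and four selected pages, the model is called twice (three pages, then one), while the output keeps one entry per page in page order.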

LCorleone added the enhancement (New feature or request) label · Jan 17, 2025
