Reduce the time to generate the OCR field #30

liseli · 2025-02-25T20:15:31Z

The following changes were applied to this PR to reduce the time to generate the OCR field.

Use a list to collect the text, instead of string concatenation
Remove redundant encoding and decoding operations
Use a context manager to open the zip file
Remove some redundant logs
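
The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `build_ocr_text`, the member names, and the newline separator are all assumptions made for the example.

```python
import io
import zipfile


def build_ocr_text(zip_source, page_names):
    """Join the UTF-8 text of the given zip members into one string.

    Hypothetical helper for illustration; not the function changed in the PR.
    """
    pages = []  # collect page texts in a list instead of growing a string
    # The context manager closes the archive even if a read fails.
    with zipfile.ZipFile(zip_source) as archive:
        for name in page_names:
            # Decode each page exactly once, with no extra encode/decode steps.
            pages.append(archive.read(name).decode("utf-8"))
    # Single concatenation at the end (the separator is an assumption).
    return "\n".join(pages)


# Usage with an in-memory archive standing in for a real document zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00000001.txt", "first page")
    archive.writestr("00000002.txt", "second page")
buffer.seek(0)
ocr_text = build_ocr_text(buffer, ["00000001.txt", "00000002.txt"])
```

Collecting into a list and calling `str.join` once avoids re-copying the accumulated text on every page.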

The following picture shows that most of the time spent generating the full-text document goes to generating the OCR field.

I used seven documents to measure the time to generate the OCR field. Before this PR, it took around 0.38 seconds to generate the OCR field; after this PR, it takes around 0.16 seconds.
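
A measurement like this could be taken with a small timing helper. The sketch below assumes the figures are per-document averages, which the PR does not state; `mean_seconds` and `fake_generate_ocr` are hypothetical names, and the stand-in function only joins strings rather than running the real generator.

```python
import time


def mean_seconds(func, inputs):
    """Return the mean wall-clock time, in seconds, of func over the inputs."""
    elapsed = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def fake_generate_ocr(pages):
    # Hypothetical stand-in for the real OCR field generation.
    return "\n".join(pages)


# Seven synthetic "documents", each a list of page strings.
documents = [["page text"] * 100 for _ in range(7)]
average = mean_seconds(fake_generate_ocr, documents)
```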

Steps to test this PR:

docker build -t document_generator .
docker compose up document_generator -d
docker compose exec document_generator pytest document_generator ht_document ht_queue_service ht_utils

aelkiss

Makes sense to me. It looks like before we were copying and appending to the string, which would be O(n^2) in the number of pages, but now we append each page's data to a list and concatenate once into a string.

Reduce the time to generate the OCR field

359e8a4

liseli requested review from aelkiss and Ronster2018 February 25, 2025 20:21

aelkiss approved these changes Feb 26, 2025

liseli merged commit 40f55b2 into main Feb 26, 2025
1 check passed

liseli deleted the DEV-1565-optimize_OCR_generation branch February 26, 2025 19:20
