Reduce the time to generate the OCR field #30

Merged
merged 1 commit into from
Feb 26, 2025

Conversation

liseli
Contributor

@liseli liseli commented Feb 25, 2025

This PR applies the following changes to reduce the time to generate the OCR field:

  • Use a list to collect the text, instead of string concatenation
  • Remove redundant encoding and decoding operations
  • Use a context manager to open the zip file
  • Remove some redundant logs
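
A minimal sketch of the first three changes, not the code from this PR: `build_ocr_text`, `make_sample_zip`, the page file names, and the newline separator are all hypothetical, and UTF-8 page files are assumed.

```python
import io
import zipfile


def build_ocr_text(zip_source, page_names):
    """Concatenate the text of the given pages stored in a zip archive.

    Hypothetical helper: the name, arguments, and the newline separator
    are assumptions made for illustration.
    """
    pages = []
    # The context manager closes the archive even if a read fails.
    with zipfile.ZipFile(zip_source) as archive:
        for name in page_names:
            # Decode each page once, with no extra encode/decode round trip.
            pages.append(archive.read(name).decode("utf-8"))
    # One join at the end instead of repeated string concatenation.
    return "\n".join(pages)


def make_sample_zip(pages):
    """Build an in-memory zip from a {file name: text} mapping (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in pages.items():
            archive.writestr(name, text)
    buffer.seek(0)
    return buffer


sample = make_sample_zip({"00000001.txt": "first page", "00000002.txt": "second page"})
ocr_text = build_ocr_text(sample, ["00000001.txt", "00000002.txt"])
```

Collecting pages in a list keeps each page copy independent of how much text has already been gathered, which is the point of the first bullet.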

The following picture shows that most of the time spent processing a full-text document goes into generating the OCR field.

image

I used seven documents to measure the time to generate the OCR field. Before this PR, it took around 0.38 seconds to generate the OCR field; after this PR, it takes ~0.16 seconds.

Steps to test this PR:

docker build -t document_generator .
docker compose up document_generator -d
docker compose exec document_generator pytest document_generator ht_document ht_queue_service ht_utils

Member

@aelkiss aelkiss left a comment

Makes sense to me. It looks like before we were copying and appending to the string, which would be n^2 in the number of pages, but now we append each page's data to an array and concatenate once into a string.
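
The n^2 point can be illustrated with a simple cost model, sketched here under the assumption that every `text = text + page` builds a brand-new string. The function names are hypothetical, and CPython can sometimes avoid that copy, so real timings may be less extreme than the model suggests.

```python
def chars_copied_by_concatenation(page_lengths):
    """Characters written if each append copies the whole string so far."""
    total = 0
    current = 0
    for length in page_lengths:
        current += length  # size of the newly built string
        total += current   # the whole new string is written out
    return total


def chars_copied_by_join(page_lengths):
    """Characters written by a single join: each page is copied once."""
    return sum(page_lengths)


lengths = [1000] * 4  # four pages of 1000 characters each
concat_cost = chars_copied_by_concatenation(lengths)  # 1000 + 2000 + 3000 + 4000
join_cost = chars_copied_by_join(lengths)             # 4 * 1000
```

For n pages of equal length L, the first count is L * n * (n + 1) / 2 while the second is L * n, which is the quadratic versus linear difference described above.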

@liseli liseli merged commit 40f55b2 into main Feb 26, 2025
1 check passed
@liseli liseli deleted the DEV-1565-optimize_OCR_generation branch February 26, 2025 19:20