feat/extract_pdf_page_images #3299

Open
huanji1987 opened this issue Jun 25, 2024 · 1 comment
Labels
enhancement New feature or request

Comments


huanji1987 commented Jun 25, 2024

Is your feature request related to a problem? Please describe.
Currently I'm working on a project that uses partition_pdf with the hi_res strategy. The project also requires extracting each page of the PDF as an image. In the code path that partition_pdf with hi_res eventually hits, the per-page images are already being extracted with pdf2image. Instead of extracting the page images separately, it would be ideal to reuse these temporary images, which are currently discarded after the with block.

Describe the solution you'd like
Ideally the partition_pdf function would have an option such as extract_pdf_page_images. When this option is True, instead of using tempfile.TemporaryDirectory() to create a temporary directory for the images, the images would be returned in the response in some way so that they are available for use.
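
A minimal sketch of how such an option might behave, assuming the page images are rendered into a folder by a single call. The names `render_page_images`, `page_image_dir`, and `extract_pdf_page_images_dir` are hypothetical and are not part of unstructured; the only real library call is `pdf2image.convert_from_path` with its `output_folder` and `paths_only` parameters.

```python
import contextlib
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List, Optional

# A renderer takes (pdf_path, output_folder) and returns the image paths it wrote.
Renderer = Callable[[str, str], List[str]]


def pdf2image_renderer(pdf_path: str, output_folder: str) -> List[str]:
    """Render pages with pdf2image (imported lazily, needs poppler installed)."""
    import pdf2image

    return pdf2image.convert_from_path(
        pdf_path, output_folder=output_folder, paths_only=True
    )


@contextlib.contextmanager
def page_image_dir(keep_dir: Optional[str] = None) -> Iterator[str]:
    """Yield a folder for page images: temporary unless keep_dir is given."""
    if keep_dir is None:
        # Current behaviour described in the issue: images vanish after the block.
        with tempfile.TemporaryDirectory() as tmp:
            yield tmp
    else:
        # Hypothetical opt-in: write into a caller-owned folder that survives.
        Path(keep_dir).mkdir(parents=True, exist_ok=True)
        yield keep_dir


def render_page_images(
    pdf_path: str,
    extract_pdf_page_images_dir: Optional[str] = None,
    renderer: Renderer = pdf2image_renderer,
) -> List[str]:
    """Render each page once; return the image paths only if they are kept."""
    with page_image_dir(extract_pdf_page_images_dir) as folder:
        image_paths = renderer(pdf_path, folder)
        # ... layout detection would consume image_paths here ...
        if extract_pdf_page_images_dir is None:
            return []  # paths would dangle once the temporary folder is removed
        return image_paths
```

Taking a target directory rather than a boolean is one possible design: the caller then owns the cleanup, and nothing has to be copied out of the temporary folder. How the paths would be surfaced in the real partition_pdf response is left open here.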

Describe alternatives you've considered
Alternatively, I could do one of the following:

  1. Do this separately and just eat the double work. Unfortunately, pdf2image can be quite slow.
  2. Monkeypatch the code. This is a good temporary fix, but it would likely require locking the version of unstructured used and is not a viable long-term strategy.
  3. Branch the unstructured code and implement a fix for my own use. Like monkeypatching, this is not a viable long-term strategy.
  4. Use a library other than pdf2image that is fast enough that the double work is no big deal. This has been explored and is not viable for various reasons.
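
For illustration, the monkeypatching route in option 2 could be sketched roughly as below. `tap_calls` is a hypothetical helper, not part of unstructured or pdf2image. It temporarily wraps a function so that a callback sees each return value right away, which is the moment when the temporary image files still exist and can be copied. It assumes the patched function is looked up as an attribute of `owner` at call time, which is exactly the kind of internal detail that can change between versions.

```python
import contextlib
import functools
from typing import Any, Callable, Iterator


@contextlib.contextmanager
def tap_calls(
    owner: Any, attr_name: str, on_result: Callable[[Any], None]
) -> Iterator[None]:
    """Temporarily wrap owner.attr_name so on_result sees every return value."""
    original = getattr(owner, attr_name)

    @functools.wraps(original)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = original(*args, **kwargs)
        # Runs immediately after the call, before the caller cleans anything up.
        on_result(result)
        return result

    setattr(owner, attr_name, wrapper)
    try:
        yield
    finally:
        # Always restore the original, even if the body raised.
        setattr(owner, attr_name, original)


# Hypothetical usage (assumes pdf2image is installed and that the patched
# attribute is the one the partitioning code actually calls):
#
#     import pdf2image, shutil
#     def copy_out(paths):
#         for p in paths:
#             shutil.copy(p, "kept_pages/")
#     with tap_calls(pdf2image, "convert_from_path", copy_out):
#         elements = partition_pdf("doc.pdf", strategy="hi_res")
```

This keeps the patch scoped to a single with block, but it still depends on internals and so shares the version-locking drawback noted in the list.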

Additional context
I want to add that I am happy to create a pull request myself for this feature. Mostly just curious about people's thoughts on this and thoughts on the right approach for this if I were to create a PR.

huanji1987 added the enhancement label Jun 25, 2024
gecBurton commented

this issue is duplicated:
