feat/extract_pdf_page_images #3299

huanji1987 · 2024-06-25T23:38:19Z

Is your feature request related to a problem? Please describe.
I'm currently working on a project that uses partition_pdf with the hi_res strategy. The project also requires extracting each page of the PDF as an image. In the code path that partition_pdf with hi_res eventually hits, the per-page images are already extracted with pdf2image. Rather than extracting the page images a second time, it would be ideal to reuse these temporary images, which are currently discarded when the with block exits.

Describe the solution you'd like
Ideally, partition_pdf would accept an option such as extract_pdf_page_images. When this option is True, instead of using tempfile.TemporaryDirectory() to create a temporary directory for the images, the images would be returned in the response in some way so that they are available for use.
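
To make the idea concrete, here is a minimal sketch of the requested behaviour. It is not unstructured's actual code: `render_pages`, `image_output_dir`, and the `render` callback are hypothetical names, and `render` stands in for whatever call (pdf2image in the real pipeline) writes one image file per page into a directory.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical type: a callable that writes one image file per page into out_dir.
Renderer = Callable[[str, Path], None]


def render_pages(
    pdf_path: str,
    render: Renderer,
    image_output_dir: Optional[str] = None,
) -> List[Path]:
    """Render page images and keep them only if image_output_dir is given.

    Returns the files found in image_output_dir after rendering, or an
    empty list when the images were written to a temporary directory.
    """
    if image_output_dir is not None:
        # Requested behaviour: write to a directory the caller owns,
        # so the images survive after this function returns.
        out_dir = Path(image_output_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        render(pdf_path, out_dir)
        return sorted(p for p in out_dir.iterdir() if p.is_file())

    # Behaviour described in the issue: the images live in a temporary
    # directory and are deleted when the with block exits.
    with tempfile.TemporaryDirectory() as tmp:
        render(pdf_path, Path(tmp))
        # ... layout analysis would read the images here ...
    return []
```

Whether the images should come back as file paths, as in-memory objects, or as part of the element metadata is an open design question for the PR; the sketch only shows the switch between a temporary and a caller-owned directory.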

Describe alternatives you've considered
Alternatively, I could do one of the following:

Do the extraction separately and accept the duplicated work. Unfortunately, pdf2image can be quite slow.
Monkeypatch the code. This is a good temporary fix, but it would likely require pinning the version of unstructured and is not a viable long-term strategy.
Fork unstructured and implement a fix for my own use. Like monkeypatching, this is not a viable long-term strategy.
Use a library that is faster than pdf2image so that the duplicated work is no big deal. This has been explored and is not viable for various reasons.
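
For reference, the monkeypatching alternative could look roughly like the sketch below. `capture_calls` is a hypothetical helper, not part of unstructured or pdf2image, and the issue does not name the exact function that would need to be wrapped, so the example works on any object attribute.

```python
import functools
from contextlib import contextmanager
from typing import Any, Iterator, List


@contextmanager
def capture_calls(owner: Any, name: str) -> Iterator[List[Any]]:
    """Temporarily wrap owner.<name> and record every value it returns."""
    original = getattr(owner, name)
    captured: List[Any] = []

    @functools.wraps(original)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = original(*args, **kwargs)
        captured.append(result)  # keep a reference for the caller
        return result

    setattr(owner, name, wrapper)
    try:
        yield captured
    finally:
        # Always restore the original attribute, even if the body raises.
        setattr(owner, name, original)
```

The patch target would have to be the module attribute that the library actually looks up at call time, which can change between releases; that is why this approach ties the project to a pinned version. Also note that if the wrapped call returns paths inside a temporary directory, the files would need to be copied before that directory is cleaned up.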

Additional context
I want to add that I am happy to create a pull request myself for this feature. Mostly just curious about people's thoughts on this and thoughts on the right approach for this if I were to create a PR.

gecBurton · 2024-07-17T11:12:29Z

this issue is duplicated:

Lightweight installation unstructured[pdf] ????? #2976
CPU only installation #3326
and i suspect others too

huanji1987 added the enhancement (New feature or request) label on Jun 25, 2024
