Is your feature request related to a problem? Please describe.
Currently I'm working on a project that uses partition_pdf with the hi_res strategy. The project also needs each page of the PDF extracted as an image. I see that the code partition_pdf with hi_res eventually hits already extracts the per-page images with pdf2image. Rather than extracting the page images a second time, it would be ideal to reuse these temporary images, which are currently discarded when the with block exits.
Describe the solution you'd like
Ideally, the partition_pdf function would have an option such as extract_pdf_page_images. When this option is True, instead of writing the images to a directory created with tempfile.TemporaryDirectory(), the images would be returned in the response in some way so that they are available for use.
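To make the idea concrete, here is a rough sketch of the control flow I have in mind. It is not the actual unstructured implementation: partition_sketch, fake_renderer, and the image_output_dir parameter are hypothetical names invented for illustration, and the renderer is a stand-in for the pdf2image call so the example runs on its own.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Hypothetical renderer signature: (pdf_path, output_dir) -> list of image paths.
# In the real code path this role is played by pdf2image.
PageRenderer = Callable[[str, str], List[str]]


def fake_renderer(pdf_path: str, output_dir: str) -> List[str]:
    """Stand-in renderer that writes two placeholder page files."""
    paths = []
    for page_number in (1, 2):
        page_path = Path(output_dir) / f"page_{page_number}.ppm"
        page_path.write_bytes(b"placeholder image bytes")
        paths.append(str(page_path))
    return paths


def analyze(image_paths: List[str]) -> List[str]:
    """Placeholder for the layout analysis that runs over the rendered pages."""
    return [f"element from {Path(p).name}" for p in image_paths]


def partition_sketch(
    pdf_path: str,
    render_pages: PageRenderer = fake_renderer,
    extract_pdf_page_images: bool = False,  # hypothetical option from this issue
    image_output_dir: Optional[str] = None,  # hypothetical, not an existing parameter
) -> Tuple[List[str], List[str]]:
    """Return (elements, page_image_paths); the second list is empty unless requested."""
    if not extract_pdf_page_images:
        # Behavior as described above: the images vanish when the with block exits.
        with tempfile.TemporaryDirectory() as temp_dir:
            elements = analyze(render_pages(pdf_path, temp_dir))
        return elements, []

    # Proposed behavior: render into a directory that outlives the call.
    output_dir = image_output_dir or tempfile.mkdtemp(prefix="pdf_pages_")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    image_paths = render_pages(pdf_path, output_dir)
    return analyze(image_paths), image_paths
```

With the flag set, the caller gets the image paths back and becomes responsible for cleaning up the directory. Returning paths rather than in-memory images is just one option; attaching them to the response some other way would work too.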
Describe alternatives you've considered
Alternatively, I could do one of the following:
Do the extraction separately and just eat the double work. Unfortunately, pdf2image can be quite slow.
Monkeypatch the code. This is a good temporary fix, but it would likely require pinning the version of unstructured and is not a viable long-term strategy.
Fork unstructured and implement a fix for my own use. As with monkeypatching, this is not a viable long-term strategy.
Use a library other than pdf2image that is fast enough that the double work is no big deal. This has been explored and is not viable for various reasons.
Additional context
I want to add that I am happy to create a pull request myself for this feature. I'm mostly curious what people think of this and what the right approach would be if I were to create a PR.