How to reduce the file size of the extracted html? #1554

DipanshuJuneja · 2022-01-23T07:09:34Z

The HTML that PyMuPDF extracts is far better in quality than what I was getting from other libraries, such as the PDFBox wrapper for Python. My one concern is the output file size, which is considerably larger (1.5 MB) than with the other option (400 KB). I am already skipping images by leaving the fitz.TEXT_PRESERVE_IMAGES flag unset. Apart from this, how can I further reduce the size of the output HTML file? I'm looking for a minified version of the HTML. I would like to preserve whitespace if possible, since the PDF contains a few tables as well. Thanks.

JorjMcKie (Maintainer) · 2022-01-23T09:14:26Z

This is a thin wrapper around an original MuPDF function, so there is no way for me to influence the output, sorry.
Maybe there are postprocessors on the market that offer syntax optimizations (tidy?), but I really don't know much about this area.
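
For readers landing here, one possible post-processing step can be sketched in Python. This is not part of PyMuPDF and not something the thread confirms. It assumes the extracted HTML carries its formatting in repeated inline `style="..."` attributes and that the styled elements have no `class` attribute of their own. The helper names `dedupe_inline_styles` and `extract_html` are hypothetical. Text content is left untouched, so whitespace inside the text survives.

```python
import re

# Matches a double-quoted inline style attribute (assumed output shape).
_STYLE_ATTR = re.compile(r'style="([^"]*)"')


def dedupe_inline_styles(html: str) -> str:
    """Replace inline styles that occur more than once with short CSS classes."""
    counts = {}
    for style in _STYLE_ATTR.findall(html):
        counts[style] = counts.get(style, 0) + 1

    # Only styles used at least twice are worth moving into a class.
    classes = {}
    for style, n in counts.items():
        if n > 1:
            classes[style] = f"s{len(classes)}"

    def swap(match):
        name = classes.get(match.group(1))
        return f'class="{name}"' if name else match.group(0)

    body = _STYLE_ATTR.sub(swap, html)
    if not classes:
        return body
    rules = "".join(f".{name}{{{style}}}" for style, name in classes.items())
    # Prepending the style block is a simplification of this sketch.
    return f"<style>{rules}</style>{body}"


def extract_html(pdf_path: str) -> str:
    """Extract HTML for all pages without images (requires PyMuPDF)."""
    import fitz  # imported lazily so the helper above works without PyMuPDF

    # TEXT_PRESERVE_IMAGES is deliberately left out of the flags.
    flags = fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_PRESERVE_LIGATURES
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text("html", flags=flags) for page in doc)
```

A call such as `dedupe_inline_styles(extract_html("input.pdf"))` would chain the two steps. How much this saves depends on how often the same style string repeats in a given document, so it is worth measuring before and after.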

1 reply

DipanshuJuneja Jan 23, 2022
Author

Sure, thanks @JorjMcKie
