Performance issue with simple PDFs #178

Open
MrCodingCoderCoding opened this issue Nov 4, 2024 · 0 comments

Hello,
I’d like to report a performance issue I’ve encountered since upgrading to the latest version of PyMuPDF4LLM (0.0.17). Since the update, simple PDFs are handled noticeably worse, whereas the previous version handled similar files more smoothly.

Here are some specific issues I noticed when extracting content from the "markdown.pdf" file:

  • Bold and italicized text is misinterpreted.
  • The text "### This is a test with headlines and tables:" is not a headline, but it is marked as one.
  • Markdown formatting within tables is lost (bold, italic, ... are missing).
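
A rough sketch of how the headline problem could be checked is below. It assumes "markdown.pdf" is in the working directory and uses `pymupdf4llm.to_markdown` for the extraction; `find_headings` and `extract_markdown` are hypothetical helper names introduced only for this illustration, not part of the library.

```python
import re

# Matches ATX-style Markdown headings such as "### Title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def find_headings(markdown_text):
    """Return a list of (level, text) for every heading line in the output."""
    headings = []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line.strip())
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def extract_markdown(pdf_path):
    """Convert a PDF to Markdown text (hypothetical wrapper for illustration)."""
    # Imported lazily so find_headings can be used without PyMuPDF4LLM installed.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path)


# Usage (requires pymupdf4llm and the PDF file):
#   for level, text in find_headings(extract_markdown("markdown.pdf")):
#       print(level, text)
```

Any line in the printed list that is ordinary body text in the PDF would be a false headline of the kind described above.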

Additionally, I’d like to suggest a feature enhancement: built-in support for strikethrough text in the library. I have managed to implement a workaround, but native support for this in PyMuPDF4LLM would be a valuable addition.

Thank you for your time and for considering these improvements.


python: 3.11.9 / 2.7.18
pymupdf: ('1.24.11', '1.24.10', '20241003000001')
pymupdf4llm: 0.0.17
