Explicit Encoding Handling for PDF Parsing #8905
Comments
From my research into how both PyPDF and PDFMiner handle text extraction for #8491, I’ve found that the presence of [...]

For PDFMiner, the [...]

Similarly, with PyPDF, when certain fonts are not extracted properly, it could be due to the absence of a translation table (such as the [...]).

So, while we may be unable to correct the poorly extracted text, we can do a post-conversion cleanup that removes the raw character IDs the converters emit as a fallback, preventing these artifacts from affecting downstream tasks (although this could result in some data loss, depending on how often the converter falls back). One approach could be to use the DocumentCleaner component's [...]
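The post-conversion cleanup described above could look roughly like the following. This is a minimal sketch using only Python's standard `re` module, assuming the artifacts have the `(cid:N)` form mentioned in this issue; the function name `strip_cid_artifacts` is hypothetical and not part of Haystack, PyPDF, or PDFMiner.

```python
import re
from typing import Tuple

# Matches the "(cid:123)" placeholders left in extracted text when a glyph
# could not be mapped to a Unicode character.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def strip_cid_artifacts(text: str) -> Tuple[str, int]:
    """Return the cleaned text and the number of placeholders removed."""
    cleaned, removed = CID_PATTERN.subn("", text)
    # Collapse runs of spaces/tabs left behind by the removal.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return cleaned, removed


sample = "Invoice (cid:72)(cid:79) total: 42 EUR (cid:3)"
cleaned, removed = strip_cid_artifacts(sample)
print(cleaned)
print(removed)
```

Returning the removal count alongside the text makes the data-loss trade-off visible: a document where most tokens were placeholders is probably better flagged than silently cleaned.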
More context about the [...]

We load our PDF files using byte loading (e.g. [...]).

Looking more into the Stack Overflow link, it appears that the issue is with the PDF itself, as Ibux mentions. For example, the PDF could be using a custom font that the extractor doesn't know how to convert because that information is missing from the PDF. Quote from one of the comments in the Stack Overflow post: [...]
Also relevant is PDFMiner's documentation for cid:x. For PyPDF issues, the project seems to recommend opening an issue if you can copy text from the PDF and paste it elsewhere without problems (as this would likely mean that the issue lies with PyPDF rather than with the PDF). See py-pdf/pypdf#2295.
Is your feature request related to a problem? Please describe.
PDFs with non-UTF-8 encoding (e.g., ANSI, cp1252) are not indexed correctly in Haystack’s document pipeline. This results in missing text, corrupted characters (e.g., (cid:xx) artifacts), or unreadable embeddings. I request an enhancement that adds automatic encoding detection and conversion to Haystack's PDF parsing components, along with an explicit encoding selection option.
Describe the solution you'd like
Enhance the PDF parsing components by:
Auto-detecting encoding before indexing using libraries like chardet or cchardet.
Providing an explicit encoding parameter (e.g., encoding="utf-8" or encoding="auto") in PDFToTextConverter, PDFPlumberConverter, and PyMuPDFConverter.
Converting extracted text to UTF-8 before it is passed to the embedding pipeline.
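The detect-then-convert flow proposed above might be sketched as follows. To stay self-contained, this sketch swaps the chardet/cchardet detection named in the request for simple trial decoding with the standard library. The function name `decode_to_utf8` and its `candidates` parameter are hypothetical and not an existing Haystack API. Note also that, per the comments in this thread, `(cid:x)` artifacts stem from font mapping information missing in the PDF, so re-decoding bytes like this would not be expected to remove them.

```python
from typing import Sequence, Tuple


def decode_to_utf8(
    raw: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Decode raw bytes with the first candidate encoding that succeeds.

    Returns the decoded text and the name of the encoding that was used.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return raw.decode("latin-1"), "latin-1"


# Bytes that are valid cp1252 but not valid UTF-8.
text, used = decode_to_utf8("café".encode("cp1252"))
# Re-encode as UTF-8 before handing the text to downstream components.
utf8_bytes = text.encode("utf-8")
print(used, text)
```

Trying UTF-8 first is a deliberate choice: valid UTF-8 rarely occurs by accident, whereas single-byte encodings such as cp1252 accept almost any byte sequence and would mask real UTF-8 input.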