Explicit Encoding Handling for PDF Parsing #8905

JasperLS · 2025-02-21T16:25:00Z

Is your feature request related to a problem? Please describe.
PDFs with non-UTF-8 encoding (e.g., ANSI, cp1252) are not indexed correctly in Haystack’s document pipeline. This results in missing text, corrupted characters (e.g., (cid:xx) artifacts), or unreadable embeddings. I request an enhancement to support automatic encoding detection and conversion in the Haystack PDF parsing component and explicit encoding selection options.

Describe the solution you'd like
Enhance the PDF parsing components by:
Auto-detecting encoding before indexing using libraries like chardet or cchardet.
Providing an explicit encoding parameter (e.g., encoding="utf-8" or encoding="auto") in PDFToTextConverter, PDFPlumberConverter, and PyMuPDFConverter.
Converting extracted text to UTF-8 before it is passed to the embedding pipeline.
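
As a rough illustration of the requested behavior, here is a minimal sketch of "detect, then normalize to UTF-8" for a raw byte payload. It uses only the standard library instead of chardet, and the function name `decode_with_fallback` and the candidate encoding list are assumptions made for this example, not part of Haystack. It only applies to text bytes, not to the binary contents of a PDF file.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Try each candidate encoding in order and return (text, encoding_used).

    Hypothetical helper: a stand-in for a real detector such as chardet.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never raises.
    return raw.decode("latin-1"), "latin-1"


def to_utf8(raw: bytes) -> bytes:
    """Re-encode the payload as UTF-8 once it has been decoded."""
    text, _ = decode_with_fallback(raw)
    return text.encode("utf-8")
```

Trying UTF-8 first is deliberate: cp1252 accepts almost any byte sequence, so putting it first would silently mis-decode valid UTF-8 input.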

lbux · 2025-02-22T06:24:00Z

From my research into how both PyPDF and PDFMiner handle text extraction for #8491, I’ve found that the presence of cid:x values often signals that the PDF itself is missing necessary character-to-Unicode mappings. This tends to happen when the fonts or character encodings in the PDF are incomplete or poorly defined. In this case the issue might not be with the extraction itself but with the underlying PDF structure.

For PDFMiner, the cid:x error usually happens because it defaults to showing raw character IDs when it cannot map characters to Unicode. This happens when the PDF uses fonts with no corresponding Unicode mapping (a common occurrence with custom or embedded fonts). If you open the PDF in a viewer, try copying the text and pasting it into a text editor. If it results in gibberish, that usually confirms that the issue lies within the PDF's encoding itself.

Similarly, with PyPDF, when encountering issues with certain fonts not being extracted properly, it could be due to the absence of a translation table (such as the /ToUnicode field for embedded fonts), which makes it difficult to decode the characters properly.
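
To check whether an extracted document is affected, one option is to measure how much of the text consists of these placeholders. The sketch below assumes the placeholders look like `(cid:123)`; the helper names and the 0.2 threshold are arbitrary choices for illustration, not something either library provides.

```python
import re

# Assumed placeholder shape: "(cid:" followed by digits and ")".
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_fraction(text: str) -> float:
    """Return the share of characters in `text` taken up by cid placeholders."""
    if not text:
        return 0.0
    covered = sum(len(match.group(0)) for match in CID_TOKEN.finditer(text))
    return covered / len(text)


def looks_unmapped(text: str, threshold: float = 0.2) -> bool:
    """Flag text that is likely from a PDF lacking character-to-Unicode mappings."""
    return cid_fraction(text) >= threshold
```

A document that trips this check probably needs OCR or a different source file rather than a different encoding setting.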

So, while we may be unable to correct the poorly extracted text, we can do a post-conversion cleanup that removes the raw character IDs the converters fall back to, preventing these artifacts from affecting downstream tasks (although this could result in some data loss, depending on how often the converter falls back). One approach could be to use the DocumentCleaner component's remove_substrings or remove_regex to clean the unwanted patterns.
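
A minimal sketch of that cleanup in plain Python, assuming the artifacts have the form `(cid:123)`. The function name is made up for this example. The same pattern could presumably be passed to DocumentCleaner's remove_regex, but I have not verified how the component treats the leftover whitespace.

```python
import re
from typing import Tuple

# Assumed artifact shape: "(cid:" followed by digits and ")".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def strip_cid_artifacts(text: str) -> Tuple[str, int]:
    """Remove cid placeholders and return (cleaned_text, number_removed)."""
    cleaned, removed = CID_PATTERN.subn("", text)
    # Collapse the runs of spaces or tabs left where placeholders used to be.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return cleaned, removed
```

Returning the count makes the data loss visible, so a pipeline can log or skip documents where most of the content was placeholders.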

sjrl · 2025-02-24T08:58:30Z

More context about the cid:x in this Stack Overflow post: https://stackoverflow.com/questions/66656067/replace-cidnumber-with-chars-using-python-when-extracting-text-from-pdf-fil

We load our PDF files as bytes (e.g., with open(stream, "rb") as fh:), so I don't believe the encoding parameter applies in this case.

Looking more into the Stack Overflow link, it appears that the issue is with the PDF itself, as lbux mentions. For example, the PDF could be using a custom font that the extractor doesn't know how to convert because that information is missing from the PDF. Quote from one of the comments in the Stack Overflow post:

the PDF doesn't contain the information required for regular text extraction, so normal text extractors will fail

lbux · 2025-02-24T23:49:11Z

Also relevant is PDFMiner's documentation for cid:x. For PyPDF issues, the project seems to recommend opening an issue if you can copy text from the PDF and paste it elsewhere without problems (as this would likely mean that the issue lies with PyPDF rather than the PDF). py-pdf/pypdf#2295

julian-risch added the labels "P1" (High priority, add to the next sprint) and "type:feature" (New feature or request) on Feb 21, 2025

julian-risch assigned davidsbatista Feb 24, 2025
