Doc file splitting into multiple image files #1777

Open
llmwesee opened this issue Jul 31, 2024 · 1 comment

Comments

@llmwesee

When uploading .doc or .docx files, the following warnings are displayed:
No acceptable contours found
Contour is not a quadrilateral
lib/python3.10/site-packages/langchain_core/_api/deprecation.py:139: LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted. warn_deprecated(

After uploading, a single .doc file is split into around 59 documents in .png or .jpeg format. These are shown in the Doc Counts sidebar and also appear in the metadata, as illustrated in the attached screenshot.

Screenshot from 2024-07-31 10-02-10

The main issue is that a .doc file is not parsed as a single document the way a .pdf file is. Instead, it is split into multiple .png or .jpeg files, which leads to hallucination during querying.

Please address the parsing issue and provide a solution to handle .doc or .docx files correctly without splitting into multiple image files.

pseudotensor added a commit that referenced this issue Jul 31, 2024
@pseudotensor
Collaborator

pseudotensor commented Jul 31, 2024

We extract images so that they can be processed separately for image question answering, but I understand that it can get messy when there are many images.

I added an environment variable, H2OGPT_DOCX_EXTRACT_IMAGES, so you can control this. Set it to "0" to skip the image extraction step.
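
As a rough illustration of the setting above, here is a minimal Python sketch that builds a process environment with H2OGPT_DOCX_EXTRACT_IMAGES set to "0". Only the variable name and the "0" value come from this thread. The helper name `env_without_docx_images` is hypothetical, and the sketch assumes the variable is read from the environment of the process that runs h2oGPT.

```python
import os

# Variable name and value as given in this thread.
ENV_NAME = "H2OGPT_DOCX_EXTRACT_IMAGES"


def env_without_docx_images(base_env=None):
    """Return a copy of an environment mapping with DOCX image extraction disabled.

    Hypothetical helper for illustration; it does not modify os.environ
    or the mapping passed in.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_NAME] = "0"
    return env


if __name__ == "__main__":
    env = env_without_docx_images()
    print(f"{ENV_NAME}={env[ENV_NAME]}")
    # Pass `env` to whatever starts the server, for example:
    # subprocess.run([... your usual launch command ...], env=env)
```

When running under Docker, the same value would presumably be passed with the standard `-e H2OGPT_DOCX_EXTRACT_IMAGES=0` option of `docker run`, once a build containing this change is available.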

A Docker image with this feature will be available in a new build in a few hours.
