[Feature Request]: Split DeepDoc to an independent program #4642

Open
firezym opened this issue Jan 25, 2025 · 1 comment

firezym commented Jan 25, 2025

I have tested and compared the DeepDoc module with [marker-pdf](https://github.com/VikParuchuri/marker), which has 20k stars and focuses solely on parsing PDFs. I think your OCR quality is better, especially for charts and tables. You could consider splitting the module out as a standalone program and offering a paid online SaaS service for parsing PDFs into Markdown, similar to how [Jina Reader](https://jina.ai/reader/) converts HTML to Markdown.

A standalone service could also contribute to the RAGFlow agent modules, as users might sometimes need to transform PDFs into Markdown for further use with LLMs without necessarily storing them in the knowledge base.

Snify89 commented Jan 25, 2025

It is already somewhat split out.
The only issue is that (Western) European characters (umlauts, etc.) are not handled.

KevinHuSh changed the title from "[Suggestion] Split DeepDoc to an independent program" to "[Feature Request] Split DeepDoc to an independent program" on Jan 26, 2025
KevinHuSh changed the title from "[Feature Request] Split DeepDoc to an independent program" to "[Feature Request]: Split DeepDoc to an independent program" on Jan 26, 2025
3 participants