@hutchhicken

@gabriel-piles
Member

We truly appreciate your contribution to the project! So we can best leverage your insights, would you mind telling us about your background and how you're using the project? This will help us greatly in deciding how to improve it with your feedback.

@hutchhicken
Author

I have what is probably a fairly unique use case. I already have a working LLM / poppler-based text and image extraction workflow without a visual model.

Many of the PDFs I'm targeting, though, have been flattened, which fragments images when transparencies are collapsed. Other PDFs have a composite visual background with text overlaid. For these reasons and others, the images I can extract directly from the PDF with poppler are not always good enough.

I am using this project to 1) determine picture coordinates that I can then use to crop images out of a full JPEG rip of each PDF page, and 2) detect article headers on the pages where articles begin, so I can generate a link on the page that opens a corresponding responsive view of the article.
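As a rough illustration of step 1, here is a minimal sketch of mapping a detected picture box onto the pixel grid of a page rip. The `Segment` field names, the `"Picture"` type label, and the top-left origin are assumptions made for this example, not a confirmed description of the project's JSON output.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical field names, assumed for illustration only.
    left: float         # box origin in page units, measured from the top-left corner
    top: float
    width: float
    height: float
    page_width: float   # page size in the same units as the box
    page_height: float
    type: str           # e.g. "Picture" (assumed label)


def to_pixel_box(seg: Segment, image_width: int, image_height: int, pad: int = 0):
    """Scale a page-space box to pixel coordinates of a full-page raster."""
    sx = image_width / seg.page_width
    sy = image_height / seg.page_height
    # Round outward so the crop never cuts into the picture, then clamp to the image.
    left = max(0, math.floor(seg.left * sx) - pad)
    top = max(0, math.floor(seg.top * sy) - pad)
    right = min(image_width, math.ceil((seg.left + seg.width) * sx) + pad)
    bottom = min(image_height, math.ceil((seg.top + seg.height) * sy) + pad)
    return (left, top, right, bottom)


def picture_boxes(segments, image_width: int, image_height: int, pad: int = 0):
    """Return pixel crop boxes for the segments labeled as pictures."""
    return [
        to_pixel_box(s, image_width, image_height, pad)
        for s in segments
        if s.type == "Picture"
    ]
```

The returned tuple is in (left, upper, right, lower) order, which is the order Pillow's `Image.crop` accepts, so each box could be passed straight to a crop call on the page JPEG.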

My original process translates equations to MathML, so I modify the project's container to skip the otherwise-costly process of generating LaTeX. (I also skip OCR for similar reasons.)

I found that the FastAPI gateway quickly gets overwhelmed with concurrent jobs. So instead I have created a batch workflow (currently spun up as GCP Cloud Run jobs) that grabs and analyzes a single PDF, uploads the JSON to a separate endpoint, then terminates.
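The shape of that one-PDF-per-job flow can be sketched roughly as follows. This is only an outline under assumptions: `fetch`, `analyze`, and `upload` are hypothetical callables standing in for the download step, the layout analysis, and the POST to the separate endpoint, and the payload layout is invented for the example.

```python
import json
import sys
from typing import Callable


def run_single_pdf_job(
    pdf_url: str,
    fetch: Callable[[str], bytes],      # hypothetical: downloads the PDF
    analyze: Callable[[bytes], list],   # hypothetical: returns layout segments
    upload: Callable[[bytes], None],    # hypothetical: sends JSON to the endpoint
) -> int:
    """Process exactly one PDF and return a process exit code."""
    try:
        pdf_bytes = fetch(pdf_url)
        segments = analyze(pdf_bytes)
        payload = json.dumps({"source": pdf_url, "segments": segments})
        upload(payload.encode("utf-8"))
    except Exception as exc:
        # Report the failure and signal it through the exit code.
        print(f"job failed for {pdf_url}: {exc}", file=sys.stderr)
        return 1
    return 0
```

A real entry point would wrap this in `sys.exit(run_single_pdf_job(...))`, so the container terminates after one document and a nonzero exit code marks the run as failed.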

@gabriel-piles
Member

That's really interesting, thanks for sharing!

We're about to start working on a PDF-to-Markdown feature for this service. We'll need a few more months to have something usable, but perhaps this could help you with your processes down the line.

@hutchhicken
Author

Definitely would. My separate process generates simplified HTML (basically CommonMark+). I tried many other approaches, but to create an article that preserves reading order and outputs something human-acceptable, I ended up needing a "thinking" LLM. Your mileage may vary as some of these other models mature.
