adding `sub_element_positions` as discussed #116

hutchhicken · 2025-07-03T22:48:09Z

gabriel-piles · 2025-07-04T09:06:40Z

We truly appreciate your contribution to the project! So we can best leverage your insights, would you mind telling us about your background and how you're using the project? This will help us greatly in deciding how to improve it with your feedback.

hutchhicken · 2025-07-04T13:48:28Z

I have what is probably a fairly unique use case. I already have a working LLM / poppler-based text and image extraction workflow without a visual model.

Many of the PDFs I'm targeting, though, have been flattened, which fragments images when transparencies are collapsed. Other PDFs have a composite visual background with text overlaid. For these reasons and others, the images I can extract directly from the PDF with poppler are not always good enough.

I am using this project to 1) determine picture coordinates that I can then use to crop images from a full JPEG rip of a PDF page, and 2) detect article headers on the pages where PDF articles begin, so I can generate a link on the page that opens a corresponding responsive view of the article.
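As a rough illustration of step 1, here is a minimal sketch of mapping a detected segment onto a page image. It assumes each segment carries `left`, `top`, `width`, `height`, `page_width`, and `page_height` in the same page units; those field names are an assumption for illustration, and `segment_to_pixel_box` is a hypothetical helper, not part of the project.

```python
import math
from typing import Dict, Tuple


def segment_to_pixel_box(
    segment: Dict[str, float], image_size: Tuple[int, int]
) -> Tuple[int, int, int, int]:
    """Scale a segment's page-space box to pixel coordinates of a page image.

    The field names used here are assumed, not confirmed by the project.
    Returns (left, upper, right, lower), the box order that Pillow's
    Image.crop expects.
    """
    img_w, img_h = image_size
    # Scale factors between page units and rendered pixels.
    sx = img_w / segment["page_width"]
    sy = img_h / segment["page_height"]

    # Round outward so the crop never cuts into the detected region.
    left = math.floor(segment["left"] * sx)
    upper = math.floor(segment["top"] * sy)
    right = math.ceil((segment["left"] + segment["width"]) * sx)
    lower = math.ceil((segment["top"] + segment["height"]) * sy)

    # Clamp to the image bounds in case the box spills past the page edge.
    left, upper = max(0, left), max(0, upper)
    right, lower = min(img_w, right), min(img_h, lower)
    return left, upper, right, lower
```

With Pillow installed, the returned tuple could be passed straight to `Image.open(path).crop(box)` to cut the picture out of the page rip.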

My original process translates equations to MathML, so I modify the project's container to skip the otherwise-costly process of generating LaTeX. (I also skip OCR for similar reasons.)

I found that the FastAPI gateway quickly gets overwhelmed with concurrent jobs. So instead I have created a batch workflow (currently spun up as GCP Cloud Run jobs) that grabs and analyzes a single PDF, uploads the JSON to a separate endpoint, then terminates.
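The shape of that one-PDF-per-job flow can be sketched roughly as below. This is only an outline under stated assumptions: `run_single_job` and its three callables (`fetch_pdf`, `analyze`, `upload`) are hypothetical stand-ins for the storage download, the layout analysis call, and the JSON upload, none of which are spelled out here.

```python
import json
import sys
from typing import Callable, Dict, List


def run_single_job(
    pdf_id: str,
    fetch_pdf: Callable[[str], bytes],
    analyze: Callable[[bytes], List[Dict]],
    upload: Callable[[str, str], None],
) -> int:
    """Process exactly one PDF, then return a process exit code.

    All three callables are hypothetical placeholders for the real
    download, analysis, and upload steps.
    """
    try:
        pdf_bytes = fetch_pdf(pdf_id)          # grab a single PDF
        segments = analyze(pdf_bytes)          # run layout analysis on it
        upload(pdf_id, json.dumps(segments))   # send the JSON elsewhere
    except Exception as exc:
        # A non-zero exit code lets the job runner mark the task as failed.
        print(f"job failed for {pdf_id}: {exc}", file=sys.stderr)
        return 1
    return 0
```

A job entry point would then end with something like `sys.exit(run_single_job(...))`, so each container handles one document and terminates instead of queueing concurrent requests on the gateway.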

gabriel-piles · 2025-07-04T13:59:59Z

That's really interesting, thanks for sharing!

We're about to start working on a PDF to Markdown feature for this service. We'll need a few more months to have something usable, but perhaps this could help you with your processes down the line.

hutchhicken · 2025-07-07T12:57:23Z

Definitely would. My separate process generates simplified HTML (basically commonmark+). I tried many other approaches, but to create an article that preserves reading order and outputs something human-acceptable, I ended up needing a "thinking" LLM. Your mileage may vary as some of these other models mature.

adding sub_element_positions as discussed at huridocs#113

c9c860c
