How to get images and text in order as in PDF? #705

salmanulfaris · 2024-04-29T09:24:01Z

PDFParser Version: 2.9

Description:

I want to extract the contents of a PDF, then save the text to a database and the images to storage. The order matters: on page 1, for example, when I reach an image, I need the text that comes after it.

PDF input

A PDF containing some text followed by images on each page.

Expected output & actual output

I need to extract the images and text in the same order as they appear in the PDF.
How can I do that?

Code

This is the code I'm using to extract the images, but the text is not available here:

use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile(public_path('paper.pdf'));
$objects = $pdf->getObjects();
foreach ($objects as $key => $object) {
    // Emits every object as an inline image; non-image objects are not filtered out here.
    echo '<img src="data:image/jpeg;base64,' . base64_encode($object->getContent()) . '" />';
}

k00ni · 2024-05-03T07:13:50Z

Without further investigation I don't think that is possible.

azwhale · 2024-05-13T03:27:54Z

You can use something like the following:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('./test.pdf');
$objects = $pdf->getObjects();
$html = "<html><body>";

foreach ($objects as $key => $object) {
    if ($object instanceof \Smalot\PdfParser\XObject\Image) {
        // Embed image objects as base64 data URIs.
        $image = $object->getContent();
        $html .= "<img src='data:image/jpeg;base64," . base64_encode($image) . "' />";
    } else {
        // Treat every other object as text.
        $text = $object->getText();
        $html .= "<div>{$text}</div>";
    }
}
$html .= "</body></html>";
file_put_contents('./test_to_html.html', $html);

k00ni · 2024-05-13T06:38:29Z

Careful here. There are objects of other types as well, so your else branch is likely to run into an error. Also, Document::getObjects might not return an ordered list. You shouldn't rely on PDFParser adding objects in the same order as they appear while parsing the PDF.

Instead, you could iterate over all pages ($pdf->getPages()) and see if you can get images and text from them (check Page::getText and Page::getXObjects). Might be worth a try.
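
A minimal sketch of the per-page approach suggested above, assuming Page::getText and Page::getXObjects behave as their names suggest. collectPageContent and its $imageClass parameter are hypothetical names introduced here for illustration, not part of PDFParser. Note that this only groups text and images by page; it does not recover their relative order within a page.

```php
<?php
/**
 * Hypothetical helper (not part of PDFParser): collects the text and the
 * base64-encoded images of each page.
 *
 * Assumption: each page offers getText() and getXObjects(), as
 * Smalot\PdfParser\Page does, and getXObjects() returns objects keyed by name.
 */
function collectPageContent(
    iterable $pages,
    string $imageClass = 'Smalot\PdfParser\XObject\Image'
): array {
    $result = [];
    $pageNumber = 0;

    foreach ($pages as $page) {
        $pageNumber++;
        $images = [];

        foreach ($page->getXObjects() as $name => $xObject) {
            // Only keep image XObjects; forms and other XObject types are skipped.
            if ($xObject instanceof $imageClass) {
                $images[$name] = base64_encode($xObject->getContent());
            }
        }

        $result[] = [
            'page' => $pageNumber,
            'text' => $page->getText(),
            'images' => $images,
        ];
    }

    return $result;
}
```

Given the $pdf document from the earlier snippets, collectPageContent($pdf->getPages()) should yield one entry per page, with text and images kept apart. Interleaving them within a page would still need position information that this sketch does not extract.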

salmanulfaris · 2024-05-13T08:35:08Z

We can handle those errors, but the order of the objects is very important to me. I'm scraping a PDF that is the answer key for an exam, and I want to fetch the questions and answers from it and store them in the DB. Questions and options may be either text or images, so I need to identify each question and its answers from the sequence of objects.

Here I'm attaching a sample document:
Example Document.pdf
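
To illustrate why the order matters for this use case, here is a rough sketch of the grouping step, assuming an ordered list of extracted items were available. groupIntoQuestions, the item shape, and the rule that a question starts with a number followed by "." or ")" are all assumptions made for illustration; the thread does not describe the actual layout of the sample document.

```php
<?php
/**
 * Hypothetical helper: groups an ordered list of extracted items into questions.
 *
 * Each item is assumed to look like ['type' => 'text'|'image', 'value' => string].
 * Assumption: a text item starting with a number and "." or ")" opens a question.
 */
function groupIntoQuestions(array $items): array
{
    $questions = [];
    $current = null;

    foreach ($items as $item) {
        $startsQuestion = $item['type'] === 'text'
            && preg_match('/^\s*\d+[.)]/', $item['value']) === 1;

        if ($startsQuestion) {
            // Close the previous question before opening a new one.
            if ($current !== null) {
                $questions[] = $current;
            }
            $current = ['question' => $item, 'parts' => []];
        } elseif ($current !== null) {
            // Text or image that follows a question is attached to it.
            $current['parts'][] = $item;
        }
        // Items before the first question marker are ignored.
    }

    if ($current !== null) {
        $questions[] = $current;
    }

    return $questions;
}
```

If the items arrive out of order, an image option can end up attached to the wrong question, which is why an unordered object list is not enough here.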

k00ni changed the title ~~Help ! How to get images and text in order as in PDF ?~~ How to get images and text in order as in PDF? May 3, 2024

k00ni added the question label May 3, 2024

k00ni mentioned this issue Nov 14, 2024

Extracting graphics from a PDF #747

Open
