How to get images and text in order as in PDF? #705

Open
salmanulfaris opened this issue Apr 29, 2024 · 5 comments
@salmanulfaris

  • PDFParser Version: 2.9

Description:

I want to extract a PDF, then save the text to a database and the images to storage. The order matters: on page 1, for example, when I reach an image I need to get the text that comes after it.

PDF input

A PDF containing some text followed by images on each page.

Expected output & actual output

I need to extract the images and text in the same order as they appear in the PDF.
How can I do that?

Code

This is the code I'm using to extract the images, but the text is not available here:

use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile(public_path('paper.pdf'));
// getObjects() returns every object in the document, not only images.
$objects = $pdf->getObjects();
foreach ($objects as $key => $object) {
    echo '<img src="data:image/jpeg;base64,' . base64_encode($object->getContent()) . '" />';
}
@k00ni changed the title from "Help ! How to get images and text in order as in PDF ?" to "How to get images and text in order as in PDF?" May 3, 2024
@k00ni k00ni added the question label May 3, 2024
@k00ni
Collaborator

k00ni commented May 3, 2024

Without further investigation I don't think that is possible.

@azwhale

azwhale commented May 13, 2024

You can use something like the code below:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('./test.pdf');
$objects = $pdf->getObjects();
$html = "<html><body>";


foreach ($objects as $key => $object) {
    if($object instanceof Smalot\PdfParser\XObject\Image ){
        $image = $object->getContent();
        $html .= "<img src='data:image/jpeg;base64," . base64_encode($image) . "' />";
    }else{
        $text =  $object->getText();
        $html .= "<div>{$text}</div>";
    }
}
$html .= "</body></html>";
file_put_contents('./test_to_html.html', $html);

@k00ni
Collaborator

k00ni commented May 13, 2024

Careful here. There are objects of other types as well, so your else-part is likely to run into an error. Also, Document::getObjects might not return an ordered list. You shouldn't rely on the fact that PDFParser added objects in the same order as they appear while parsing the PDF.

Instead, you could iterate over all pages ($pdf->getPages()) and see if you can get images and text from each of them (check Page::getText and Page::getXObjects). Might be worth a try.
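The per-page idea above can be sketched roughly as follows. This is a minimal sketch, not a tested solution for this PDF: the helper name collectPageContent and the file name in the usage comment are made up for illustration, and it assumes smalot/pdfparser is installed via Composer. It only groups text and images by page. It does not recover the order of text and images within a single page.

```php
<?php
// Assumption: smalot/pdfparser is installed and autoloaded by Composer.
use Smalot\PdfParser\XObject\Image;

/**
 * Hypothetical helper: collect the text and the image XObjects of each page.
 *
 * @param array $pages Page objects, e.g. the result of Document::getPages()
 * @return array One entry per page: ['page' => int, 'text' => string, 'images' => array]
 */
function collectPageContent(array $pages): array
{
    $result = [];

    // array_values(): number the pages ourselves instead of relying on array keys.
    foreach (array_values($pages) as $index => $page) {
        $images = [];

        foreach ($page->getXObjects() as $name => $xobject) {
            // XObjects can also be forms or other types, so keep images only.
            if ($xobject instanceof Image) {
                $images[$name] = $xobject->getContent();
            }
        }

        $result[] = [
            'page'   => $index + 1,
            'text'   => $page->getText(),
            'images' => $images,
        ];
    }

    return $result;
}

// Usage (file name is a placeholder):
// require 'vendor/autoload.php';
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('./test.pdf');
// foreach (collectPageContent($pdf->getPages()) as $entry) {
//     // store $entry['text'] in the DB and each image in storage
// }
```

Checking the type with instanceof avoids calling text methods on objects that are not meant to provide text. For ordering within a page, Page::getDataTm() might help with text positions, but as far as I can tell the library has no equivalent call for image positions, so that part would still need investigation.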

@salmanulfaris
Author

We can handle those errors, but the order of the objects is very important for me. I'm scraping a PDF that is the answer key of an exam, and I want to fetch the questions and answers from it and store them in a DB. Questions and options may be either text or images, so I need to identify each question and its answers from the sequence of objects.

I'm attaching a sample document:
Example Document.pdf

@huelsgp27

This comment was marked as off-topic.
