Metadata content garbled for some PDFs #730

Open
rdmpage opened this issue Aug 1, 2024 · 0 comments

rdmpage commented Aug 1, 2024

  • PHP Version: 7.4.33
  • PDFParser Version: v2.10.0

Description:

For some PDFs (e.g., the one attached) the metadata is garbled. This seems to be associated with PDFs that are encrypted, but I don't know enough about the PDF standard to know whether encryption also applies to metadata.

PDF input

TZ_316_4_Gorochov.pdf

Expected output & actual output

The output from mutool is what I expect, e.g. the Title is SYSTEMATICS OF THE AMERICAN KATYDIDS (ORTHOPTERA: TETTIGONIIDAE). COMMUNICATION 2:

mutool info TZ_316_4_Gorochov.pdf
TZ_316_4_Gorochov.pdf:

PDF-1.6
Info object (68 0 R):
<</CreationDate(D:20121225141316+04'00')/Author(A.V. Gorochov)/Creator(PScript5.dll Version 5.2.2)/Producer(Acrobat Distiller 9.5.2 \(Windows\))/ModDate(D:20121225161815+04'00')/Title(SYSTEMATICS OF THE AMERICAN KATYDIDS \(ORTHOPTERA: TETTIGONIIDAE\). COMMUNICATION 2)>>
Encryption object (70 0 R):
<</Length 128/Filter/Standard/O<0EBA1908E5CD53B188213637794EA65838027C93E38494B55544F4375B294C90>/P -1036/R 3/U<8049AC430DA9683FBBC0F5C6392E856600000000000000000000000000000000>/V 2>>
Pages: 22
...

What I get from PdfParser is the following:

*** Metadata ***
Array
(
    [CreationDate] => CŠtW“Ò˙Mð,¯š Wgá3agí
ÂQ©wèAuthor] => F…Iœ§E
    [Creator] => Wþ%Õ’^J…Vt¾øt?[ºzqbäÿ#i
    [Producer] => FÎÞ†^_é[²l÷Â}>ì:dyøí%
                                       a¤»fi²å
    [ModDate] => CŠtW“Ò˙Mð.¯Œ Tgá3agí
¨wèu³pô.@‘Ïˇ{@[òÜ¡ÐèU^éÛ3x=؈"¬OÔLŽOˆFêfl½‚,‹'f	H‚6
    [Pages] => 22
)
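For context on why the values above look like random bytes: with the standard security handler shown in the mutool output (`/Filter/Standard`, `/V 2`, `/R 3`, `/Length 128`), the PDF specification encrypts the strings of the document, including those in the Info dictionary, with RC4 under a key derived per object. The following is a minimal sketch of that key derivation and decryption, assuming an empty user password. The function names `rc4`, `fileKey` and `objectKey` are hypothetical and not part of PdfParser, and the first element of the trailer `/ID` array, which the derivation needs, is not shown in this issue and would have to be read from the file.

```php
<?php
// Sketch of the standard security handler for /V 2, /R 3 (RC4).
// These helpers are illustrative only; they are not PdfParser API.

/** Plain RC4. Encryption and decryption are the same operation. */
function rc4(string $key, string $data): string
{
    $s = range(0, 255);
    $keyLen = strlen($key);
    // Key scheduling: permute the state array using the key bytes.
    for ($i = 0, $j = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
    }
    // Keystream generation, XORed with the input byte by byte.
    $out = '';
    $i = $j = 0;
    $len = strlen($data);
    for ($n = 0; $n < $len; $n++) {
        $i = ($i + 1) & 0xFF;
        $j = ($j + $s[$i]) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
        $out .= chr(ord($data[$n]) ^ $s[($s[$i] + $s[$j]) & 0xFF]);
    }
    return $out;
}

/**
 * File encryption key for revision 3.
 * $o is the raw /O string, $p the /P integer, $firstId the raw first
 * element of the trailer /ID array, $keyBytes is /Length divided by 8.
 */
function fileKey(string $o, int $p, string $firstId, int $keyBytes, string $userPassword = ''): string
{
    // Fixed 32-byte padding string defined by the PDF specification.
    $pad = hex2bin(
        '28bf4e5e4e758a41' . '64004e56fffa0108' .
        '2e2e00b6d0683e80' . '2f0ca9fe6453697a'
    );
    $padded = substr($userPassword . $pad, 0, 32);
    // /P is hashed as an unsigned 32-bit little-endian value.
    $hash = md5($padded . $o . pack('V', $p & 0xFFFFFFFF) . $firstId, true);
    // Revision 3 re-hashes the first $keyBytes bytes 50 times.
    for ($i = 0; $i < 50; $i++) {
        $hash = md5(substr($hash, 0, $keyBytes), true);
    }
    return substr($hash, 0, $keyBytes);
}

/** Per-object key: the file key salted with object and generation number. */
function objectKey(string $fileKey, int $objNum, int $genNum): string
{
    $material = $fileKey
        . substr(pack('V', $objNum), 0, 3)  // low 3 bytes of the object number
        . pack('v', $genNum);               // low 2 bytes of the generation number
    return substr(md5($material, true), 0, min(strlen($fileKey) + 5, 16));
}
```

Under these assumptions, a metadata value from the Info object (68 0 in this file) would be recovered with `rc4(objectKey($key, 68, 0), $rawString)`, where `$key` comes from `fileKey()` with the `/O` and `/P` values above and a key size of 16 bytes, and `$rawString` is the string as stored in the file after PDF escape sequences have been undone.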

Code

<?php

// Example of PDF with bad characters

require_once __DIR__ . '/vendor/autoload.php';

$filename = 'TZ_316_4_Gorochov.pdf';

$parser_config = new \Smalot\PdfParser\Config();
$parser_config->setRetainImageContent(false);
$parser_config->setIgnoreEncryption(true);

$parser = new \Smalot\PdfParser\Parser([], $parser_config);

// parse PDF
$pdf = $parser->parseFile($filename);
	
// Metadata
if (method_exists($pdf, 'getDetails'))
{
	$metadata = $pdf->getDetails();

	echo "*** Metadata ***\n";
	print_r($metadata); 

}

?>
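As a possible stopgap for scripts like the one above, the raw file can be checked for an `/Encrypt` entry before the result of `getDetails()` is trusted. This is only a heuristic sketch: `pdfLooksEncrypted` is a hypothetical helper, not part of PdfParser, and a plain byte search can in principle match unrelated data inside a stream.

```php
<?php
// Hypothetical helper, not part of PdfParser.
// An encrypted PDF carries an /Encrypt entry in its trailer (or in the
// cross-reference stream dictionary), either as an indirect reference
// such as "70 0 R" or as an inline dictionary.
function pdfLooksEncrypted(string $rawPdf): bool
{
    return preg_match('/\/Encrypt\s*(\d+\s+\d+\s+R|<<)/', $rawPdf) === 1;
}

$filename = 'TZ_316_4_Gorochov.pdf';

if (is_readable($filename) && pdfLooksEncrypted(file_get_contents($filename))) {
    echo "Warning: {$filename} appears to be encrypted; metadata strings may be unreadable.\n";
}
```

The check runs on the file bytes only, so it does not depend on how the parser itself handles the encryption dictionary.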

@k00ni k00ni added the bug label Aug 2, 2024