Some PDF documents contain text that I can't extract #750

Open
cyril-fsm opened this issue Dec 5, 2024 · 3 comments

@cyril-fsm

  • PHP Version: 7.4.11
  • PDFParser Version: 2.11

Description:

Hello,
Some PDF documents seem impossible to parse (text extraction fails). What is the reason, and what can I do? Thank you for your help.

PDF input

https://telechargements.soludedia.fr/divers/LFMF.pdf

Expected output & actual output

Code

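// Load PdfParser's classes via the standalone autoloader (path from the report).
require_once "./inc/pdfparser-2024/alt_autoload.php";

// Read the PDF; file_get_contents() returns false on failure.
$f = file_get_contents('file.pdf');
if ($f !== false) {
	$parser   = new \Smalot\PdfParser\Parser();
	$document = $parser->parseContent($f);
	$pages    = $document->getPages();
	$page     = $pages[0]; // first page only
	$content  = $page->getText();

	// Escape the text so characters such as < and & are shown literally.
	echo '<pre>' . htmlspecialchars($content, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</pre>';
}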
@k00ni
Collaborator

k00ni commented Dec 6, 2024

What do you mean by "impossible to parse (text extraction)"? Please be more specific.

@k00ni k00ni added the bug label Dec 6, 2024
@cyril-fsm
Author

Hello, I get a sequence of characters unrelated to the text of the PDF.
Instead of:

ATTERRISSAGE A VUE
Visual landing

FAYENCE
AD 2 LFMF ATT 01

Ouvert à la CAP
Public air traffic
30 NOV 23

ALT AD : 738 (27 hPa)
LFMF
LAT : 43 36 29 N
VAR : 2° E (20)
LONG : 006 42 06 E
...

I get

��
���
�� !"#�$"#
���������
�%&'(!
 ���
�$"�(!)*&'!
������
��������
������
������
�
�+#�#&%",!
 �������
�-�� .�� ���
������ ������������	��������� �������
��������������
��������
�� ����� ���
/%0%(1/%0%(1
���2�32���2�32
4!� 5!( !
-&�24$1
4!� 5!( !
-&�24$1
4!��4%"
�2#&
4!��4%"
�2#&
4!��4%"
�.'!#&
4!��4%"
�.'!#&
4!�6$'((!
�7'
6%#8'!
4!�6$'((!
�7'
6%#8'!
-&�24$1-&�24$1
4!#
�'%&(!�35! �"#
4!#
�'%&(!�35! �"#
4!#
34$&#
4!#
34$&#
4!#
3($'�#
4!#
3($'�#
�$ 9%(7�!�$ 9%(7�!
4!#
6'(!#
4!#
6'(!#
4!
35!)%4�!(
4!
35!)%4�!(
...
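One way to make a report like this more specific is to measure how much of the extracted text is unreadable. The following is only a minimal sketch: `garbledRatio` is a hypothetical helper, not part of PdfParser, and it assumes the extracted text is a UTF-8 string in which failed characters show up as U+FFFD replacement characters or stray control characters, as in the output above. It does not explain why the extraction fails; it only flags pages that are probably affected.

```php
<?php
// Hypothetical helper (not part of PdfParser): returns the fraction of
// characters in $text that are U+FFFD replacement characters or ASCII
// control characters other than tab, line feed, and carriage return.
function garbledRatio(string $text): float
{
    if ($text === '') {
        return 0.0; // nothing to measure
    }

    // Count code points; the /u modifier makes the pattern UTF-8 aware.
    $total = preg_match_all('/./su', $text);
    if ($total === false || $total === 0) {
        // preg_match_all() fails on invalid UTF-8; treat that as fully garbled.
        return 1.0;
    }

    // Count replacement characters and unexpected control characters.
    $bad = preg_match_all('/[\x{FFFD}\x00-\x08\x0B\x0C\x0E-\x1F]/u', $text);

    return $bad / $total;
}

// Illustrative strings only; in practice $text would come from getText().
printf("%.2f\n", garbledRatio("ATTERRISSAGE A VUE"));   // prints 0.00
printf("%.2f\n", garbledRatio("\u{FFFD}\u{FFFD}ab"));   // prints 0.50
```

A ratio near zero does not prove the text is correct, since characters can also be mapped to the wrong but printable symbols (as in `4!� 5!( !` above), so the number is a rough indicator rather than a test.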

@netants2015

I get the same error. The document contains text boxes whose content is in Chinese, but that content is garbled when read. The Chinese text in other parts of the document is extracted normally.

^ g*yÑb€g –PQlSø 4
