
Possibility to improve parsing performance by not using uniqid() and adding an early return? #712

Open
bernemann opened this issue May 15, 2024 · 8 comments


@bernemann

I parse PDFs where strings are encoded in the bracket notation like

[(Xxxx)1.7(a)1.2(tt )-10.1(\374)1.1(be)1.4(r )-8.5(Xxxx)-7.8(xx)-9.5(xxx)2.2(xxx)-1.1(x)-9(xxx)2( )]TJ

I have an idea to improve parsing performance, which reduced parsing time from 1.65 seconds to 0.54 seconds in my example:

First, I observed a significant performance increase when calling Page::getTextArray() on documents with strings like this, using a simpler placeholder in PDFObject::formatContent() such as

$id = "S_$i"; // where $i is a simple counting integer

instead of $id = uniqid('STRING_', true);

The reason is the numerous subsequent calls to str_replace() that replace the placeholders back with the original content. The complex placeholder generated by uniqid() seems to slow down these calls.

Is there any reason for using this complex placeholder or can it be replaced by a much simpler one?

Second, I noticed two (more or less) consecutive calls to PDFObject->getTextArray($this) in Page::getTextArray(), where the first one might immediately return:

try {
    $contents->getTextArray($this);
} catch (\Throwable $e) {
    return $contents->getTextArray();
}

Is there a reason for not returning directly in the try case, instead of issuing the same call again later?

return $contents->getTextArray($this);
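For illustration, the early return would look roughly like this (a sketch, assuming nothing between the try block and the final return depends on the first call being repeated):

try {
    // Return the result of the first call directly instead of calling
    // getTextArray($this) a second time at the end of the method.
    return $contents->getTextArray($this);
} catch (\Throwable $e) {
    // Fall back to the variant without the page context, as before.
    return $contents->getTextArray();
}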

Thank you for your time!

@k00ni
Collaborator

k00ni commented May 15, 2024

Sounds reasonable and interesting, so please create a pull request and let's discuss it there.

Is there any reason for using this complex placeholder or can it be replaced by a much simpler one?

Without a deeper look into the code, it doesn't seem necessary to use an expensive function such as uniqid() here. Instead, an incremented number seems to be enough.

Second, I recognized 2 (more or less) subsequent calls to ...

Because I don't know where we go with the first point, please use a separate PR for the second one. At least for now.

@GreyWyvern
Contributor

GreyWyvern commented May 15, 2024

The reason for the Unique ID is that the simpler the replacement string, the higher the chance it may appear elsewhere in the document and cause the wrong text to be restored later. For example, @@@S_2@@@ has a higher chance of appearing as regular document stream text than @@@STRING_b64b8a7ad6d4c@@@ etc. Maybe not all that much higher, but still higher.

That being said, if the uniqid() calls can be replaced with a system that uses less CPU, but still keeps the complexity of the replacement string, I'm all for that.

Edit: Wait, you're saying the str_replace() call later is what's causing the slowdown, because the search string is so long? Not uniqid()?

@bernemann
Author

Edit: Wait, you're saying the str_replace() call later is what's causing the slowdown, because the search string is so long? Not uniqid()?

Exactly, str_replace() is taking up to 75% of the parsing time in my profiling example because of the complexity of the search pattern.

@bernemann
Author

Understood, regarding the increased likelihood of unintended text replacements. What do you think about using preg_replace_callback() like

$content = preg_replace_callback("/@@@STRING_.*?@@@/",
    fn ($match) => $pdfstrings[substr($match[0], 3, -3)],
    $content
);

paired with more_entropy = false to slightly reduce complexity in

$id = uniqid('STRING_', false);

Seems to result in comparable performance gains.
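A quick way to compare the two restore strategies locally (a hypothetical micro-benchmark, not part of the library; $content and $pdfstrings are assumed to be the values built in PDFObject::formatContent()):

// Hypothetical micro-benchmark: run both restore strategies
// on copies of the same content stream and time them.
$a = $content;
$start = microtime(true);
foreach ($pdfstrings as $id => $text) {
    $a = str_replace('@@@'.$id.'@@@', $text, $a);
}
printf("str_replace loop:      %.3fs\n", microtime(true) - $start);

$b = $content;
$start = microtime(true);
$b = preg_replace_callback(
    '/@@@STRING_.*?@@@/',
    fn ($match) => $pdfstrings[substr($match[0], 3, -3)],
    $b
);
printf("preg_replace_callback: %.3fs\n", microtime(true) - $start);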

@k00ni
Collaborator

k00ni commented May 16, 2024

uniqid is only used in 3 places (all inside PDFObject).

$id = uniqid('IMAGE_', true);
$id = uniqid('STRING_', true);
$dictid = uniqid('DICT_', true);

The following code should produce a random-enough string but with less CPU usage (not benchmarked though!):

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7);

md2 is an old hash function with a couple of disadvantages (e.g. collisions, source), but it should be "secure" enough for our use case.

The code could be even simpler if an incremented integer value is used instead of rand(). If the value is globally unique, rand only adds further CPU usage without more uniqueness, doesn't it?

So we would end up with something like:

private function formatContent(?string $content): string
{
    $elementId = 0;

    // ...

    $hash = hash('md2', ++$elementId);
    // maybe length (5) can be even shorter?
    $id = 'IMAGE_'.substr($hash, 0, 5);

    // ...

}

Btw. uniqid has its problems too, as described here: https://www.php.net/manual/de/function.uniqid.php#120123

<?php
for($i=0;$i<20;$i++) {
    echo uniqid(), PHP_EOL;
}
?>

Output:

5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
...

Besides replacing uniqid, would it help to add a suffix, so the hash can be shorter?

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7).'_IMAGE_END';

What do you think about using preg_replace_callback() like ... paired with more_entropy = false to slightly reduce complexity in

I assume the reduction is because of more_entropy = false? That could be an option too, if we don't find a replacement at all. Please explain the preg_replace_callback part, I don't know what its benefits are here.

@Chandlr

Chandlr commented May 16, 2024

Why use a big function like a Message Digest hash when there's ELFHash, which is faster than CRC32 from what I have heard?
Example:

public function ELFHash($string, $len = null)
{
    $hash = 0;
    $len || $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        $hash = ($hash << 4) + ord($string[$i]);
        $x = $hash & 0xF0000000;
        if ($x != 0) {
            $hash ^= ($x >> 24);
        }
        $hash &= ~$x;
    }
    return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
}
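A hypothetical way this could plug into the placeholder scheme discussed above (the counter $i is an assumption, not existing library code):

// Hypothetical: hash an incrementing counter into a short,
// cheap-to-compute placeholder id.
$id = 'STRING_'.dechex($this->ELFHash((string) ++$i));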

I'm new to PHP and I found this (xxh3 is available since PHP 8.1):

<?php
$h = hash("xxh3", $data, options: ["seed" => 42]);
echo $h, "\n";
?>

@bernemann
Author

The performance loss is not caused by calling uniqid() (or any other hash-like function) but by the subsequent replacements using str_replace() on the whole content:

// Restore the original string content
$pdfstrings = array_reverse($pdfstrings, true);
foreach ($pdfstrings as $id => $text) {
    // Strings may contain escaped newlines, or literal newlines
    // and we should clean these up before replacing the string
    // back into the content stream; this ensures no strings are
    // split between two lines (every command must be on one line)
    $text = str_replace(
        ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
        ['', '', '', '\r', '\n'],
        $text
    );
    $content = str_replace('@@@'.$id.'@@@', $text, $content);
}

For every replacement, the call to str_replace('@@@'.$id.'@@@', $text, $content) inside the loop scans the whole $content for a rather complex placeholder like @@@STRING_blablablablabla.blablablablabl@@@ produced by earlier calls to uniqid().

My first idea was to simplify the placeholder, which led to a huge performance gain in my case, with the downside of an increased likelihood of unintended replacements (@@@S_1@@@ has a much higher chance of being real PDF content than @@@STRING_blablablablabla.blablablablabl@@@).

The second idea now is to slightly reduce the complexity of the placeholder with more_entropy = false, combined with the above-mentioned preg_replace_callback().
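Putting both together, the restore step could look roughly like this (a sketch, assuming the ids were created with uniqid('STRING_', false), which yields 13 lowercase hex digits after the prefix):

$content = preg_replace_callback(
    '/@@@(STRING_[0-9a-f]+)@@@/',
    function ($match) use ($pdfstrings) {
        // Apply the same newline cleanup as the current str_replace()
        // loop before restoring the string into the content stream.
        return str_replace(
            ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
            ['', '', '', '\r', '\n'],
            $pdfstrings[$match[1]]
        );
    },
    $content
);

This scans $content once for all placeholders instead of once per placeholder.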

I will do a few tests, compare performance and make a suggestion.

@GreyWyvern
Contributor

You might want to benchmark using preg_replace() with a limit of 1 instead of str_replace(), which can't be limited. The preg_replace() call would quit right after the first match, while str_replace() would continue to search through the whole document. That might save some processing?

Maybe even strtr() would be faster...
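For reference, a minimal sketch of the strtr() variant (building on the restore loop quoted above; strtr() with an array makes a single pass and never re-scans already-replaced text):

// Sketch: build one replacement map, then restore every
// placeholder in a single pass over $content.
$map = [];
foreach ($pdfstrings as $id => $text) {
    // Same newline cleanup as in the current loop.
    $map['@@@'.$id.'@@@'] = str_replace(
        ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
        ['', '', '', '\r', '\n'],
        $text
    );
}
$content = strtr($content, $map);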
