
Possibility to improve parsing performance by not using uniqid() and adding an early return? #712

Open
bernemann opened this issue May 15, 2024 · 8 comments


@bernemann

I parse PDFs where strings are encoded in the bracket notation like

[(Xxxx)1.7(a)1.2(tt )-10.1(\374)1.1(be)1.4(r )-8.5(Xxxx)-7.8(xx)-9.5(xxx)2.2(xxx)-1.1(x)-9(xxx)2( )]TJ

I have an idea to improve parsing performance, which reduced parsing time from 1.65 seconds to 0.54 seconds in my example:

First, I observed a significant performance increase when calling Page::getTextArray() on documents with strings like this, using a simpler placeholder in PDFObject::formatContent() such as

$id = "S_$i"; // where $i is a simple counting integer

instead of $id = uniqid('STRING_', true);

The reason is the numerous subsequent calls to str_replace() that replace the placeholders back with the original content. The complex placeholder generated by uniqid() seems to slow down these calls.

Is there any reason for using this complex placeholder or can it be replaced by a much simpler one?

Second, I noticed two (more or less) consecutive calls to PDFObject->getTextArray($this) in Page::getTextArray(), where the first one might immediately return:

try {
    $contents->getTextArray($this);
} catch (\Throwable $e) {
    return $contents->getTextArray();
}

Is there a reason for not returning directly in the try case, instead of issuing the same call again later?

return $contents->getTextArray($this);
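For illustration, the early return would look roughly like this (a sketch, assuming nothing between the try block and the final return depends on the first call being repeated):

try {
    // Return the result of the first call directly instead of calling
    // getTextArray($this) a second time at the end of the method.
    return $contents->getTextArray($this);
} catch (\Throwable $e) {
    // Fall back to the variant without the page context, as before.
    return $contents->getTextArray();
}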

Thank you for your time!

@k00ni
Collaborator

k00ni commented May 15, 2024

Sounds reasonable and interesting, so please create a pull request and let's discuss it there.

Is there any reason for using this complex placeholder or can it be replaced by a much simpler one?

Without a deeper look into the code, it doesn't seem necessary to use an expensive function such as uniqid() here. Instead, an incremented number seems to be enough.

Second, I recognized 2 (more or less) subsequent calls to ...

Because I don't know where we go with the first point, please use a separate PR for the second one. At least for now.

@GreyWyvern
Contributor

GreyWyvern commented May 15, 2024

The reason for the Unique ID is that the simpler the replacement string, the higher the chance it may appear elsewhere in the document and cause the wrong text to be restored later. For example, @@@S_2@@@ has a higher chance of appearing as regular document stream text than @@@STRING_b64b8a7ad6d4c@@@ etc. Maybe not all that much higher, but still higher.

That being said, if the uniqid() calls can be replaced with a system that uses less CPU, but still keeps the complexity of the replacement string, I'm all for that.

Edit: Wait, you're saying the str_replace() call later is what's causing the slowdown, because the search string is so long? Not uniqid()?

@bernemann
Author

Edit: Wait, you're saying the str_replace() call later is what's causing the slowdown, because the search string is so long? Not uniqid()?

Exactly, str_replace() is taking up to 75% of the parsing time in my profiling example because of the complexity of the search pattern.

@bernemann
Author

Understood, regarding the increased likelihood of unintended text replacements. What do you think about using preg_replace_callback() like

$content = preg_replace_callback("/@@@STRING_.*?@@@/",
    fn ($match) => $pdfstrings[substr($match[0], 3, -3)],
    $content
);

paired with more_entropy = false to slightly reduce complexity in

$id = uniqid('STRING_', false);

Seems to result in comparable performance gains.
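A quick way to compare the two restore strategies locally (a hypothetical micro-benchmark, not part of the library; $content and $pdfstrings are assumed to be the values built in PDFObject::formatContent()):

// Hypothetical micro-benchmark: run both restore strategies
// on copies of the same content stream and time them.
$a = $content;
$start = microtime(true);
foreach ($pdfstrings as $id => $text) {
    $a = str_replace('@@@'.$id.'@@@', $text, $a);
}
printf("str_replace loop:      %.3fs\n", microtime(true) - $start);

$b = $content;
$start = microtime(true);
$b = preg_replace_callback(
    '/@@@STRING_.*?@@@/',
    fn ($match) => $pdfstrings[substr($match[0], 3, -3)],
    $b
);
printf("preg_replace_callback: %.3fs\n", microtime(true) - $start);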

@k00ni
Collaborator

k00ni commented May 16, 2024

uniqid is only used in 3 places (all inside PDFObject).

$id = uniqid('IMAGE_', true);
$id = uniqid('STRING_', true);
$dictid = uniqid('DICT_', true);

The following code should produce a random-enough string but with less CPU usage (not benchmarked though!):

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7);

md2 is an old hash function with a couple of disadvantages (e.g. collisions, source), but it should be "secure" enough for our use case.

The code could be even simpler if an incremented integer value is used instead of rand(). If the value is globally unique, rand only adds further CPU usage without more uniqueness, doesn't it?

So we would end up with something like:

private function formatContent(?string $content): string
{
    $elementId = 0;

    // ...

    $hash = hash('md2', ++$elementId);
    // maybe length (5) can be even shorter?
    $id = 'IMAGE_'.substr($hash, 0, 5);

    // ...

}

Btw. uniqid has its problems too, as described here: https://www.php.net/manual/de/function.uniqid.php#120123

<?php
for($i=0;$i<20;$i++) {
    echo uniqid(), PHP_EOL;
}
?>

Output:

5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
...

Besides replacing uniqid, would it help to add a suffix, so the hash can be shorter?

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7).'_IMAGE_END';

What do you think about using preg_replace_callback() like ... paired with more_entropy = false to slightly reduce complexity in

I assume the reduction is because of more_entropy = false? That could be an option too, if we don't find a replacement at all. Please explain the preg_replace_callback part, I don't know what its benefits are here.

@Chandlr

Chandlr commented May 16, 2024

Why use a big function like a Message Digest hash when there's ELFHash, which is faster than CRC32 from what I have heard?
Example:

public function ELFHash($string, $len = null)
{
    $hash = 0;
    $len || $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        $hash = ($hash << 4) + ord($string[$i]);
        $x = $hash & 0xF0000000;
        if ($x != 0) {
            $hash ^= ($x >> 24);
        }
        $hash &= ~$x;
    }
    return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
}
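A hypothetical way this could plug into the placeholder scheme discussed above (the counter $i is an assumption, not existing library code):

// Hypothetical: hash an incrementing counter into a short,
// cheap-to-compute placeholder id.
$id = 'STRING_'.dechex($this->ELFHash((string) ++$i));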

I'm new to PHP and I found this (xxh3 is available since PHP 8.1):

<?php
$h = hash("xxh3", $data, options: ["seed" => 42]);
echo $h, "\n";
?>

@bernemann
Author

The performance loss is not caused by calling uniqid() (or any other hash-like function) but by the subsequent replacements using str_replace() on the whole content:

// Restore the original string content
$pdfstrings = array_reverse($pdfstrings, true);
foreach ($pdfstrings as $id => $text) {
    // Strings may contain escaped newlines, or literal newlines
    // and we should clean these up before replacing the string
    // back into the content stream; this ensures no strings are
    // split between two lines (every command must be on one line)
    $text = str_replace(
        ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
        ['', '', '', '\r', '\n'],
        $text
    );
    $content = str_replace('@@@'.$id.'@@@', $text, $content);
}

For every replacement, the call to str_replace('@@@'.$id.'@@@', $text, $content) inside the loop scans the whole $content for a rather complex placeholder like @@@STRING_blablablablabla.blablablablabl@@@ produced by earlier calls to uniqid().

My first idea was to simplify the placeholder, which led to a huge performance gain in my case, with the downside of an increased likelihood of unintended replacements (@@@S_1@@@ has a much higher chance of being real PDF content than @@@STRING_blablablablabla.blablablablabl@@@).

The second idea now is to slightly reduce the complexity of the placeholder with more_entropy = false, combined with the above-mentioned preg_replace_callback().
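Putting both together, the restore step could look roughly like this (a sketch, assuming the ids were created with uniqid('STRING_', false), which yields 13 lowercase hex digits after the prefix):

$content = preg_replace_callback(
    '/@@@(STRING_[0-9a-f]+)@@@/',
    function ($match) use ($pdfstrings) {
        // Apply the same newline cleanup as the current str_replace()
        // loop before restoring the string into the content stream.
        return str_replace(
            ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
            ['', '', '', '\r', '\n'],
            $pdfstrings[$match[1]]
        );
    },
    $content
);

This scans $content once for all placeholders instead of once per placeholder.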

I will do a few tests, compare performance and make a suggestion.

@GreyWyvern
Contributor

You might want to benchmark using preg_replace() with a limit of 1 instead of str_replace(), which can't be limited. The preg_replace() call would quit right after the first match, while str_replace() would continue to search through the whole document. That might save some processing?

Maybe even strtr() would be faster...
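For reference, a minimal sketch of the strtr() variant (building on the restore loop quoted above; strtr() with an array makes a single pass and never re-scans already-replaced text):

// Sketch: build one replacement map, then restore every
// placeholder in a single pass over $content.
$map = [];
foreach ($pdfstrings as $id => $text) {
    // Same newline cleanup as in the current loop.
    $map['@@@'.$id.'@@@'] = str_replace(
        ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
        ['', '', '', '\r', '\n'],
        $text
    );
}
$content = strtr($content, $map);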
