Better fix for non UTF-8 characters
Improved fix for non UTF-8 characters in HTML documents declared as UTF-8.
otsch authored Dec 11, 2024

1 parent 1993669 commit 4904ed4
Showing 3 changed files with 36 additions and 2 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [3.0.3] - 2024-12-11
### Fixed
* Improved fix for non UTF-8 characters in HTML documents declared as UTF-8.

## [3.0.2] - 2024-12-11
### Fixed
* When the new PHP 8.4 DOM API is used and HTML declared as UTF-8 contains characters that are not valid UTF-8, they are no longer replaced with a � character but removed instead. This behaviour is consistent with the data returned by Symfony DomCrawler.
30 changes: 30 additions & 0 deletions src/Steps/Dom/HtmlDocument.php
@@ -48,10 +48,40 @@ protected function makeChildNodeInstance(object $node): Node
     */
    protected function makeDocumentInstance(string $source): object
    {
        $source = $this->fixInvalidCharactersInSource($source);

        if (PhpVersion::isAtLeast(8, 4)) {
            return \Dom\HTMLDocument::createFromString($source, HTML_NO_DEFAULT_NS | LIBXML_NOERROR);
        }

        return new Crawler($source);
    }

    /**
     * If the document declares UTF-8 as its charset but contains byte sequences
     * that are not valid UTF-8, assumes the source is actually ISO-8859-1 and
     * converts it to UTF-8. Without the iconv extension the source is returned unchanged.
     */
    private function fixInvalidCharactersInSource(string $source): string
    {
        if (function_exists('iconv')) {
            // preg_match() with the /u modifier only matches when the subject is valid UTF-8.
            $charset = preg_match('//u', $source) ? 'UTF-8' : 'ISO-8859-1';

            preg_match('/(charset *= *["\']?)([a-zA-Z\-0-9_:.]+)/i', $source, $matches);

            if ($matches && !empty($matches[2])) {
                $declaredCharset = strtoupper($matches[2]);
            } else {
                $declaredCharset = null;
            }

            if ($charset === 'ISO-8859-1' && $declaredCharset === 'UTF-8') {
                $fixedSource = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $source);

                // iconv() returns false on failure; keep the original source in that case.
                if ($fixedSource !== false) {
                    $source = $fixedSource;
                }
            }
        }

        return $source;
    }
}
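
The detection-and-conversion idea in `fixInvalidCharactersInSource()` can be tried outside the library with a minimal sketch like the one below. The function name `convertLatin1DeclaredAsUtf8` and the sample HTML are hypothetical and not part of this commit, the charset regex is simplified to match only `utf-8`, and the sketch assumes PHP 8.0+ with an iconv implementation that supports `//TRANSLIT`.

```php
<?php

declare(strict_types=1);

// Hypothetical standalone helper; it mirrors the commit's approach but is not the library's code.
function convertLatin1DeclaredAsUtf8(string $source): string
{
    // With the /u modifier, preg_match() returns 1 only if the subject is valid UTF-8
    // (it returns false for malformed input).
    $isValidUtf8 = preg_match('//u', $source) === 1;

    // Simplified check for a declaration such as <meta charset="utf-8">.
    $declaresUtf8 = preg_match('/charset *= *["\']?utf-8/i', $source) === 1;

    if ($isValidUtf8 || !$declaresUtf8 || !function_exists('iconv')) {
        return $source;
    }

    // Assumption: bytes that are not valid UTF-8 are treated as ISO-8859-1.
    $converted = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', $source);

    return $converted === false ? $source : $converted;
}

// "\xB2" is the superscript two in ISO-8859-1; as a lone byte it is not valid UTF-8.
$html = "<html><head><meta charset=\"utf-8\"></head><body>0 l/m\xB2</body></html>";
$fixed = convertLatin1DeclaredAsUtf8($html);

var_dump(str_contains($fixed, "0 l/m\u{00B2}"));
```

Input that is already valid UTF-8 passes through unchanged, so the conversion only affects documents whose declared charset and actual bytes disagree.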
4 changes: 2 additions & 2 deletions tests/_Integration/Http/CharsetTest.php
@@ -26,7 +26,7 @@ protected function userAgent(): UserAgentInterface
}
}

-it('removes (and not replaces with broken ? replacement char) non utf-8 characters from extracted data', function () {
+it('Fixes non UTF-8 characters in HTML documents declared as UTF-8', function () {
    $crawler = new CharsetExampleCrawler();

    $crawler
@@ -37,5 +37,5 @@ protected function userAgent(): UserAgentInterface
    $results = helper_generatorToArray($crawler->run());

    expect($results)->toHaveCount(1)
-        ->and($results[0]->toArray())->toBe(['foo' => '0 l/m']);
+        ->and($results[0]->toArray())->toBe(['foo' => '0 l/m²']);
});
