V3 with new DOM API and Array Step Output Filter #176

Merged 2 commits on Dec 8, 2024
22 changes: 22 additions & 0 deletions .github/workflows/ci.yml
@@ -28,6 +28,28 @@ jobs:
       - name: Run integration tests
         run: composer test-integration

+  tests84:
+    name: PestPHP Tests Running only on PHP >= 8.4
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        php-versions: [ '8.4' ]
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install PHP
+        uses: shivammathur/setup-php@v2
+        with:
+          php-version: ${{ matrix.php-versions }}
+
+      - name: Install dependencies
+        run: composer install --prefer-dist --no-progress
+
+      - name: Run tests
+        run: composer test-php84
+
   stanAndCs:
     name: Static Analysis (phpstan) and Code Style (PHP CS Fixer)
     runs-on: ubuntu-latest
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [3.0.0] - 2024-12-08
+The primary change in version 3.0.0 is that the library now leverages PHP 8.4’s new DOM API when used in an environment with PHP >= 8.4. To maintain compatibility with PHP < 8.4, an abstraction layer has been implemented. This layer dynamically uses either the Symfony DomCrawler component or the new DOM API, depending on the PHP version.
+
+Since the steps provided by the library never required direct interaction with an instance of the Symfony DomCrawler library, it is highly likely that you won’t need to make any changes to your code to upgrade to v3. To ensure a smooth transition, please review the points under “Changed.”
+
+If you're using XPath queries for data extraction, please try to switch to using CSS selectors instead, because XPath is no longer supported by the new DOM API. Therefore, XPath-related functionality was deprecated in this version of the library and will probably be removed in the next major version.
+
+### Changed
+* __BREAKING__: The `DomQuery::innerText()` method (a.k.a. `Dom::cssSelector('...')->innerText()`) has been removed. `innerText` exists only in the Symfony DomCrawler component, and its usefulness is questionable. If you still require this variant of the DOM element text, please let us know or create a pull request yourself. Thank you!
+* __BREAKING__: The `DomQueryInterface` was removed. As the `DomQuery` class offers a lot more functionality than the interface defines, the purpose of the interface was questionable. Please use the abstract `DomQuery` class instead. This also means that some method signatures that type-hinted the interface have changed. Look for occurrences of `DomQueryInterface` and replace them.
+* __BREAKING__: The visibility of the `DomQuery::filter()` method was changed from public to protected. It is still needed in the `DomQuery` class, but outside of it, it is probably better and easier to directly use the new DOM abstraction (see the `src/Steps/Dom` directory). If you are extending the `DomQuery` class (which is not recommended), be aware that the argument now takes a `Node` (from the new DOM abstraction) instead of a Symfony `Crawler`.
+* __BREAKING__: The `Step::validateAndSanitizeToDomCrawlerInstance()` method was removed. Please use the `Step::validateAndSanitizeToHtmlDocumentInstance()` and `Step::validateAndSanitizeToXmlDocumentInstance()` methods instead.
+* __BREAKING__: The second argument in `Closure`s passed to `Http::crawl()->customFilter()` has changed from an instance of the Symfony `Crawler` class to an `HtmlElement` instance from the new DOM abstraction (`Crwlr\Crawler\Steps\Dom\HtmlElement`).
+* __BREAKING__: The Filter class was split into `AbstractFilter` (the base class for actual filter classes) and `Filter`, which now only hosts the static functions for easy instantiation. Otherwise, each filter class would also have all the static methods.
+* __BREAKING__: Further, the signatures of some methods that are mainly here for internal usage have changed due to the new DOM abstraction:
+  * The static `GetLink::isSpecialNonHttpLink()` method now needs an instance of `HtmlElement` instead of a Symfony `Crawler`.
+  * `GetUrlsFromSitemap::fixUrlSetTag()` now takes an `XmlDocument` instead of a Symfony `Crawler`.
+  * The `DomQuery::apply()` method now takes a `Node` instead of a Symfony `Crawler`.
+
+### Deprecated
+* `Dom::xPath()` method and
+* the `XPathQuery` class as well as
+* the new `Node::queryXPath()` method.
+
+### Added
+* New step output filter `Filter::arrayHasElement()`. When a step produces array output with a property being a numeric array, you can now filter outputs by checking if one element of that array property matches certain filter criteria. Example: The outputs look like `['foo' => 'bar', 'baz' => ['one', 'two', 'three']]`. You can filter all outputs where `baz` contains `two` like: `Filter::arrayHasElement()->where('baz', Filter::equal('two'))`.
+
 ## [2.1.3] - 2024-11-05
 ### Fixed
 * Improvements for deprecations in PHP 8.4.
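
The `Filter::arrayHasElement()` entry in the changelog describes keeping only those outputs whose array property contains at least one matching element. The following is a minimal, self-contained sketch of that idea in plain PHP. It does not use the library; the function name `arrayPropertyHasElement()` and the closure-based criterion are hypothetical stand-ins chosen for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the idea behind Filter::arrayHasElement()
 * (an assumption for illustration, not the library's implementation).
 *
 * Returns true when $output[$key] is an array and at least one of its
 * elements satisfies $criterion.
 *
 * @param array<string, mixed> $output
 * @param callable(mixed): bool $criterion
 */
function arrayPropertyHasElement(array $output, string $key, callable $criterion): bool
{
    if (!isset($output[$key]) || !is_array($output[$key])) {
        return false;
    }

    foreach ($output[$key] as $element) {
        if ($criterion($element)) {
            return true;
        }
    }

    return false;
}

// Sample outputs, shaped like the example in the changelog entry.
$outputs = [
    ['foo' => 'bar', 'baz' => ['one', 'two', 'three']],
    ['foo' => 'qux', 'baz' => ['four', 'five']],
];

// Keep only the outputs where `baz` contains an element equal to 'two'.
$kept = array_values(array_filter(
    $outputs,
    fn(array $output): bool => arrayPropertyHasElement(
        $output,
        'baz',
        fn(mixed $element): bool => $element === 'two',
    ),
));
```

In the library itself, the equivalent call is the one given in the changelog: `Filter::arrayHasElement()->where('baz', Filter::equal('two'))`.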
5 changes: 3 additions & 2 deletions composer.json
@@ -40,7 +40,7 @@
         "guzzlehttp/guzzle": "^7.4",
         "adbario/php-dot-notation": "^3.1",
         "chrome-php/chrome": "^1.7",
-        "crwlr/utils": "^1.1",
+        "crwlr/utils": "^1.2",
         "crwlr/html-2-text": "^0.1.0"
     },
     "require-dev": {
@@ -74,7 +74,8 @@
         }
     },
     "scripts": {
-        "test": "pest --exclude-group integration --display-warnings --bail",
+        "test": "pest --exclude-group integration,php84 --display-warnings --bail",
+        "test-php84": "pest --group php84 --display-warnings --bail",
         "test-integration": "pest --group integration --display-warnings --bail",
         "stan": "@php -d memory_limit=4G vendor/bin/phpstan analyse",
         "cs": "php-cs-fixer fix -v --dry-run",
9 changes: 9 additions & 0 deletions phpstan.neon
@@ -12,3 +12,12 @@ parameters:
         - "#^Access to an undefined property Spatie\\\\Invade\\\\Invader#"
         - "#^Call to an undefined method Spatie\\\\Invade\\\\Invader#"
         - "#^Call to protected method [a-zA-Z]{5,30}\\(\\) of class PHPUnit\\\\Framework\\\\TestCase.#"
+        - "#^(?:Parameter|Method) .+ has invalid (return )?type Dom\\\\.+\\.#"
+        - "#^Call to .+ on an unknown class Dom\\\\.+\\.#"
+        - "#^Property .+ has unknown class Dom\\\\.+ as its type\\.#"
+        - "#^Class Dom\\\\.+ not found.#"
+        - "#^Access to property .+ on an unknown class Dom\\\\.+\\.#"
+        - "#^PHPDoc tag .+ contains unknown class Dom\\\\.+\\.#"
+        - "#^Call to an undefined (static )?method Dom\\\\.+::.+\\(\\)\\.#"
+        - "#^Access to an undefined property Dom\\\\.+::\\$.+\\.#"
+        - "#^Function .+ has invalid return type Dom\\\\.+\\.#"
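
The ignore rules above all concern classes in the `Dom\` namespace, which only exist on PHP >= 8.4. The changelog describes an abstraction layer that picks its backend depending on the PHP version. A minimal sketch of such a version check might look like the following; the function name `preferredDomBackend()` and the returned labels are hypothetical and do not come from the library.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (an assumption for illustration, not library code):
 * reports which DOM backend a version-dependent abstraction layer could pick.
 *
 * The version is a parameter so the decision can be checked for any version;
 * it defaults to the version of the running interpreter.
 */
function preferredDomBackend(string $phpVersion = PHP_VERSION): string
{
    // version_compare() is PHP's built-in comparison for version strings.
    return version_compare($phpVersion, '8.4.0', '>=')
        ? 'php-dom-api'
        : 'symfony-dom-crawler';
}

// Prints the backend label for the interpreter running this script.
echo preferredDomBackend(), PHP_EOL;
```

Passing the version as an argument keeps the decision testable on any interpreter, which is the same concern the separate `tests84` CI job addresses for the real code.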
74 changes: 3 additions & 71 deletions src/Steps/BaseStep.php
@@ -10,10 +10,9 @@
 use Crwlr\Crawler\Output;
 use Crwlr\Crawler\Result;
 use Crwlr\Crawler\Steps\Exceptions\PreRunValidationException;
-use Crwlr\Crawler\Steps\Filters\FilterInterface;
+use Crwlr\Crawler\Steps\Filters\Filterable;
 use Crwlr\Crawler\Steps\Refiners\RefinerInterface;
 use Crwlr\Crawler\Utils\OutputTypeHelper;
-use Exception;
 use Generator;
 use InvalidArgumentException;
 use Psr\Log\LoggerInterface;
@@ -24,6 +23,8 @@

 abstract class BaseStep implements StepInterface
 {
+    use Filterable;
+
     /**
      * true means: keep the whole output array/object
      * string: keep that one key from the (array/object) output
@@ -69,11 +70,6 @@ abstract class BaseStep implements StepInterface
      */
     protected array $uniqueOutputKeys = [];

-    /**
-     * @var FilterInterface[]
-     */
-    protected array $filters = [];
-
     /**
      * @var array<Closure|RefinerInterface|array{ key: string, refiner: Closure|RefinerInterface}>
      */
@@ -183,47 +179,6 @@ public function uniqueOutputs(?string $key = null): static
         return $this;
     }

-    public function where(string|FilterInterface $keyOrFilter, ?FilterInterface $filter = null): static
-    {
-        if (is_string($keyOrFilter) && $filter === null) {
-            throw new InvalidArgumentException('You have to provide a Filter (instance of FilterInterface)');
-        } elseif (is_string($keyOrFilter)) {
-            if ($this->isOutputKeyAlias($keyOrFilter)) {
-                $keyOrFilter = $this->getOutputKeyAliasRealKey($keyOrFilter);
-            }
-
-            $filter->useKey($keyOrFilter);
-
-            $this->filters[] = $filter;
-        } else {
-            $this->filters[] = $keyOrFilter;
-        }
-
-        return $this;
-    }
-
-    /**
-     * @throws Exception
-     */
-    public function orWhere(string|FilterInterface $keyOrFilter, ?FilterInterface $filter = null): static
-    {
-        if (empty($this->filters)) {
-            throw new Exception('No where before orWhere');
-        } elseif (is_string($keyOrFilter) && $filter === null) {
-            throw new InvalidArgumentException('You have to provide a Filter (instance of FilterInterface)');
-        } elseif (is_string($keyOrFilter)) {
-            $filter->useKey($keyOrFilter);
-        } else {
-            $filter = $keyOrFilter;
-        }
-
-        $lastFilter = end($this->filters);
-
-        $lastFilter->addOr($filter);
-
-        return $this;
-    }
-
     public function refineOutput(
         string|Closure|RefinerInterface $keyOrRefiner,
         null|Closure|RefinerInterface $refiner = null,
@@ -539,29 +494,6 @@ protected function inputOrOutputIsUnique(Io $io): bool
         return true;
     }

-    protected function passesAllFilters(mixed $output): bool
-    {
-        foreach ($this->filters as $filter) {
-            if (!$filter->evaluate($output)) {
-                if ($filter->getOr()) {
-                    $orFilter = $filter->getOr();
-
-                    while ($orFilter) {
-                        if ($orFilter->evaluate($output)) {
-                            continue 2;
-                        }
-
-                        $orFilter = $orFilter->getOr();
-                    }
-                }
-
-                return false;
-            }
-        }
-
-        return true;
-    }
-
     protected function applyRefiners(mixed $outputValue, mixed $inputValue): mixed
     {
         foreach ($this->refiners as $refiner) {
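
The `where()`, `orWhere()` and `passesAllFilters()` code that this diff removes from `BaseStep` (replaced by the `Filterable` trait) combines filters as follows: every `where()` filter must pass, and a failing filter is rescued if any filter in its `orWhere()` chain passes. The sketch below illustrates that evaluation order in isolation. `SketchFilter` and `passesAll()` are hypothetical names, and having `addOr()` append to the end of the chain is an assumption, since the real filter classes are not part of this diff.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical, simplified filter: a predicate plus an optional chain of
 * OR alternatives. Not the library's filter class.
 */
final class SketchFilter
{
    private ?SketchFilter $or = null;

    public function __construct(private readonly Closure $predicate) {}

    public function evaluate(mixed $value): bool
    {
        return ($this->predicate)($value);
    }

    public function addOr(SketchFilter $filter): void
    {
        // Assumption: alternatives are appended to the end of the chain.
        if ($this->or === null) {
            $this->or = $filter;
        } else {
            $this->or->addOr($filter);
        }
    }

    public function getOr(): ?SketchFilter
    {
        return $this->or;
    }
}

/**
 * Every filter must pass, either by itself or through one of its OR alternatives.
 *
 * @param SketchFilter[] $filters
 */
function passesAll(array $filters, mixed $output): bool
{
    foreach ($filters as $filter) {
        $candidate = $filter;
        $passed = false;

        // Walk the filter and its OR chain until one of them accepts the output.
        while ($candidate !== null) {
            if ($candidate->evaluate($output)) {
                $passed = true;
                break;
            }

            $candidate = $candidate->getOr();
        }

        if (!$passed) {
            return false;
        }
    }

    return true;
}

// Equivalent of: where(value > 10)->orWhere(value === 0)->where(value is even)
$greaterThanTen = new SketchFilter(fn(int $v): bool => $v > 10);
$greaterThanTen->addOr(new SketchFilter(fn(int $v): bool => $v === 0));
$isEven = new SketchFilter(fn(int $v): bool => $v % 2 === 0);

$filters = [$greaterThanTen, $isEven];
```

With these filters, `12` and `0` pass, `11` fails because it is odd, and `4` fails because it is neither greater than ten nor zero.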