Remove/change deprecated paginator stuff
Removes the `PaginatorInterface` and the old version of the
`AbstractPaginator`. Also removes an unnecessary argument from the
`processLoaded()` method and the default implementation of the
`getNextRequest()` method (child classes have to implement it
themselves).
otsch committed Aug 8, 2024
1 parent 26cbc92 commit 4a89632
Showing 11 changed files with 60 additions and 242 deletions.
15 changes: 8 additions & 7 deletions CHANGELOG.md
@@ -8,13 +8,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [2.0.0] - 2024-x-x
### Changed
-* __BREAKING__: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()` and `BaseStep::keepInputData()`. They have already been deprecated in v1.8.0 and shall be replaced with `Step::keep()` and `Step::keepAs()`, `Step::keepFromInput()` and `Step::keepInputAs()`.
-* __BREAKING__: As the `addToResult()` method was removed, the library does not use `toArrayForAddToResult()` methods on output objects any longer. Instead, please use `toArrayForResult()`. Therefore, also the `RespondedRequest::toArrayForAddToResult()` is renamed to `RespondedRequest::toArrayForResult()`.
-* __BREAKING__: Removed the `result` and `addLaterToResult` properties from `Io` objects (so `Input` and `Output`). They were part of the whole `addToResult` feature and are therefore removed. Instead, there is the `keep` property where kept data is added.
-* __BREAKING__: The return type of the `Crawler::loader()` method was changed to no longer allow `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the functionality described below, to directly provide a custom loader to a step.
-* __BREAKING__: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now just extend the `Step` class and use the trait. As it is no longer possible to have multiple loaders, the `addLoader` method was renamed to `setLoader`. For the same reason, the methods `useLoader()` and `usesLoader()`, to choose one of multiple loaders from the crawler by key, are removed. Instead, you can now directly provide a different loader to a single step (instead to the crawler), using the trait's new `withLoader()` method (e.g. `Http::get()->withLoader($loader)`).
-* __BREAKING__: The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class, providing the methods `only()` and `except()` that can be used to restrict retries to certain HTTP response status codes. Previously the method returned the `HttpLoader` itself (`$this`), so if you're using it in a chain and call other loader methods after it, you need to refactor this.
-* __BREAKING__: Removed the `Microseconds` class from this package. It was moved to the `crwlr/utils` package that you can use instead.
+* __BREAKING__: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()`, and `BaseStep::keepInputData()`. These methods were deprecated in v1.8.0 and should be replaced with `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()`, and `Step::keepInputAs()`.
+* __BREAKING__: With the removal of the `addToResult()` method, the library no longer uses `toArrayForAddToResult()` methods on output objects. Instead, please use `toArrayForResult()`. Consequently, `RespondedRequest::toArrayForAddToResult()` has been renamed to `RespondedRequest::toArrayForResult()`.
+* __BREAKING__: Removed the `result` and `addLaterToResult` properties from `Io` objects (`Input` and `Output`). These properties were part of the `addToResult` feature and are now removed. Instead, use the `keep` property where kept data is added.
+* __BREAKING__: The return type of the `Crawler::loader()` method no longer allows `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step described below.
+* __BREAKING__: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now extend the `Step` class and use the trait. As multiple loaders are no longer supported, the `addLoader` method was renamed to `setLoader`. Similarly, the methods `useLoader()` and `usesLoader()` for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new `withLoader()` method (e.g., `Http::get()->withLoader($loader)`).
+* __BREAKING__: Removed the `PaginatorInterface` to allow for better extensibility. The old `Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator` class has also been removed. Please use the newer, improved version `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator`. This newer version has also changed: the first argument `UriInterface $url` is removed from the `processLoaded()` method, as the URL also is part of the request (`Psr\Http\Message\RequestInterface`) which is now the first argument. Additionally, the default implementation of the `getNextRequest()` method is removed. Child implementations must define this method themselves. If your custom paginator still has a `getNextUrl()` method, note that it is no longer needed by the library and will not be called. The `getNextRequest()` method now fulfills its original purpose.
+* __BREAKING__: The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class. This class provides the methods `only()` and `except()` to restrict retries to specific HTTP response status codes. Previously, this method returned the `HttpLoader` itself (`$this`), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
+* __BREAKING__: Removed the `Microseconds` class from this package. It has been moved to the `crwlr/utils` package, which you can use instead.
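The paginator entry in the changelog implies a migration for custom paginators. The following is a minimal sketch of what a child class of the new `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator` could look like after this commit. The class name and the query-parameter pagination logic are made up for illustration, and it is assumed that `$latestRequest` and the `parent::processLoaded()` call are available to child classes (both appear in the diffs below).

```php
<?php

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Psr\Http\Message\RequestInterface;

class QueryParamPaginator extends AbstractPaginator
{
    private int $page = 1;

    // New signature: the separate UriInterface $url argument is gone;
    // the URL is available via $request->getUri() if needed.
    public function processLoaded(
        RequestInterface $request,
        ?RespondedRequest $respondedRequest,
    ): void {
        parent::processLoaded($request, $respondedRequest);

        $this->page++;
    }

    // In v2 this method is abstract, so every child class must implement
    // it. The old getNextUrl() method is no longer called by the library.
    public function getNextRequest(): ?RequestInterface
    {
        if (!$this->latestRequest) {
            return null;
        }

        $uri = $this->latestRequest->getUri();

        return $this->latestRequest->withUri(
            $uri->withQuery('page=' . $this->page),
        );
    }
}
```

Building the next request from `$this->latestRequest` preserves the method, headers and body of the previous request, which is the flexibility the changelog cites as the reason for preferring `getNextRequest()` over `getNextUrl()`.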

## [1.10.0] - 2024-08-05
### Added
3 changes: 1 addition & 2 deletions src/Steps/Loading/Http.php
@@ -8,7 +8,6 @@
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Crwlr\Crawler\Steps\Loading\Http\Paginate;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
-use Crwlr\Crawler\Steps\Loading\Http\PaginatorInterface;
use Crwlr\Crawler\Steps\StepOutputType;
use Crwlr\Crawler\Utils\Gzip;
use Exception;
@@ -106,7 +105,7 @@ public static function getBodyString(MessageInterface|RespondedRequest $message)
* @throws InvalidDomQueryException
*/
public function paginate(
-PaginatorInterface|AbstractPaginator|string $paginator,
+AbstractPaginator|string $paginator,
int $defaultPaginatorMaxPages = Paginator::MAX_PAGES_DEFAULT,
): Paginate {
if (is_string($paginator)) {
27 changes: 1 addition & 26 deletions src/Steps/Loading/Http/AbstractPaginator.php
@@ -8,7 +8,6 @@
use Crwlr\Crawler\Utils\RequestKey;
use Crwlr\Url\Url;
use Psr\Http\Message\RequestInterface;
-use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerInterface;

abstract class AbstractPaginator
@@ -32,7 +31,6 @@ abstract class AbstractPaginator
public function __construct(protected int $maxPages = Paginator::MAX_PAGES_DEFAULT) {}

public function processLoaded(
-UriInterface $url,
RequestInterface $request,
?RespondedRequest $respondedRequest,
): void {
@@ -70,29 +68,6 @@ public function stopWhen(Closure|StopRule $callback): self
return $this;
}

-/**
-* Default implementation of getNextRequest() that will be remove in v2.
-* Initially it was required that an implementation has a getNextUrl() method.
-* As paginating is not always only done via the URL, it's better to have a getNextRequest() method
-* to be more flexible. Until v2 of this library this method makes the next request, using the
-* getNextUrl() method. In v2 it will then be required, that Paginator implementations, implement
-* their own getNextRequest() method and getNextUrl() won't be required anymore.
-*/
-public function getNextRequest(): ?RequestInterface
-{
-if (!$this->latestRequest || !method_exists($this, 'getNextUrl')) {
-return null;
-}
-
-$nextUrl = $this->getNextUrl();
-
-if (!$nextUrl) {
-return null;
-}
-
-return $this->latestRequest->withUri(Url::parsePsr7($nextUrl));
-}
-
public function logWhenFinished(LoggerInterface $logger): void
{
if ($this->maxPagesReached()) {
@@ -105,7 +80,7 @@ public function logWhenFinished(LoggerInterface $logger): void
/**
* For v2. See above.
*/
-//abstract public function getNextRequest(): ?RequestInterface;
+abstract public function getNextRequest(): ?RequestInterface;

protected function registerLoadedRequest(RequestInterface|RespondedRequest $request): void
{
38 changes: 4 additions & 34 deletions src/Steps/Loading/Http/Paginate.php
@@ -3,9 +3,7 @@
namespace Crwlr\Crawler\Steps\Loading\Http;

use Crwlr\Crawler\Loader\Http\Exceptions\LoadingException;
-use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
-use Crwlr\Url\Url;
use Exception;
use Generator;
use Psr\Http\Message\RequestInterface;
@@ -15,7 +13,7 @@
class Paginate extends Http
{
public function __construct(
-protected Http\PaginatorInterface|AbstractPaginator $paginator,
+protected AbstractPaginator $paginator,
string $method = 'GET',
array $headers = [],
string|StreamInterface|null $body = null,
@@ -39,17 +37,13 @@ protected function invoke(mixed $input): Generator
}

try {
-$this->paginator->processLoaded($input, $request, $response);
+$this->paginator->processLoaded($request, $response);
} catch (Exception $exception) {
$this->logger?->error('Paginate Error: ' . $exception->getMessage());
}

while (!$this->paginator->hasFinished()) {
-if (!method_exists($this->paginator, 'getNextRequest')) { // Remove in v2
-$request = $this->getNextRequestLegacy($response);
-} else {
-$request = $this->paginator->getNextRequest();
-}
+$request = $this->paginator->getNextRequest();

if (!$request) {
break;
@@ -62,7 +56,7 @@
}

try {
-$this->paginator->processLoaded($request->getUri(), $request, $response);
+$this->paginator->processLoaded($request, $response);
} catch (Exception $exception) {
$this->logger?->error('Paginate Error: ' . $exception->getMessage());
}
@@ -94,28 +88,4 @@ protected function getRequestFromInput(mixed $input): RequestInterface

return $this->getRequestFromInputUri($input);
}

-/**
-* @deprecated Legacy method, remove in v2
-*/
-protected function getNextRequestLegacy(?RespondedRequest $previousResponse): ?RequestInterface
-{
-if (!method_exists($this->paginator, 'getNextUrl')) {
-return null;
-}
-
-$nextUrl = $this->paginator->getNextUrl();
-
-if (!$nextUrl) {
-return null;
-}
-
-$request = $this->getRequestFromInputUri(Url::parsePsr7($nextUrl));
-
-if (method_exists($this->paginator, 'prepareRequest')) {
-$request = $this->paginator->prepareRequest($request, $previousResponse);
-}
-
-return $request;
-}
}
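For callers, the narrowed constructor and `paginate()` signature mean a paginator is now passed either as an `AbstractPaginator` instance or as a string. A hedged usage sketch, the selector and max-pages value are made up for illustration; `Http::get()` is taken from the changelog and the `is_string()` branch from the `Http::paginate()` diff above:

```php
<?php

use Crwlr\Crawler\Steps\Loading\Http;

// A string is still accepted; the is_string() branch in
// Http::paginate() resolves it to a paginator internally.
$step = Http::get()->paginate('#next-link');

// The second argument caps the number of pages (illustrative value).
$capped = Http::get()->paginate('#next-link', 100);
```

Any `AbstractPaginator` instance can be passed in place of the string, which is what the removed `PaginatorInterface` union type previously allowed alongside it.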
34 changes: 0 additions & 34 deletions src/Steps/Loading/Http/PaginatorInterface.php

This file was deleted.

27 changes: 0 additions & 27 deletions src/Steps/Loading/Http/Paginators/AbstractPaginator.php

This file was deleted.

16 changes: 0 additions & 16 deletions src/Steps/Loading/Http/Paginators/SimpleWebsitePaginator.php
@@ -12,7 +12,6 @@
use Crwlr\Url\Url;
use Exception;
use Psr\Http\Message\RequestInterface;
-use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerInterface;
use Symfony\Component\DomCrawler\Crawler;

@@ -56,20 +55,6 @@ public function hasFinished(): bool
return $this->maxPagesReached() || empty($this->found) || $this->hasFinished;
}

-/**
-* Remove in v2.
-*/
-public function getNextUrl(): ?string
-{
-$found = array_shift($this->found);
-
-if (is_array($found)) {
-return $found['url'];
-}
-
-return null;
-}
-
public function getNextRequest(): ?RequestInterface
{
if (!$this->latestRequest) {
@@ -93,7 +78,6 @@
* @throws Exception
*/
public function processLoaded(
-UriInterface $url,
RequestInterface $request,
?RespondedRequest $respondedRequest,
): void {
