Remove addToResult() and multiple loaders
Remove the methods addToResult(), addLaterToResult() and everything
connected to that.

Also, it is no longer possible to provide multiple loaders via the
crawler. Instead you can now manually provide customized loaders
directly to steps via the new withLoader() method.

Further, also remove the Microseconds util class, which is now part of
the crwlr/utils package.
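
The loader change described above can be sketched in a few lines of plain PHP. This is an illustration only, not the library's code: `SketchLoader`, `NamedLoader`, `SketchLoadingStep`, `SketchGetStep` and `addStepToCrawler()` are hypothetical stand-ins, and the rule that a loader passed to `withLoader()` takes precedence over the crawler's default loader is an assumption. Only the method names `setLoader()` and `withLoader()` and the `method_exists()` check come from this commit.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for illustration; these are NOT the crwlr/crawler classes.
interface SketchLoader
{
    public function load(string $url): string;
}

final class NamedLoader implements SketchLoader
{
    public function __construct(private string $name) {}

    public function load(string $url): string
    {
        return $this->name . ' loaded ' . $url;
    }
}

trait SketchLoadingStep
{
    private ?SketchLoader $loader = null;       // default loader, set by the crawler
    private ?SketchLoader $customLoader = null; // per-step loader, set by the user

    public function setLoader(SketchLoader $loader): static
    {
        $this->loader = $loader;

        return $this;
    }

    public function withLoader(SketchLoader $loader): static
    {
        $this->customLoader = $loader;

        return $this;
    }

    // Assumption: the per-step loader wins over the crawler's default loader.
    protected function getLoader(): SketchLoader
    {
        return $this->customLoader ?? $this->loader ?? throw new LogicException('No loader set');
    }
}

final class SketchGetStep
{
    use SketchLoadingStep;

    public function run(string $url): string
    {
        return $this->getLoader()->load($url);
    }
}

// Mirrors the new addStep() logic: the single crawler loader is handed
// only to steps that expose a setLoader() method.
function addStepToCrawler(object $step, SketchLoader $crawlerLoader): object
{
    if (method_exists($step, 'setLoader')) {
        $step->setLoader($crawlerLoader);
    }

    return $step;
}

$default = new NamedLoader('default');
$plain = addStepToCrawler(new SketchGetStep(), $default);
$custom = addStepToCrawler((new SketchGetStep())->withLoader(new NamedLoader('custom')), $default);

echo $plain->run('https://example.com/a'), PHP_EOL;
echo $custom->run('https://example.com/b'), PHP_EOL;
```

With the real library, the equivalent call would look like the one given in the changelog entry: `Http::get()->withLoader($loader)`.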
otsch committed Aug 7, 2024
1 parent 3238240 commit d81f5f2
Showing 45 changed files with 637 additions and 2,293 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -8,7 +8,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [2.0.0] - 2024-x-x
### Changed
+* __BREAKING__: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()` and `BaseStep::keepInputData()`. They have already been deprecated in v1.8.0 and shall be replaced with `Step::keep()` and `Step::keepAs()`, `Step::keepFromInput()` and `Step::keepInputAs()`.
+* __BREAKING__: As the `addToResult()` method was removed, the library does not use `toArrayForAddToResult()` methods on output objects any longer. Instead, please use `toArrayForResult()`. Therefore, also the `RespondedRequest::toArrayForAddToResult()` is renamed to `RespondedRequest::toArrayForResult()`.
+* __BREAKING__: Removed the `result` and `addLaterToResult` properties from `Io` objects (so `Input` and `Output`). They were part of the whole `addToResult` feature and are therefore removed. Instead, there is the `keep` property where kept data is added.
+* __BREAKING__: The return type of the `Crawler::loader()` method was changed to no longer allow `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the functionality described below, to directly provide a custom loader to a step.
+* __BREAKING__: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now just extend the `Step` class and use the trait. As it is no longer possible to have multiple loaders, the `addLoader` method was renamed to `setLoader`. For the same reason, the methods `useLoader()` and `usesLoader()`, to choose one of multiple loaders from the crawler by key, are removed. Instead, you can now directly provide a different loader to a single step (instead to the crawler), using the trait's new `withLoader()` method (e.g. `Http::get()->withLoader($loader)`).
* __BREAKING__: The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class, providing the methods `only()` and `except()` that can be used to restrict retries to certain HTTP response status codes. Previously the method returned the `HttpLoader` itself (`$this`), so if you're using it in a chain and call other loader methods after it, you need to refactor this.
+* __BREAKING__: Removed the `Microseconds` class from this package. It was moved to the `crwlr/utils` package that you can use instead.

## [1.10.0] - 2024-08-05
### Added
4 changes: 2 additions & 2 deletions composer.json
@@ -68,8 +68,8 @@
}
},
"scripts": {
-"test": "pest --exclude-group integration --display-warnings",
-"test-integration": "pest --group integration --display-warnings",
+"test": "pest --exclude-group integration --display-warnings --bail",
+"test-integration": "pest --group integration --display-warnings --bail",
"stan": "@php -d memory_limit=4G vendor/bin/phpstan analyse",
"cs": "php-cs-fixer fix -v --dry-run",
"cs-fix": "php-cs-fixer fix -v",
126 changes: 21 additions & 105 deletions src/Crawler.php
@@ -3,14 +3,11 @@
namespace Crwlr\Crawler;

use Closure;
-use Crwlr\Crawler\Exceptions\UnknownLoaderKeyException;
-use Crwlr\Crawler\Loader\AddLoadersToStepAction;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\Logger\CliLogger;
use Crwlr\Crawler\Steps\BaseStep;
use Crwlr\Crawler\Steps\Exceptions\PreRunValidationException;
use Crwlr\Crawler\Steps\Group;
-use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepInterface;
use Crwlr\Crawler\Stores\StoreInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
@@ -24,9 +21,9 @@ abstract class Crawler
protected UserAgentInterface $userAgent;

/**
-* @var LoaderInterface|array<string, LoaderInterface>
+* @var LoaderInterface
*/
-protected LoaderInterface|array $loader;
+protected LoaderInterface $loader;

protected LoggerInterface $logger;

@@ -68,9 +65,9 @@ abstract protected function userAgent(): UserAgentInterface;
/**
* @param UserAgentInterface $userAgent
* @param LoggerInterface $logger
-* @return LoaderInterface|array<string, LoaderInterface>
+* @return LoaderInterface
*/
-abstract protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array;
+abstract protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface;

public static function group(): Group
{
@@ -146,24 +143,17 @@ public function inputs(array $inputs): static
}

/**
-* @param string|StepInterface $stepOrResultKey
-* @param StepInterface|null $step
+* @param StepInterface $step
* @return $this
-* @throws InvalidArgumentException|UnknownLoaderKeyException
+* @throws InvalidArgumentException
*/
-public function addStep(string|StepInterface $stepOrResultKey, ?StepInterface $step = null): static
+public function addStep(StepInterface $step): static
{
-if (is_string($stepOrResultKey) && $step === null) {
-throw new InvalidArgumentException('No StepInterface object provided');
-} elseif (is_string($stepOrResultKey)) {
-$step->addToResult($stepOrResultKey);
-} else {
-$step = $stepOrResultKey;
-}
-
$step->addLogger($this->logger);

-(new AddLoadersToStepAction($this->loader, $step))->invoke();
+if (method_exists($step, 'setLoader')) {
+$step->setLoader($this->loader);
+}

if ($step instanceof BaseStep) {
$step->setParentCrawler($this);
@@ -266,8 +256,8 @@ protected function invokeStepsRecursive(Input $input, StepInterface $step, int $

$nextStep = $this->nextStep($stepIndex);

-if (!$nextStep && $input->result === null) {
-yield from $this->storeAndReturnResults($outputs, $step->createsResult() === true, true);
+if (!$nextStep) {
+yield from $this->storeAndReturnOutputsAsResults($outputs);

return;
}
@@ -279,85 +269,22 @@ protected function invokeStepsRecursive(Input $input, StepInterface $step, int $

$this->outputHook?->call($this, $output, $stepIndex, $step);

-if ($nextStep) {
-if ($input->result === null && $step->createsResult()) {
-$childOutputs = $this->invokeStepsRecursive(
-new Input($output),
-$nextStep,
-$stepIndex + 1,
-);
-
-/** @var Generator<Output> $childOutputs */
-
-yield from $this->storeAndReturnResults($childOutputs, true);
-} else {
-yield from $this->invokeStepsRecursive(
-new Input($output),
-$nextStep,
-$stepIndex + 1,
-);
-}
-} else {
-yield $output;
-}
-}
-}
-
-/**
-* @param Generator<Output> $outputs
-* @return Generator<Result>
-*/
-protected function storeAndReturnResults(
-Generator $outputs,
-bool $manuallyDefinedResults = false,
-bool $callOutputHook = false,
-): Generator {
-if ($manuallyDefinedResults || $this->anyResultKeysDefinedInSteps()) {
-yield from $this->storeAndReturnDefinedResults($outputs, $callOutputHook);
-} else {
-yield from $this->storeAndReturnOutputsAsResults($outputs, $callOutputHook);
-}
-}
-
-/**
-* @param Generator<Output> $outputs
-* @return Generator<Result>
-*/
-protected function storeAndReturnDefinedResults(Generator $outputs, bool $callOutputHook = false): Generator
-{
-$results = [];
-
-foreach ($outputs as $output) {
-if ($callOutputHook) {
-$this->outputHook?->call($this, $output, count($this->steps) - 1, end($this->steps));
-}
-
-if ($output->result !== null && !in_array($output->result, $results, true)) {
-$results[] = $output->result;
-} elseif ($output->addLaterToResult !== null && !in_array($output->addLaterToResult, $results, true)) {
-$results[] = new Result($output->addLaterToResult);
-}
-}
-
-// yield results only after iterating over final outputs, because that could still add properties to result
-// resources.
-foreach ($results as $result) {
-$this->store?->store($result);
-
-yield $result;
+yield from $this->invokeStepsRecursive(
+new Input($output),
+$nextStep,
+$stepIndex + 1,
+);
}
}

/**
* @param Generator<Output> $outputs
* @return Generator<Result>
*/
-protected function storeAndReturnOutputsAsResults(Generator $outputs, bool $callOutputHook = false): Generator
+protected function storeAndReturnOutputsAsResults(Generator $outputs): Generator
{
foreach ($outputs as $output) {
-if ($callOutputHook) {
-$this->outputHook?->call($this, $output, count($this->steps) - 1, end($this->steps));
-}
+$this->outputHook?->call($this, $output, count($this->steps) - 1, end($this->steps));

$result = new Result();

@@ -420,17 +347,6 @@ protected function prepareInput(): array
}, $this->inputs);
}

-protected function anyResultKeysDefinedInSteps(): bool
-{
-foreach ($this->steps as $step) {
-if ($step->addsToOrCreatesResult()) {
-return true;
-}
-}
-
-return false;
-}
-
protected function logMemoryUsage(): void
{
$memoryUsage = memory_get_usage();
@@ -445,11 +361,11 @@ protected function firstStep(): ?StepInterface
return $this->steps[0] ?? null;
}

-protected function lastStep(): ?Step
+protected function lastStep(): ?BaseStep
{
$lastStep = end($this->steps);

-if (!$lastStep instanceof Step) {
+if (!$lastStep instanceof BaseStep) {
return null;
}

7 changes: 0 additions & 7 deletions src/Exceptions/UnknownLoaderKeyException.php

This file was deleted.

4 changes: 2 additions & 2 deletions src/HttpCrawler.php
@@ -15,9 +15,9 @@
abstract class HttpCrawler extends Crawler
{
/**
-* @return LoaderInterface|array<string, LoaderInterface>
+* @return LoaderInterface
*/
-protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
+protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
{
return new HttpLoader($userAgent, logger: $logger);
}
8 changes: 1 addition & 7 deletions src/Io.php
@@ -13,24 +13,18 @@ class Io
*/
final public function __construct(
protected mixed $value,
-public ?Result $result = null,
-public ?Result $addLaterToResult = null,
public array $keep = [],
) {
if ($value instanceof self) {
$this->value = $value->value;

-$this->result ??= $value->result;
-
-$this->addLaterToResult ??= $value->addLaterToResult;
-
$this->keep = $value->keep;
}
}

public function withValue(mixed $value): static
{
-return new static($value, $this->result, $this->addLaterToResult, $this->keep);
+return new static($value, $this->keep);
}

public function withPropertyValue(string $key, mixed $value): static
63 changes: 0 additions & 63 deletions src/Loader/AddLoadersToStepAction.php

This file was deleted.

5 changes: 4 additions & 1 deletion src/Loader/Http/Messages/RespondedRequest.php
@@ -2,6 +2,7 @@

namespace Crwlr\Crawler\Loader\Http\Messages;

+use Crwlr\Crawler\Cache\Exceptions\MissingZlibExtensionException;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Utils\RequestKey;
use Crwlr\Url\Url;
@@ -57,6 +58,7 @@ public static function cacheKeyFromRequest(RequestInterface $request): string

/**
* @return mixed[]
+* @throws MissingZlibExtensionException
*/
public function __serialize(): array
{
@@ -74,8 +76,9 @@ public function __serialize(): array

/**
* @return mixed[]
+* @throws MissingZlibExtensionException
*/
-public function toArrayForAddToResult(): array
+public function toArrayForResult(): array
{
$serialized = $this->__serialize();

8 changes: 0 additions & 8 deletions src/Loader/Http/Politeness/TimingUnits/Microseconds.php

This file was deleted.

10 changes: 0 additions & 10 deletions src/Loader/Loader.php
@@ -78,16 +78,6 @@ protected function isAllowedToBeLoaded(UriInterface $uri, bool $throwsException
return true;
}

-/**
-* Can be implemented in a child class to track how long a request waited for its response.
-*/
-protected function trackRequestStart(?float $microtime = null): void {}
-
-/**
-* Can be implemented in a child class to track how long a request waited for its response.
-*/
-protected function trackRequestEnd(?float $microtime = null): void {}
-
protected function callHook(string $hook, mixed ...$arguments): void
{
if (!array_key_exists($hook, $this->hooks)) {