Make maxOutputs() work with Group steps #162

Merged
merged 3 commits into from
Sep 4, 2024
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -11,7 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
* __BREAKING__: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()`, and `BaseStep::keepInputData()`. These methods were deprecated in v1.8.0 and should be replaced with `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()`, and `Step::keepInputAs()`.
* __BREAKING__: With the removal of the `addToResult()` method, the library no longer uses `toArrayForAddToResult()` methods on output objects. Instead, please use `toArrayForResult()`. Consequently, `RespondedRequest::toArrayForAddToResult()` has been renamed to `RespondedRequest::toArrayForResult()`.
* __BREAKING__: Removed the `result` and `addLaterToResult` properties from `Io` objects (`Input` and `Output`). These properties were part of the `addToResult` feature and are now removed. Instead, use the `keep` property where kept data is added.
-* __BREAKING__: The return type of the `Crawler::loader()` method no longer allows `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step described below.
+* __BREAKING__: The signature of the `Crawler::addStep()` method has changed. You can no longer provide a result key as the first parameter. Previously, this key was passed to the `Step::addToResult()` method internally. Now, please handle this call yourself.
+* __BREAKING__: The return type of the `Crawler::loader()` method no longer allows `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step described below. As part of this change, the `UnknownLoaderKeyException` was also removed as it is now obsolete. If you have any references to this class, please make sure to remove them.
* __BREAKING__: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now extend the `Step` class and use the trait. As multiple loaders are no longer supported, the `addLoader` method was renamed to `setLoader`. Similarly, the methods `useLoader()` and `usesLoader()` for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new `withLoader()` method (e.g., `Http::get()->withLoader($loader)`).
* __BREAKING__: Removed the `PaginatorInterface` to allow for better extensibility. The old `Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator` class has also been removed. Please use the newer, improved version `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator`. This newer version has also changed: the first argument `UriInterface $url` is removed from the `processLoaded()` method, as the URL also is part of the request (`Psr\Http\Message\RequestInterface`) which is now the first argument. Additionally, the default implementation of the `getNextRequest()` method is removed. Child implementations must define this method themselves. If your custom paginator still has a `getNextUrl()` method, note that it is no longer needed by the library and will not be called. The `getNextRequest()` method now fulfills its original purpose.
* __BREAKING__: Removed methods from `HttpLoader`:
@@ -26,6 +27,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
* New methods `FileCache::prolong()` and `FileCache::prolongAll()` to allow prolonging the time to live for cached responses.

### Fixed
* The `maxOutputs()` method is now also available and working on `Group` steps.

## [1.10.0] - 2024-08-05
### Added
* URL refiners: `UrlRefiner::withScheme()`, `UrlRefiner::withHost()`, `UrlRefiner::withPort()`, `UrlRefiner::withoutPort()`, `UrlRefiner::withPath()`, `UrlRefiner::withQuery()`, `UrlRefiner::withoutQuery()`, `UrlRefiner::withFragment()` and `UrlRefiner::withoutFragment()`.
25 changes: 25 additions & 0 deletions src/Steps/BaseStep.php
@@ -81,6 +81,10 @@ abstract class BaseStep implements StepInterface

protected ?string $outputKey = null;

protected ?int $maxOutputs = null;

protected int $currentOutputCount = 0;

/**
* @param Input $input
* @return Generator<Output>
@@ -250,9 +254,18 @@ public function outputKey(string $key): static
return $this;
}

public function maxOutputs(int $maxOutputs): static
{
$this->maxOutputs = $maxOutputs;

return $this;
}

public function resetAfterRun(): void
{
$this->uniqueOutputKeys = $this->uniqueInputKeys = [];

$this->currentOutputCount = 0;
}

/**
@@ -717,4 +730,16 @@ protected function getOutputKeyAliasRealKey(string $key): string

return $mapping[$key];
}

protected function maxOutputsExceeded(): bool
{
return $this->maxOutputs !== null && $this->currentOutputCount >= $this->maxOutputs;
}

protected function trackYieldedOutput(): void
{
if ($this->maxOutputs !== null) {
$this->currentOutputCount += 1;
}
}
}
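
The counter logic moved into `BaseStep` above can be tried in isolation. The following is a minimal sketch, not the library's code: `LimitedEmitter` and its `run()` method are hypothetical names invented for illustration, and only the property names and the two helper methods mirror the diff. It shows the limit spanning several invocations until `resetAfterRun()` is called.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a step; NOT the library's BaseStep class.
final class LimitedEmitter
{
    private ?int $maxOutputs = null;

    private int $currentOutputCount = 0;

    public function maxOutputs(int $maxOutputs): static
    {
        $this->maxOutputs = $maxOutputs;

        return $this;
    }

    /**
     * Hypothetical method: yields the given values until the limit is reached.
     *
     * @param iterable<mixed> $values
     * @return Generator<mixed>
     */
    public function run(iterable $values): Generator
    {
        foreach ($values as $value) {
            if ($this->maxOutputsExceeded()) {
                break;
            }

            yield $value;

            // Runs only once the consumer asks for the next value.
            $this->trackYieldedOutput();
        }
    }

    public function resetAfterRun(): void
    {
        $this->currentOutputCount = 0;
    }

    private function maxOutputsExceeded(): bool
    {
        return $this->maxOutputs !== null && $this->currentOutputCount >= $this->maxOutputs;
    }

    private function trackYieldedOutput(): void
    {
        if ($this->maxOutputs !== null) {
            $this->currentOutputCount += 1;
        }
    }
}

$emitter = (new LimitedEmitter())->maxOutputs(2);

// The count is shared across calls, so the second call yields nothing.
$first = iterator_to_array($emitter->run(['a', 'b', 'c']), false);
$second = iterator_to_array($emitter->run(['d']), false);

// After a reset the limit applies from zero again.
$emitter->resetAfterRun();
$third = iterator_to_array($emitter->run(['e']), false);
```

In this sketch `$first` holds the first two values, `$second` is empty, and `$third` holds one value again after the reset.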
21 changes: 21 additions & 0 deletions src/Steps/Group.php
@@ -69,6 +69,10 @@ public function addStep(StepInterface $step): self
$step->setLoader($this->loader);
}

if ($this->maxOutputs) {
$step->maxOutputs($this->maxOutputs);
}

$this->steps[] = $step;

return $this;
@@ -98,6 +102,17 @@ public function setLoader(LoaderInterface $loader): self
return $this;
}

public function maxOutputs(int $maxOutputs): static
{
parent::maxOutputs($maxOutputs);

foreach ($this->steps as $step) {
$step->maxOutputs($maxOutputs);
}

return $this;
}

public function outputType(): StepOutputType
{
return StepOutputType::AssociativeArrayOrObject;
@@ -144,6 +159,10 @@ private function addOutputToCombinedOutputs(
private function prepareCombinedOutputs(array $combinedOutputs, Input $input): Generator
{
foreach ($combinedOutputs as $combinedOutput) {
if ($this->maxOutputsExceeded()) {
break;
}

$outputData = $this->normalizeCombinedOutputs($combinedOutput);

$outputData = $this->applyRefiners($outputData, $input->get());
@@ -156,6 +175,8 @@ private function prepareCombinedOutputs(array $combinedOutputs, Input $input): G
}

yield $output;

$this->trackYieldedOutput();
}
}
}
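
The `Group` changes forward the limit to child steps in both call orders: steps added earlier receive it from `maxOutputs()`, steps added later receive it from `addStep()`. A minimal sketch of that propagation follows. `SketchStep` and `SketchGroup` are hypothetical classes written for this illustration (they are not part of the library), and the property is public only so it can be inspected.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a step; NOT the library's Step class.
class SketchStep
{
    public ?int $maxOutputs = null;

    public function maxOutputs(int $maxOutputs): static
    {
        $this->maxOutputs = $maxOutputs;

        return $this;
    }
}

// Hypothetical stand-in for a group; mirrors only the propagation in the diff.
final class SketchGroup extends SketchStep
{
    /** @var SketchStep[] */
    private array $steps = [];

    public function addStep(SketchStep $step): static
    {
        // Same truthiness check as in the diff: null and 0 both skip this.
        if ($this->maxOutputs) {
            $step->maxOutputs($this->maxOutputs);
        }

        $this->steps[] = $step;

        return $this;
    }

    public function maxOutputs(int $maxOutputs): static
    {
        parent::maxOutputs($maxOutputs);

        // Forward the limit to every step that was already added.
        foreach ($this->steps as $step) {
            $step->maxOutputs($maxOutputs);
        }

        return $this;
    }
}

// One step added before the limit is set, one after.
$early = new SketchStep();
$late = new SketchStep();

(new SketchGroup())
    ->addStep($early)
    ->maxOutputs(2)
    ->addStep($late);

// A limit of 0 set before addStep() is falsy and is not forwarded.
$zero = new SketchStep();

(new SketchGroup())
    ->maxOutputs(0)
    ->addStep($zero);
```

Both `$early` and `$late` end up with the limit of 2. The last case shows a side effect of the truthiness check in `addStep()`: a limit of `0` set before `addStep()` is not passed on, so `$zero->maxOutputs` stays `null`. The PR does not say whether a zero limit is a meaningful input, so this is an observation about the check rather than a confirmed problem.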
27 changes: 1 addition & 26 deletions src/Steps/Step.php
@@ -22,10 +22,6 @@ abstract class Step extends BaseStep

protected bool $excludeFromGroupOutput = false;

protected ?int $maxOutputs = null;

protected int $currentOutputCount = 0;

/**
* @return Generator<mixed>
*/
@@ -79,13 +75,6 @@ public function shouldOutputBeExcludedFromGroupOutput(): bool
return $this->excludeFromGroupOutput;
}

public function maxOutputs(int $maxOutputs): static
{
$this->maxOutputs = $maxOutputs;

return $this;
}

/**
* If the user set a callback to update the input (see above) => call it.
*/
@@ -100,13 +89,6 @@ public function callUpdateInputUsingOutput(Input $input, Output $output): Input
return $input;
}

public function resetAfterRun(): void
{
parent::resetAfterRun();

$this->currentOutputCount = 0;
}

/**
* Validate and sanitize the incoming Input object
*
@@ -235,17 +217,10 @@ private function invokeAndYield(mixed $validInputValue, Input $input): Generator

yield $output;

-if ($this->maxOutputs !== null) {
-$this->currentOutputCount += 1;
-}
+$this->trackYieldedOutput();
}
}

-private function maxOutputsExceeded(): bool
-{
-return $this->maxOutputs !== null && $this->currentOutputCount >= $this->maxOutputs;
-}
-
/**
* Sometimes there can be a so-called byte order mark character as first characters in a text file. See:
* https://stackoverflow.com/questions/53303571/why-does-the-filereader-stream-read-239-187-191-from-a-textfile
2 changes: 2 additions & 0 deletions src/Steps/StepInterface.php
@@ -30,5 +30,7 @@ public function orWhere(string|FilterInterface $keyOrFilter, ?FilterInterface $f

public function outputKey(string $key): static;

public function maxOutputs(int $maxOutputs): static;

public function resetAfterRun(): void;
}
1 change: 0 additions & 1 deletion tests/Steps/BaseStepTest.php
@@ -19,7 +19,6 @@
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;
use InvalidArgumentException;

use PHPUnit\Framework\TestCase;

use function tests\helper_getInputReturningStep;
75 changes: 74 additions & 1 deletion tests/Steps/GroupTest.php
@@ -166,7 +166,7 @@ protected function invoke(mixed $input): Generator
});

test(
-'When defining keys for the steps as first param in addStep call, the combined output array has those keys',
+'When defining keys for the steps via $step->outputKey(), the combined output array has those keys',
function () {
$step1 = helper_getValueReturningStep('ich');

@@ -789,3 +789,76 @@ function () {
expect($outputs[0]->get())->toBe(['yo' => 'lo']);
},
);

it('stops calling its steps and producing outputs when maxOutputs is reached', function () {
$step1 = new class extends Step {
public int $called = 0;

protected function invoke(mixed $input): Generator
{
yield ['foo' => 'one'];

$this->called++;
}
};

$step2 = new class extends Step {
public int $called = 0;

protected function invoke(mixed $input): Generator
{
yield ['bar' => 'two'];

$this->called++;
}
};

$group = (new Group())
->addStep($step1)
->addStep($step2)
->maxOutputs(2);

expect(helper_invokeStepWithInput($group, 'hey'))->toHaveCount(1)
->and(helper_invokeStepWithInput($group, 'ho'))->toHaveCount(1)
->and(helper_invokeStepWithInput($group, 'hey'))->toHaveCount(0)
->and($step1->called)->toBe(2)
->and($step2->called)->toBe(2);
});

it(
'also stops creating outputs when maxOutputs is reached, when maxOutputs() was called before addStep()',
function () {
$step1 = new class extends Step {
public int $called = 0;

protected function invoke(mixed $input): Generator
{
yield ['foo' => 'one'];

$this->called++;
}
};

$step2 = new class extends Step {
public int $called = 0;

protected function invoke(mixed $input): Generator
{
yield ['bar' => 'two'];

$this->called++;
}
};

$group = (new Group())
->maxOutputs(2)
->addStep($step1)
->addStep($step2);

expect(helper_invokeStepWithInput($group, 'hey'))->toHaveCount(1)
->and(helper_invokeStepWithInput($group, 'ho'))->toHaveCount(1)
->and(helper_invokeStepWithInput($group, 'hey'))->toHaveCount(0)
->and($step1->called)->toBe(2)
->and($step2->called)->toBe(2);
},
);
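
The anonymous steps in these tests increment `$called` after their `yield`, so the counter only moves when the generator is resumed past that point. The sketch below shows that PHP generator behaviour on its own. `CountingSource` is a hypothetical class made up for this illustration, and how the library's `Step` drives `invoke()` internally is not shown in this diff, so this covers only the language semantics the counters appear to rely on.

```php
<?php

declare(strict_types=1);

// Hypothetical class that mimics the shape of the anonymous test steps.
final class CountingSource
{
    public int $called = 0;

    public function invoke(): Generator
    {
        yield ['foo' => 'one'];

        // Reached only when the consumer resumes the generator after the yield.
        $this->called++;
    }
}

// Fully draining the generator runs the code after the yield.
$drained = new CountingSource();
$outputs = iterator_to_array($drained->invoke(), false);

// Reading only the current value stops at the yield.
$peeked = new CountingSource();
$generator = $peeked->invoke();
$firstOnly = $generator->current();

// Creating a generator without iterating it executes none of the body.
$neverStarted = new CountingSource();
$unused = $neverStarted->invoke();
```

Only `$drained` ends with `called` at 1. The other two stay at 0, which suggests that a `called` value of 2 after three group invocations means the child generators were fully consumed twice and not at all (or not past the `yield`) on the third.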
1 change: 0 additions & 1 deletion tests/Steps/HtmlTest.php
@@ -7,7 +7,6 @@
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Html\GetLink;
use Crwlr\Crawler\Steps\Html\GetLinks;

use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;

@@ -7,7 +7,6 @@
use Crwlr\Crawler\Steps\Loading\Http\Paginators\SimpleWebsitePaginator;
use GuzzleHttp\Psr7\Response;
use PHPUnit\Framework\TestCase;

use Psr\Http\Message\RequestInterface;

use function tests\helper_getRespondedRequest;
1 change: 0 additions & 1 deletion tests/Steps/StepTest.php
@@ -16,7 +16,6 @@
use GuzzleHttp\Psr7\Response;
use InvalidArgumentException;
use PHPUnit\Framework\TestCase;

use stdClass;

use function tests\helper_getInputReturningStep;
1 change: 0 additions & 1 deletion tests/Steps/XmlTest.php
@@ -5,7 +5,6 @@
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Xml;

use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;

1 change: 0 additions & 1 deletion tests/_Integration/GroupTest.php
@@ -9,7 +9,6 @@
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

use Psr\Log\LoggerInterface;

use function tests\helper_generatorToArray;
1 change: 0 additions & 1 deletion tests/_Integration/Http/HeadlessBrowserTest.php
@@ -11,7 +11,6 @@
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Generator;
use Psr\Log\LoggerInterface;

use Symfony\Component\DomCrawler\Crawler;

use function tests\helper_generatorToArray;
1 change: 0 additions & 1 deletion tests/_Integration/Http/Html/PaginatedListingTest.php
@@ -8,7 +8,6 @@
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

use Psr\Log\LoggerInterface;

use function tests\helper_generatorToArray;
1 change: 0 additions & 1 deletion tests/_Integration/Http/Html/SimpleListingTest.php
@@ -8,7 +8,6 @@
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

use Psr\Log\LoggerInterface;

use function tests\helper_generatorToArray;