Releases: crwlrsoft/crawler

v3.0.4

17 Dec 23:12

Fixed

  • Minor improvement for the DomQuery (base for Dom::cssSelector() and Dom::xPath()): enable providing an empty string as selector, to simply get the node that the selector is applied to.

v3.0.3

11 Dec 12:50
4904ed4

Fixed

  • Improved fix for non UTF-8 characters in HTML documents declared as UTF-8.

v3.0.2

11 Dec 11:09

Fixed

  • When the new PHP 8.4 DOM API is used and HTML declared as UTF-8 contains characters that are not valid UTF-8, those characters are now removed instead of being replaced with a � character. This behaviour is consistent with the data returned by Symfony DomCrawler.
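
The difference between replacing and removing invalid bytes can be sketched with plain mbstring calls. This is only an illustration of the two behaviours, assuming the mbstring extension is available; it is not how the library implements the fix, and the sample string is made up.

```php
<?php
// Illustration only: plain mbstring calls, not the crawler library's internals.

// Latin-1 encoded text: the byte \xE9 (é in Latin-1) is not valid UTF-8 here.
$declaredAsUtf8 = "caf\xE9 au lait";

// Variant 1: substitute each invalid sequence with U+FFFD (the � character).
mb_substitute_character(0xFFFD);
$replaced = mb_convert_encoding($declaredAsUtf8, 'UTF-8', 'UTF-8');

// Variant 2: drop invalid sequences entirely.
mb_substitute_character('none');
$removed = mb_convert_encoding($declaredAsUtf8, 'UTF-8', 'UTF-8');

var_dump($replaced);
var_dump($removed);
```

The second variant corresponds to the behaviour described above: the invalid byte simply disappears from the extracted text.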

v3.0.1

09 Dec 23:55

Undeprecated

  • Removed deprecations for all XPath functionality (Dom::xPath(), XPathQuery class and Node::queryXPath()), because it's still available with the new DOM API in PHP 8.4.

v3.0.0

08 Dec 22:45

The primary change in version 3.0.0 is that the library now leverages PHP 8.4’s new DOM API when used in an environment with PHP >= 8.4. To maintain compatibility with PHP < 8.4, an abstraction layer has been implemented. This layer dynamically uses either the Symfony DomCrawler component or the new DOM API, depending on the PHP version.
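
The idea of such a version-dependent layer can be sketched as follows. This is a minimal, hypothetical helper and not the library's actual abstraction: firstHeadingText() is an invented name, and the legacy DOMDocument class stands in for Symfony DomCrawler so that the example has no dependencies.

```php
<?php
// Hypothetical helper for illustration; not part of the crawler library.
function firstHeadingText(string $html): ?string
{
    if (PHP_VERSION_ID >= 80400 && class_exists(\Dom\HTMLDocument::class)) {
        // PHP 8.4+: new DOM API with an HTML5 parser and CSS selector support.
        $document = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
        $heading = $document->querySelector('h1');
    } else {
        // Older PHP: the legacy DOM extension, used here as a stand-in
        // for the Symfony DomCrawler component that the library falls back to.
        $document = new \DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $document->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        $heading = $document->getElementsByTagName('h1')->item(0);
    }

    return $heading === null ? null : trim($heading->textContent);
}

echo firstHeadingText('<!DOCTYPE html><html><body><h1> Hello </h1></body></html>');
```

Callers only see firstHeadingText(); which parser runs underneath is decided once, based on the PHP version.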

Since no direct interaction with an instance of the Symfony DomCrawler library was required at the step level provided by the library, it is highly likely that you won’t need to make any changes to your code to upgrade to v3. To ensure a smooth transition, please review the points under “Changed.”

If you're using XPath queries for data extraction, please try to switch to using CSS selectors instead, because XPath is no longer supported by the new DOM API. Therefore, XPath-related functionality was deprecated in this version of the library and will probably be removed in the next major version.

Changed

  • BREAKING: The DomQuery::innerText() method (a.k.a. Dom::cssSelector('...')->innerText()) has been removed. innerText exists only in the Symfony DomCrawler component, and its usefulness is questionable. If you still require this variant of the DOM element text, please let us know or create a pull request yourself. Thank you!
  • BREAKING: The DomQueryInterface was removed. As the DomQuery class offers a lot more functionality than the interface defines, the purpose of the interface was questionable. Please use the abstract DomQuery class instead. This also means that some method signatures that type-hinted the interface have changed. Look for occurrences of DomQueryInterface and replace them.
  • BREAKING: The visibility of the DomQuery::filter() method was changed from public to protected. It is still needed in the DomQuery class, but outside of it, it is probably better and easier to directly use the new DOM abstraction (see the src/Steps/Dom directory). If you are extending the DomQuery class (which is not recommended), be aware that the method now takes a Node (from the new DOM abstraction) as its argument instead of a Symfony Crawler.
  • BREAKING: The Step::validateAndSanitizeToDomCrawlerInstance() method was removed. Please use the Step::validateAndSanitizeToHtmlDocumentInstance() and Step::validateAndSanitizeToXmlDocumentInstance() methods instead.
  • BREAKING: The second argument in Closures passed to Http::crawl()->customFilter() has changed from an instance of the Symfony Crawler class to an HtmlElement instance from the new DOM abstraction (Crwlr\Crawler\Steps\Dom\HtmlElement).
  • BREAKING: The Filter class was split into AbstractFilter (base class for the actual filter classes) and Filter, which now only hosts the static functions for easy instantiation. Otherwise, each filter class would also have all the static methods.
  • BREAKING: Further, the signatures of some methods that are mainly here for internal usage, have changed due to the new DOM abstraction:
    • The static GetLink::isSpecialNonHttpLink() method now needs an instance of HtmlElement instead of a Symfony Crawler.
    • GetUrlsFromSitemap::fixUrlSetTag() now takes an XmlDocument instead of a Symfony Crawler.
    • The DomQuery::apply() method now takes a Node instead of a Symfony Crawler.

Deprecated

  • Dom::xPath() method and
  • the XPathQuery class as well as
  • the new Node::queryXPath() method.

Added

  • New step output filter Filter::arrayHasElement(). When a step produces array output with a property that is a numeric array, you can now filter outputs by checking whether at least one element of that array property matches certain filter criteria. Example: The outputs look like ['foo' => 'bar', 'baz' => ['one', 'two', 'three']]. You can keep only the outputs where baz contains two like this: Filter::arrayHasElement()->where('baz', Filter::equal('two')).
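
The semantics of this filter can be illustrated in plain PHP. The helper keepWhereArrayHasElement() below is hypothetical and only mirrors the behaviour described above (an output is kept when at least one element of the array property matches); it is not the library's implementation.

```php
<?php
// Plain-PHP illustration of the filter semantics; hypothetical helper, not library code.

/**
 * @param array<int, array<string, mixed>> $outputs
 * @return array<int, array<string, mixed>>
 */
function keepWhereArrayHasElement(array $outputs, string $key, callable $matches): array
{
    return array_values(array_filter(
        $outputs,
        static function (array $output) use ($key, $matches): bool {
            if (!isset($output[$key]) || !is_array($output[$key])) {
                return false; // property missing or not an array: drop the output
            }

            foreach ($output[$key] as $element) {
                if ($matches($element)) {
                    return true; // one matching element is enough
                }
            }

            return false;
        }
    ));
}

$outputs = [
    ['foo' => 'bar', 'baz' => ['one', 'two', 'three']],
    ['foo' => 'qux', 'baz' => ['four', 'five']],
];

// Keep only the outputs whose 'baz' array contains the value 'two'.
$kept = keepWhereArrayHasElement($outputs, 'baz', static fn ($value): bool => $value === 'two');

print_r($kept);
```

Here only the first output is kept, because its baz array contains two.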

v2.1.3

05 Nov 20:00

Fixed

  • Improvements for deprecations in PHP 8.4.

v2.1.2

22 Oct 10:12

Fixed

  • Issue when converting cookie objects received from the chrome-php library. (Thx to @szepeviktor #168)

v2.1.1

21 Oct 13:49

Fixed

  • Also add cookies, set during headless browser usage, to the cookie jar. When switching back to the (guzzle) HTTP client the cookies should also be sent.
  • Don't call Loader::afterLoad() when Loader::beforeLoad() was not called before. This can potentially happen when an exception is thrown before the call to the beforeLoad hook, but it is caught and the afterLoad hook method is called anyway. As this most likely won't make sense to users, the afterLoad hook callback functions will just not be called in this case.
  • The Throttler class now has protected methods _internalTrackStartFor(), _requestToUrlWasStarted() and _internalTrackEndFor(). When extending the Throttler class (be careful, actually that's not really recommended) they can be used to check if a request to a URL was actually started before.

v2.1.0

18 Oct 22:36

Added

  • The new postBrowserNavigateHook() method in the Http step classes, which allows you to define callback functions that are triggered after the headless browser has navigated to the specified URL. They are called with the chrome-php Page object as argument, so you can interact with the page. Also, there is a new class BrowserAction providing some simple actions (like wait for element, click element, ...) as Closures via static methods. You can use it like Http::get()->postBrowserNavigateHook(BrowserAction::clickElement('#element')).

v2.0.1

15 Oct 20:06

Fixed

  • Issue with the afterLoad hook of the HttpLoader, introduced in v2. Calling the hook was commented out, which slipped through because the test case was faulty.