All notable changes to this project will be documented in this file.
0.5.1 (2025-01-07)
- Make result of RequestList.is_empty independent of fetch_next_request calls (#876) (d50249e) by @janbuchar
0.5.0 (2025-01-02)
- Add possibility to use None as no proxy in tiered proxies (#760) (0fbd017) by @Pijukatel, closes #687
- Add use_state context method (#682) (868b41e) by @Mantisus, closes #191
- Add pre-navigation hooks router to AbstractHttpCrawler (#791) (0f23205) by @Pijukatel, closes #635
- Add example of how to integrate Camoufox into PlaywrightCrawler (#789) (246cfc4) by @Pijukatel, closes #684
- Expose event types, improve on/emit signature, allow parameterless listeners (#800) (c102c4c) by @janbuchar, closes #561
- Add stop method to BasicCrawler (#807) (6d01af4) by @Pijukatel, closes #651
- Add html_to_text helper function (#792) (2b9d970) by @Pijukatel, closes #659
- [breaking] Implement RequestManagerTandem, remove add_request from RequestList, accept any iterable in RequestList constructor (#777) (4172652) by @janbuchar
- Fix circular import in KeyValueStore (#805) (8bdf49d) by @Mantisus, closes #804
- [breaking] Refactor service usage to rely on service_locator (#691) (1d31c6c) by @vdusek, closes #369, #539, #699
- Pass verify in httpx client (#802) (074d083) by @Mantisus, closes #798
- Fix page_options for PlaywrightBrowserPlugin (#796) (bd3bdd4) by @Mantisus, closes #755
- Fix event migrating handler in RequestQueue (#825) (fd6663f) by @Mantisus, closes #815
- Respect user configuration for work with status codes (#812) (8daf4bd) by @Mantisus, closes #708, #756
- abort-on-error for successive runs (#834) (0cea673) by @Mantisus
- Relax ServiceLocator restrictions (#837) (aa3667f) by @janbuchar, closes #806
- Fix typo in exports (#841) (8fa6ac9) by @janbuchar
- [breaking] Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance (#746) (9d3c269) by @Pijukatel, closes #350
- [breaking] Remove json_ and order_no from Request (#788) (5381d13) by @Mantisus, closes #94
- [breaking] Rename PwPreNavContext to PwPreNavCrawlingContext (#827) (84b61a3) by @vdusek
- [breaking] Rename PlaywrightCrawler kwargs: browser_options, page_options (#831) (ffc6048) by @Pijukatel
- [breaking] Update the crawlers & storage clients structure (#828) (0ba04d1) by @vdusek, closes #764
0.4.5 (2024-12-06)
- Add upper bound of HTTPX version (#775) (b59e34d) by @vdusek
- Fix incorrect use of desired concurrency ratio (#780) (d1f8bfb) by @Pijukatel, closes #759
- Remove pydantic constraint <2.10.0 and update timedelta validator, serializer type hints (#757) (c0050c0) by @Pijukatel
0.4.4 (2024-11-29)
- Expose browser_options and page_options to PlaywrightCrawler (#730) (dbe85b9) by @vdusek, closes #719
- Add abort_on_error property (#731) (6dae03a) by @Mantisus, closes #704
0.4.3 (2024-11-21)
- Pydantic 2.10.0 issues (#716) (8d8b3fc) by @Pijukatel
0.4.2 (2024-11-20)
- Respect custom HTTP headers in PlaywrightCrawler (#685) (a84125f) by @Mantisus
- Fix serialization payload in Request. Fix Docs for Post Request (#683) (e8b4d2d) by @Mantisus, closes #668
- Accept string payload in the Request constructor (#697) (19f5add) by @vdusek
- Fix snapshots handling (#692) (4016c0d) by @Pijukatel
0.4.1 (2024-11-11)
- Add max_crawl_depth option to BasicCrawler (#637) (77deaa9) by @Prathamesh010, closes #460
- Add BeautifulSoupParser type alias (#674) (b2cf88f) by @Pijukatel
- Fix total_size usage in memory size monitoring (#661) (c2a3239) by @janbuchar
- Add HttpHeaders to module exports (#664) (f0c5ca7) by @vdusek, closes #663
- Fix unhandled ValueError in request handler result processing (#666) (0a99d7f) by @janbuchar
- Fix BaseDatasetClient.iter_items type hints (#680) (a968b1b) by @Pijukatel
0.4.0 (2024-11-01)
- [breaking] Add headers in unique key computation (#609) (6c4746f) by @Prathamesh010, closes #548
- Add pre_navigation_hooks to PlaywrightCrawler (#631) (5dd5b60) by @Prathamesh010, closes #427
- Add always_enqueue option to bypass URL deduplication (#621) (4e59fa4) by @Rutam21, closes #547
- Split and add extra configuration to export_data method (#580) (6751635) by @deshansh, closes #526
- Use strip in headers normalization (#614) (a15b21e) by @vdusek
- [breaking] Merge payload and data fields of Request (#542) (d06fcef) by @vdusek, closes #560
- Default ProxyInfo port if httpx.URL port is None (#619) (8107a6f) by @steffansafey, closes #618
0.3.9 (2024-10-23)
- Key-value store context helpers (#584) (fc15622) by @janbuchar
- Added get_public_url method to KeyValueStore (#572) (3a4ba8f) by @akshay11298, closes #514
- Workaround for JSON value typing problems (#581) (403496a) by @janbuchar, closes #563
0.3.8 (2024-10-02)
- Mask Playwright's "headless" headers (#545) (d1445e4) by @vdusek, closes #401
- Add new model for HttpHeaders (#544) (854f2c1) by @vdusek
- Call error_handler for SessionError (#557) (e75ac4b) by @vdusek, closes #546
- Extend from StrEnum in RequestState to fix serialization (#556) (6bf35ba) by @vdusek, closes #551
- Add equality check to UserData model (#562) (899a25c) by @janbuchar
0.3.7 (2024-09-25)
- Improve Request.user_data serialization (#540) (de29c0e) by @janbuchar, closes #524
- Adopt new version of curl-cffi (#543) (f6fcf48) by @vdusek
0.3.6 (2024-09-19)
- Add HTTP/2 support for HTTPX client (#513) (0eb0a33) by @vdusek, closes #512
- Expose extended unique key when creating a new Request (#515) (1807f41) by @vdusek
- Add header generator and integrate it into HTTPX client (#530) (b63f9f9) by @vdusek, closes #402
0.3.5 (2024-09-10)
- Memory usage limit configuration via environment variables (#502) (c62e554) by @janbuchar
- Http clients detect 4xx as errors by default (#498) (1895dca) by @vdusek, closes #496
- Correctly handle log level configuration (#508) (7ea8fe6) by @janbuchar
0.3.4 (2024-09-05)
0.3.3 (2024-09-05)
- Deduplicate requests by unique key before submitting them to the queue (#499) (6a3e0e7) by @janbuchar
0.3.2 (2024-09-02)
- Double incrementation of item_count (#443) (cd9adf1) by @cadlagtrader, closes #442
- Field alias in BatchRequestsOperationResponse (#485) (126a862) by @janbuchar
- JSON handling with Parsel (#490) (ebf5755) by @janbuchar, closes #488
0.3.1 (2024-08-30)
0.3.0 (2024-08-27)
- Implement ParselCrawler that adds support for Parsel (#348) (a3832e5) by @asymness, closes #335
- Add support for filling a web form (#453) (5a125b4) by @vdusek, closes #305
- Remove indentation from statistics logging and print the data in tables (#322) (359b515) by @TymeeK, closes #306
- Remove redundant log, fix format (#408) (8d27e39) by @janbuchar
- Dequeue items from RequestQueue in the correct order (#411) (96fc33e) by @janbuchar
- Relative URLS supports & If not a URL, pass #417 (#431) (ccd8145) by @black7375, closes #417
- Typo in ProlongRequestLockResponse (#458) (30ccc3a) by @janbuchar
- Add missing __all__ to top-level __init__.py file (#463) (353a1ce) by @janbuchar
- [breaking] RequestQueue and service management rehaul (#429) (b155a9f) by @janbuchar, closes #83, #174, #203, #423
- [breaking] Declare private and public interface (#456) (d6738df) by @vdusek
0.2.1 (2024-08-05)
0.2.0 (2024-08-05)
- Add new curl impersonate HTTP client (#387) (9c06260) by @vdusek, closes #292
- playwright: infinite_scroll helper (#393) (34f74bd) by @janbuchar
0.1.2 (2024-07-30)
- Minor log fix (#341) (0688bf1) by @souravjain540
- Also use error_handler for context pipeline errors (#331) (7a66445) by @janbuchar, closes #296
- Strip whitespace from href in enqueue_links (#346) (8a3174a) by @janbuchar, closes #337
- Warn instead of crashing when an empty dataset is being exported (#342) (22b95d1) by @janbuchar, closes #334
- Avoid GitHub rate limiting in project bootstrapping test (#364) (992f07f) by @janbuchar
- Pass crawler configuration to storages (#375) (b2d3a52) by @janbuchar
- Purge request queue on repeated crawler runs (#377) (7ad3d69) by @janbuchar, closes #152
0.1.1 (2024-07-19)
- Expose crawler log (#316) (ae475fa) by @vdusek, closes #303
- Integrate proxies into PlaywrightCrawler (#325) (2e072b6) by @vdusek
- Blocking detection for playwright crawler (#328) (49ff6e2) by @vdusek, closes #239
- Pylance reportPrivateImportUsage errors (#313) (09d7203) by @vdusek, closes #283
- Set httpx logging to warning (#314) (1585def) by @vdusek, closes #302
- Byte size serialization in MemoryInfo (#245) (a030174) by @janbuchar
- Project bootstrapping in existing folder (#318) (c630818) by @janbuchar, closes #301
0.1.0 (2024-07-08)
- Project templates (#237) (c23c12c) by @janbuchar, closes #215
- CLI UX improvements (#271) (123d515) by @janbuchar, closes #267
- Error handling in CLI and templates documentation (#273) (61083c3) by @vdusek, closes #268
0.0.7 (2024-06-27)
- Do not wait for consistency in request queue (#235) (03ff138) by @vdusek
- Selector handling in BeautifulSoupCrawler enqueue_links (#231) (896501e) by @janbuchar, closes #230
- Handle blocked request (#234) (f8ef79f) by @Mantisus
- Improve AutoscaledPool state management (#241) (fdea3d1) by @janbuchar, closes #236
0.0.6 (2024-06-25)
- Maintain a global configuration instance (#207) (e003aa6) by @janbuchar
- Add max requests per crawl to BasicCrawler (#198) (b5b3053) by @vdusek
- Add support for decompressing br response content (#226) (a3547b9) by @Mantisus
- BasicCrawler.export_data helper (#222) (237ec78) by @janbuchar, closes #211
- Automatic logging setup (#229) (a67b72f) by @janbuchar, closes #214
- Handling of relative URLs in add_requests (#213) (8aa8c57) by @janbuchar, closes #202, #204
- Graceful exit in BasicCrawler.run (#224) (337286e) by @janbuchar, closes #212
0.0.5 (2024-06-21)
- Browser rotation and better browser abstraction (#177) (a42ae6f) by @vdusek, closes #131
- Add emit persist state event to event manager (#181) (97f6c68) by @vdusek
- Batched request addition in RequestQueue (#186) (f48c806) by @vdusek
- Add storage helpers to crawler & context (#192) (f8f4066) by @vdusek, closes #98, #100, #172
- Handle all supported configuration options (#199) (23c901c) by @janbuchar, closes #84
- Add Playwright's enqueue links helper (#196) (849d73c) by @vdusek
- Tmp path in tests is working (#164) (382b6f4) by @vdusek, closes #159
- Add explicit err msgs for missing pckg extras during import (#165) (200ebfa) by @vdusek, closes #155
- Make timedelta_ms accept string-encoded numbers (#190) (d8426ff) by @janbuchar
- deps: Update dependency psutil to v6 (#193) (eb91f51) by @renovate[bot]
- Improve compatibility between ProxyConfiguration and its SDK counterpart (#201) (1a76124) by @janbuchar
- Correct return type of storage get_info methods (#200) (332673c) by @janbuchar
- Type error in statistics persist state (#206) (96ceef6) by @vdusek, closes #194
0.0.4 (2024-05-30)
- Capture statistics about the crawler run (#142) (eeebe9b) by @janbuchar, closes #97
- Proxy configuration (#156) (5c3753a) by @janbuchar, closes #136
- Add first version of browser pool and playwright crawler (#161) (2d2a050) by @vdusek
0.0.3 (2024-05-13)
- AutoscaledPool implementation (#55) (621ada2) by @janbuchar, closes #19
- Add Snapshotter (#20) (492ee38) by @vdusek
- Implement BasicCrawler (#56) (6da971f) by @janbuchar, closes #30
- BeautifulSoupCrawler (#107) (4974dfa) by @janbuchar, closes #31
- add_requests and enqueue_links context helpers (#120) (dc850a5) by @janbuchar, closes #5
- Use SessionPool in BasicCrawler (#128) (9fc4648) by @janbuchar, closes #110
- Add base storage client and resource subclients (#138) (44d6597) by @vdusek
- deps: Update dependency docutils to ^0.21.0 (#101) (534b613) by @renovate[bot]
- deps: Update dependency eval-type-backport to ^0.2.0 (#124) (c9e69a8) by @renovate[bot]
- Fire local SystemInfo events every second (#144) (f1359fa) by @vdusek
- Storage manager & purging the defaults (#150) (851042f) by @vdusek