All notable changes to this project will be documented in this file.
- Mask Playwright's "headless" headers (#545, closes #401) (d1445e4) by @vdusek
- Add new model for HttpHeaders (#544) (854f2c1) by @vdusek
- Call error_handler for SessionError (#557, closes #546) (e75ac4b) by @vdusek
- Extend from StrEnum in RequestState to fix serialization (#556, closes #551) (6bf35ba) by @vdusek
- Add equality check to UserData model (#562) (899a25c) by @janbuchar
0.3.7 (2024-09-25)
- Improve Request.user_data serialization (#540, closes #524) (de29c0e) by @janbuchar
- Adopt new version of curl-cffi (#543) (f6fcf48) by @vdusek
0.3.6 (2024-09-19)
- Add HTTP/2 support for HTTPX client (#513, closes #512) (0eb0a33) by @vdusek
- Expose extended unique key when creating a new Request (#515) (1807f41) by @vdusek
- Add header generator and integrate it into HTTPX client (#530, closes #402) (b63f9f9) by @vdusek
0.3.5 (2024-09-10)
- Memory usage limit configuration via environment variables (#502) (c62e554) by @janbuchar
- HTTP clients detect 4xx as errors by default (#498, closes #496) (1895dca) by @vdusek
- Correctly handle log level configuration (#508) (7ea8fe6) by @janbuchar
0.3.4 (2024-09-05)
0.3.3 (2024-09-05)
- Deduplicate requests by unique key before submitting them to the queue (#499) (6a3e0e7) by @janbuchar
0.3.2 (2024-09-02)
- Double incrementation of item_count (#443, closes #442) (cd9adf1) by @cadlagtrader
- Field alias in BatchRequestsOperationResponse (#485) (126a862) by @janbuchar
- JSON handling with Parsel (#490, closes #488) (ebf5755) by @janbuchar
0.3.1 (2024-08-30)
0.3.0 (2024-08-27)
- Implement ParselCrawler that adds support for Parsel (#348, closes #335) (a3832e5) by @asymness
- Add support for filling a web form (#453, closes #305) (5a125b4) by @vdusek
- Remove indentation from statistics logging and print the data in tables (#322, closes #306) (359b515) by @TymeeK
- Remove redundant log, fix format (#408) (8d27e39) by @janbuchar
- Dequeue items from RequestQueue in the correct order (#411) (96fc33e) by @janbuchar
- Support relative URLs and skip values that are not URLs (#431, closes #417) (ccd8145) by @black7375
- Typo in ProlongRequestLockResponse (#458) (30ccc3a) by @janbuchar
- Add missing __all__ to top-level __init__.py file (#463) (353a1ce) by @janbuchar
- [breaking] RequestQueue and service management rehaul (#429, closes #83, #174, #203, #423) (b155a9f) by @janbuchar
- [breaking] Declare private and public interface (#456) (d6738df) by @vdusek
0.2.1 (2024-08-05)
0.2.0 (2024-08-05)
- Add new curl impersonate HTTP client (#387, closes #292) (9c06260) by @vdusek
- (playwright) infinite_scroll helper (#393) (34f74bd) by @janbuchar
0.1.2 (2024-07-30)
- Minor log fix (#341) (0688bf1) by @souravjain540
- Also use error_handler for context pipeline errors (#331, closes #296) (7a66445) by @janbuchar
- Strip whitespace from href in enqueue_links (#346, closes #337) (8a3174a) by @janbuchar
- Warn instead of crashing when an empty dataset is being exported (#342, closes #334) (22b95d1) by @janbuchar
- Avoid GitHub rate limiting in project bootstrapping test (#364) (992f07f) by @janbuchar
- Pass crawler configuration to storages (#375) (b2d3a52) by @janbuchar
- Purge request queue on repeated crawler runs (#377, closes #152) (7ad3d69) by @janbuchar
0.1.1 (2024-07-19)
- Expose crawler log (#316, closes #303) (ae475fa) by @vdusek
- Integrate proxies into PlaywrightCrawler (#325) (2e072b6) by @vdusek
- Blocking detection for playwright crawler (#328, closes #239) (49ff6e2) by @vdusek
- Pylance reportPrivateImportUsage errors (#313, closes #283) (09d7203) by @vdusek
- Set httpx logging to warning (#314, closes #302) (1585def) by @vdusek
- Byte size serialization in MemoryInfo (#245) (a030174) by @janbuchar
- Project bootstrapping in existing folder (#318, closes #301) (c630818) by @janbuchar
0.1.0 (2024-07-08)
- Project templates (#237, closes #215) (c23c12c) by @janbuchar
- CLI UX improvements (#271, closes #267) (123d515) by @janbuchar
- Error handling in CLI and templates documentation (#273, closes #268) (61083c3) by @vdusek
0.0.7 (2024-06-27)
- Do not wait for consistency in request queue (#235) (03ff138) by @vdusek
- Selector handling in BeautifulSoupCrawler enqueue_links (#231, closes #230) (896501e) by @janbuchar
- Handle blocked request (#234) (f8ef79f) by @Mantisus
- Improve AutoscaledPool state management (#241, closes #236) (fdea3d1) by @janbuchar
0.0.6 (2024-06-25)
- Maintain a global configuration instance (#207) (e003aa6) by @janbuchar
- Add max requests per crawl to BasicCrawler (#198) (b5b3053) by @vdusek
- Add support for decompressing br response content (#226) (a3547b9) by @Mantisus
- BasicCrawler.export_data helper (#222, closes #211) (237ec78) by @janbuchar
- Automatic logging setup (#229, closes #214) (a67b72f) by @janbuchar
- Handling of relative URLs in add_requests (#213, closes #202, #204) (8aa8c57) by @janbuchar
- Graceful exit in BasicCrawler.run (#224, closes #212) (337286e) by @janbuchar
0.0.5 (2024-06-21)
- Browser rotation and better browser abstraction (#177, closes #131) (a42ae6f) by @vdusek
- Add emit persist state event to event manager (#181) (97f6c68) by @vdusek
- Batched request addition in RequestQueue (#186) (f48c806) by @vdusek
- Add storage helpers to crawler & context (#192, closes #98, #100, #172) (f8f4066) by @vdusek
- Handle all supported configuration options (#199, closes #84) (23c901c) by @janbuchar
- Add Playwright's enqueue links helper (#196) (849d73c) by @vdusek
- Tmp path in tests is working (#164, closes #159) (382b6f4) by @vdusek
- Add explicit error messages for missing package extras during import (#165, closes #155) (200ebfa) by @vdusek
- Make timedelta_ms accept string-encoded numbers (#190) (d8426ff) by @janbuchar
- (deps) Update dependency psutil to v6 (#193) (eb91f51) by @renovate[bot]
- Improve compatibility between ProxyConfiguration and its SDK counterpart (#201) (1a76124) by @janbuchar
- Correct return type of storage get_info methods (#200) (332673c) by @janbuchar
- Type error in statistics persist state (#206, closes #194) (96ceef6) by @vdusek
0.0.4 (2024-05-30)
- Capture statistics about the crawler run (#142, closes #97) (eeebe9b) by @janbuchar
- Proxy configuration (#156, closes #136) (5c3753a) by @janbuchar
- Add first version of browser pool and playwright crawler (#161) (2d2a050) by @vdusek
0.0.3 (2024-05-13)
- AutoscaledPool implementation (#55, closes #19) (621ada2) by @janbuchar
- Add Snapshotter (#20) (492ee38) by @vdusek
- Implement BasicCrawler (#56, closes #30) (6da971f) by @janbuchar
- BeautifulSoupCrawler (#107, closes #31) (4974dfa) by @janbuchar
- add_requests and enqueue_links context helpers (#120, closes #5) (dc850a5) by @janbuchar
- Use SessionPool in BasicCrawler (#128, closes #110) (9fc4648) by @janbuchar
- Add base storage client and resource subclients (#138) (44d6597) by @vdusek
- (deps) Update dependency docutils to ^0.21.0 (#101) (534b613) by @renovate[bot]
- (deps) Update dependency eval-type-backport to ^0.2.0 (#124) (c9e69a8) by @renovate[bot]
- Fire local SystemInfo events every second (#144) (f1359fa) by @vdusek
- Storage manager & purging the defaults (#150) (851042f) by @vdusek