Releases: spider-rs/spider
v2.16.0
What's Changed
- Chrome crawls now track the total bytes used over the network.
- Improved the ignore list for unwanted requests during Chrome interception.
Full Changelog: v2.15.0...v2.16.0
v2.15.0
What's Changed
Potentially major performance increase for Chrome crawling by blocking extra unwanted XHR requests and scripts.
- perf(chrome): add xhr interception
Full Changelog: v2.14.0...v2.15.0
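To illustrate the kind of decision the v2.15.0 XHR interception implies, here is a minimal sketch of a request filter. It is not spider's implementation: the `ResourceKind` enum, the `IGNORE_PATTERNS` values, and the `should_block` function are all hypothetical names chosen for this example, and the blocking policy is an assumption.

```rust
/// Kinds of network requests a headless browser can report.
/// Hypothetical enum for illustration; not spider's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceKind {
    Document,
    Script,
    Xhr,
    Image,
    Stylesheet,
}

/// URL fragments treated as unwanted (assumed sample values).
const IGNORE_PATTERNS: &[&str] = &["google-analytics.com", "/ads/", "doubleclick.net"];

/// Decide whether an intercepted request should be blocked.
pub fn should_block(kind: ResourceKind, url: &str) -> bool {
    match kind {
        // Always let the main document through.
        ResourceKind::Document => false,
        // Assumed policy: images and stylesheets are not needed to discover links.
        ResourceKind::Image | ResourceKind::Stylesheet => true,
        // Scripts and XHR calls are blocked only when they match the ignore list.
        ResourceKind::Script | ResourceKind::Xhr => {
            IGNORE_PATTERNS.iter().any(|pattern| url.contains(pattern))
        }
    }
}
```

Calling `should_block(ResourceKind::Xhr, "https://www.google-analytics.com/collect")` returns `true`, while an XHR to `https://example.com/api/items` is allowed. Keeping the policy in one pure function makes it easy to test without a browser.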
v2.14.0
Release Notes
Features
- feat(transform): add transform_content_send for async streaming.
Improvements
- chore(interning): add optional string-interning.
- chore(website): fix crawl, establish domain removal [#233].
- chore(transform): add streaming markdown/commonmark transforming.
- chore(transform): add streaming text transforming.
- chore(chrome): add request interception analytics ignore.
Bug Fixes
- chore(page): fix URL encode handling mismatch.
- chore(transform): fix repeated text streaming.
- chore(page): fix page link return with full URLs.
- chore(website): fix crawl delay handling.
- perf(website): reduce extra context switching on crawls.
Thank you for the help @Revertron!
Full Changelog: v2.13.78...v2.14.0
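The optional string-interning item in v2.14.0 refers to a general technique: storing each distinct string once and handing out shared handles. The sketch below shows the idea with the standard library only. The `Interner` type and its methods are hypothetical and are not taken from spider's code, which may use a dedicated interning crate.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// A minimal string interner (hypothetical; for illustration only).
#[derive(Default)]
pub struct Interner {
    seen: HashSet<Arc<str>>,
}

impl Interner {
    /// Return a shared handle for `s`, allocating only the first time it is seen.
    pub fn intern(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.seen.get(s) {
            // Already stored: hand out another reference to the same allocation.
            return Arc::clone(existing);
        }
        let shared: Arc<str> = Arc::from(s);
        self.seen.insert(Arc::clone(&shared));
        shared
    }

    /// Number of distinct strings stored.
    pub fn len(&self) -> usize {
        self.seen.len()
    }
}
```

Interning the same URL twice yields two handles to one allocation, which can reduce memory use when a crawl sees the same links many times.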
v2.13.78
What's Changed
- Fix infinite loop with backoff Gateway retries
- Fix limit handling break
Full Changelog: v2.13.64...v2.13.78
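The infinite-loop fix in v2.13.78 concerns retries on gateway errors. As a rough sketch of how such a loop can be bounded, the code below caps both the number of retries and the backoff delay. The function names, the set of status codes treated as gateway errors (502, 503, 504), and the retry policy are assumptions for this example, not spider's actual logic.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}

/// Call `fetch` until it returns a non-gateway status or the retry budget is spent.
/// Returns the last status code and the total number of attempts made.
pub fn fetch_with_retries<F>(
    mut fetch: F,
    max_retries: u32,
    base: Duration,
    max: Duration,
) -> (u16, u32)
where
    F: FnMut() -> u16,
{
    let mut attempts = 0;
    loop {
        let status = fetch();
        attempts += 1;
        // Assumed set of retryable gateway statuses.
        let is_gateway_error = matches!(status, 502 | 503 | 504);
        // The attempt counter guarantees the loop terminates.
        if !is_gateway_error || attempts > max_retries {
            return (status, attempts);
        }
        std::thread::sleep(backoff_delay(attempts - 1, base, max));
    }
}
```

With `max_retries = 3`, a server that always answers 502 is contacted four times (one initial request plus three retries) and then the loop returns instead of spinning forever.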
v2.13.64
What's Changed
Major fixes for critical bugs that can hang the process.
- perf reduce cpu usage for streaming rewriter
- fix hang on iteration streaming
- fix chrome connection hang
- fix cache backend default build
- fix domain absolute link join
- fix shutdown break loop
- add ignore protocol list
Full Changelog: v2.12.12...v2.13.64
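The ignore protocol list added in v2.13.64 can be pictured as a scheme filter applied to discovered links. The following is a small sketch under that assumption. The `IGNORED_PROTOCOLS` values and both function names are hypothetical and do not reflect the list spider ships.

```rust
/// Schemes that never lead to crawlable pages (assumed sample list).
const IGNORED_PROTOCOLS: &[&str] = &["mailto:", "tel:", "javascript:", "data:", "sms:"];

/// Return true when `href` does not start with an ignored protocol.
pub fn is_crawlable(href: &str) -> bool {
    // Schemes are case-insensitive, so compare in lowercase.
    let lower = href.trim().to_ascii_lowercase();
    !lower.is_empty() && !IGNORED_PROTOCOLS.iter().any(|p| lower.starts_with(p))
}

/// Keep only the links worth queueing, preserving their order.
pub fn filter_links<'a>(links: &[&'a str]) -> Vec<&'a str> {
    links.iter().copied().filter(|link| is_crawlable(link)).collect()
}
```

Given `["https://example.com/a", "mailto:hi@example.com", "/about"]`, `filter_links` keeps the first and last entries and drops the `mailto:` link.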
v2.12.12
Fix smart mode re-rendering and performance
- fix smart mode re-rendering inline js detection
- perf improve smart mode parsing
- fix encoding smart mode html
- add pin html pre-parsing
- add chrome status code check for performing full actions
Full Changelog: v2.11.20...v2.12.12
v2.11.20
What's Changed
Major crawl performance improvement from processing pending tasks concurrently. The initial crawl can now retrieve all Next.js SSG pages on websites that do not expose links and rely on dynamic event listeners for routing.
- fix loop blocking tasks
- improve crawl performance by processing tasks concurrently
- fix page absolute link joining
- add wait_for_dom to target element updates chrome
- add alert polyfill blocking prevention
- add missing chrome navigate request timeout for http future
- add ignore assets when crawling http
- add with_block_assets builder config for non-HTML server responses
- perf(chrome): add skip other resources
- feat(page): add nextjs build ssg path handling
Full Changelog: v2.11.0...v2.11.20
v2.10.27
What's Changed
- fix protocol handling valid links to crawl
- fix subdomains and tld handling matching
- add empty server response retry
- add initial request status code storing
- fix auto-encoding detection for html
- fix openai compile and fs compile
- add Layui UI JS framework and jQuery handling in smart mode
- chore(transforms): add optional ignore tags
- chore(budget): fix whitelist/blacklist budgeting
- chore(smart): fix whitelist/blacklist establish
- chore(openai): add json_schema option gpt configs
Full Changelog: v2.10.6...v2.10.27
v2.10.6
What's Changed
- add html lang auto encoding handling to improve detection
- add exclude_selector and root_selector transformations output formats
- add bin file handling to prevent SOF transformations
- chore(chrome): fix window navigator stealth handling
- chore: fix subdomains and tld handling
- chore(chrome): add automation all routes handling
Full Changelog: v2.9.15...v2.10.6
v2.9.15
What's Changed
- add XPath data extraction support to spider_utils
- add XML return format for spider_transformations
- chore(transformations): add root selector across formats #219
Example of extracting data via XPath:
let map = QueryCSSMap::from([(
"list",
QueryCSSSelectSet::from(["//*[@class='list']"]),
)]);
let data = css_query_select_map_streamed(
r#"<html><body><ul class="list"><li>Test</li></ul></body></html>"#,
&build_selectors(map),
)
.await;
assert!(!data.is_empty(), "Xpath extraction failed");
Full Changelog: v2.8.28...v2.9.15