- Add -l/--log-directory option to add logs directory to WACZ
- include request cookie in cdxj via 'req.http:cookie' field (#27)
- fix Click dependency version
- wacz zip write: ensure zip file is fully closed on exit (fixes #20)
- ci: add ci for py3.10
- wacz create: support --url, --detect-pages and --split-seeds to write detected pages to extraPages.jsonl and the specified seeds to pages.jsonl
- text extract: don't raise exception, keep parsed text
- Pages: also ignore pages with invalid utf-8 encoding
- Pages: read pages line by line in case of large pages file
- Pages: More lenient page parsing: on a parsing error, print the error and continue, ignoring the invalid page
- Pages: Fix parsing of page URLs that contain extra ':'
- More efficient hash computation
- Add support for signing and verification!
- Ensure passed-in pages are checked via both http and https URLs
- Update to cdxj-indexer 1.4.1, supporting improved indexing of JSON POST requests
- Add `name` field to `resources` for better compatibility with frictionless spec.
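
The lenient, line-by-line page reading described in the Pages entries above can be illustrated roughly as follows. This is a minimal sketch, not the project's actual code: the function name `iter_valid_pages` is hypothetical, and it assumes the pages file holds one JSON object per line.

```python
import json
from typing import Iterator


def iter_valid_pages(path: str) -> Iterator[dict]:
    """Yield page records from a JSON-lines file, skipping lines that fail to parse.

    Hypothetical helper for illustration only.
    """
    # Read in binary mode and decode per line, so a single badly encoded
    # line cannot abort the whole read, and large files are never loaded at once.
    with open(path, "rb") as fh:
        for line_no, raw in enumerate(fh, start=1):
            if not raw.strip():
                continue  # ignore blank lines
            try:
                page = json.loads(raw.decode("utf-8"))
            except (UnicodeDecodeError, json.JSONDecodeError) as err:
                # Print the error and continue, ignoring the invalid page.
                print(f"Skipping invalid page on line {line_no}: {err}")
                continue
            if isinstance(page, dict):
                yield page
```

Because the function is a generator, callers can stream through a very large pages file while holding only one line in memory at a time.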
Improved compatibility with frictionless data spec
- Top-level `title`, `description`, `created`, `software` fields and optional `mainPageURL` and `mainPageTS` fields.
- Include full WARC record digests in `recordDigest` field in CDX, `digest` in IDX
- Support for `pages/extraPages.jsonl` passed in via --extra-pages/-e flag
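
To give a sense of how the metadata fields named above might sit together, here is a rough sketch in Python. Only the field names (`title`, `description`, `created`, `software`, `mainPageURL`, `mainPageTS`, `resources`, `name`) come from these notes; every value, the timestamp format, and the `path` key (taken from the frictionless resource convention) are assumptions for illustration.

```python
import json

# All values below are hypothetical placeholders.
package_metadata = {
    "title": "Example Collection",
    "description": "An example web archive",
    "created": "2021-01-01T00:00:00Z",       # assumed ISO 8601 format
    "software": "example-tool 0.0.0",         # hypothetical tool name
    "mainPageURL": "https://example.com/",    # optional
    "mainPageTS": "20210101000000",           # optional; format is an assumption
    "resources": [
        # Each resource carries a `name` alongside its `path`.
        {"name": "pages.jsonl", "path": "pages/pages.jsonl"},
        {"name": "extraPages.jsonl", "path": "pages/extraPages.jsonl"},
    ],
}

serialized = json.dumps(package_metadata, indent=2)
```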