1.1.0
Code
- Default throttling for downloaders is now set to a maximum of 300 requests per second.
- Downloader now takes a client for downloading. Two clients currently exist:
  - s3 -> Directly queries the Common Crawl buckets
  - api -> Queries the CommonCrawl API Gateway
- Retry system has been updated to leverage tenacity. Additionally, we now use random exponential backoff instead of linear random backoff.
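
The difference between the two backoff strategies can be illustrated with a minimal, standard-library-only sketch. It is not the project's retry code: the function names, the `base` and `cap` values, and the `call_with_retries` helper are all assumptions made for illustration. tenacity offers a wait strategy called `wait_random_exponential` that follows the same idea, but these notes do not state which tenacity options the project actually uses.

```python
import random
import time


def linear_random_wait(attempt: int, step: float = 1.0) -> float:
    """Linear random backoff: the upper bound grows by `step` per attempt."""
    return random.uniform(0, step * attempt)


def exponential_random_wait(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Random exponential backoff: the upper bound doubles per attempt, up to `cap`."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(func, max_attempts: int = 5, sleep=time.sleep):
    """Hypothetical helper: call `func`, waiting with random exponential backoff between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            # Give up and re-raise once the last allowed attempt has failed.
            if attempt == max_attempts:
                raise
            sleep(exponential_random_wait(attempt))
```

With these assumed defaults, the wait before the fourth retry is drawn from at most 4 seconds under the linear scheme and at most 16 seconds under the exponential one, so repeated failures back off much faster while the randomness still spreads out concurrent clients.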
CLI
- New global parameter --aws_profile for setting the AWS profile to use.
- New parameter --download_method, which can be set for:
  - extract...records --download_method
  - download...html --download_method

  In both cases the argument can be set to either s3 or api, which defines how Common Crawl is accessed when downloading WARC files.
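
How a global option and a per-command option of this kind fit together can be sketched with `argparse`. This is an illustrative assumption, not the project's actual parser: the program name `example-cli`, the subcommand name `download`, and the `api` default are hypothetical. Only the option names `--aws_profile` and `--download_method` and the values `s3` and `api` come from the notes above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical program name, used only for this sketch.
    parser = argparse.ArgumentParser(prog="example-cli")
    # Global option: given before the subcommand.
    parser.add_argument("--aws_profile", default=None, help="AWS profile to use")

    subparsers = parser.add_subparsers(dest="command", required=True)
    # Hypothetical subcommand standing in for the real download commands.
    download = subparsers.add_parser("download")
    download.add_argument(
        "--download_method",
        choices=["s3", "api"],
        default="api",  # arbitrary default chosen for this sketch
        help="how Common Crawl is accessed when downloading WARC files",
    )
    return parser
```

Parsing `["--aws_profile", "dev", "download", "--download_method", "s3"]` with this parser yields `aws_profile="dev"` and `download_method="s3"`. Any value other than `s3` or `api` is rejected by `argparse` before a download would start.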