This is a CLI tool to efficiently download a local copy of the pwned password hash data from the very awesome HIBP pwned passwords api-endpoint using all the good bits: multiprocessing, async-processes, local-caching, content-etags and http2-connection pooling, to make things probably as fast as is Pythonly possible.
- Interface to directly query for compromised password values from the compressed file data-store!
- Download and store acquired data in gzip-compressed form to save on storage and speed up queries.
- Download the full dataset in under 45 mins (generally CPU bound)
- Easily resume interrupted download operations into a --data-path without re-clobbering the api-source.
- Only download hash-prefix content blocks when the source content has changed (via content ETag values), making it easy to periodically sync-up when needed.
- Query interface performance is efficient enough to attach a user web-service with reasonable loads (i.e. don't waste your own resources decompressing the dataset and storing it in a database!)
- Ability to generate a single text file with in-order pwned password hash values, similar to PwnedPasswordsDownloader from the awesome HIBP team.
- Per prefix file metadata in JSON format for easy data reuse by other tooling if required.
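To illustrate how a query against a gzip-compressed, per-prefix data-store can work, here is a minimal sketch using only the Python standard library. It relies on the HIBP range format, where each line is the 35-character SHA-1 suffix followed by `:count`. The on-disk layout (one `<prefix>.gz` file per 5-character prefix) and the function names `sha1_hex`, `prefix_file` and `pwned_count` are assumptions made for this example; they are not the tool's actual layout or API.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path


def sha1_hex(password: str) -> str:
    """Upper-case SHA-1 hex digest, as used by the pwned passwords range format."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def prefix_file(data_path: Path, prefix: str) -> Path:
    # Hypothetical layout: one gzip file per 5-character hash prefix.
    return data_path / f"{prefix}.gz"


def pwned_count(data_path: Path, password: str) -> int:
    """Return the recorded count for the password, or 0 if it is not present."""
    digest = sha1_hex(password)
    prefix, suffix = digest[:5], digest[5:]
    path = prefix_file(data_path, prefix)
    if not path.exists():
        return 0
    # Only the single small prefix file is decompressed, never the whole dataset.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            candidate, _, count = line.strip().partition(":")
            if candidate == suffix:
                return int(count)
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = Path(tmp)
        digest = sha1_hex("password")
        # Write a one-line prefix file with a made-up count for the demo.
        with gzip.open(prefix_file(data_path, digest[:5]), "wt", encoding="ascii") as handle:
            handle.write(f"{digest[5:]}:42\n")
        print(pwned_count(data_path, "password"))  # 42
        print(pwned_count(data_path, "correct horse battery staple"))  # 0
```

Because each lookup touches one small file, this style of query avoids both a full decompression and a separate database, which is the point made in the feature list above.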
pipx install hibp-downloader
Sample download activity log; host with 32 cores on 500Mbit/s connection.
...
2024-05-16T10:18:01-0400 | INFO | hibp-downloader | prefix=f80c7 source=[lc:13616 et:3 rc:1002358 ro:25 xx:1] processed=[17836.6MB ~414462H/s] api=[918req/s 17597.4MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f81af source=[lc:13616 et:3 rc:1002558 ro:25 xx:1] processed=[17840.1MB ~414454H/s] api=[918req/s 17600.9MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f826f source=[lc:13616 et:3 rc:1002758 ro:25 xx:1] processed=[17843.6MB ~414454H/s] api=[918req/s 17604.4MB] runtime=36.4min
2024-05-16T10:18:03-0400 | INFO | hibp-downloader | prefix=f833f source=[lc:13616 et:3 rc:1002958 ro:25 xx:1] processed=[17847.1MB ~414450H/s] api=[918req/s 17607.9MB] runtime=36.4min
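If you want to track progress from these log lines in other tooling, the sketch below parses one line into a dictionary. It assumes the log format is exactly as in the sample above; the regular expression, the `parse_line` name and the output keys are all made up for this example. The `percent_complete` value additionally assumes prefixes are processed in roughly ascending order across the 16^5 possible prefixes, so treat it as an estimate.

```python
import re

LINE = (
    "2024-05-16T10:18:03-0400 | INFO | hibp-downloader | prefix=f833f "
    "source=[lc:13616 et:3 rc:1002958 ro:25 xx:1] "
    "processed=[17847.1MB ~414450H/s] api=[918req/s 17607.9MB] runtime=36.4min"
)

# Pattern written against the sample lines only; other log lines will not match.
PATTERN = re.compile(
    r"prefix=(?P<prefix>[0-9a-f]{5}) "
    r"source=\[(?P<source>[^\]]*)\] "
    r"processed=\[(?P<processed_mb>[\d.]+)MB ~(?P<hashes_per_sec>\d+)H/s\] "
    r"api=\[(?P<req_per_sec>\d+)req/s (?P<api_mb>[\d.]+)MB\] "
    r"runtime=(?P<runtime_min>[\d.]+)min"
)

TOTAL_PREFIXES = 16 ** 5  # every 5-character hex prefix


def parse_line(line: str) -> dict:
    """Parse one progress line into typed fields; raise ValueError if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a progress line")
    fields = match.groupdict()
    # "lc:13616 et:3 ..." becomes {"lc": 13616, "et": 3, ...}
    source = {key: int(value) for key, value in (item.split(":") for item in fields["source"].split())}
    return {
        "prefix": fields["prefix"],
        "source": source,
        "processed_mb": float(fields["processed_mb"]),
        "hashes_per_sec": int(fields["hashes_per_sec"]),
        "req_per_sec": int(fields["req_per_sec"]),
        "api_mb": float(fields["api_mb"]),
        "runtime_min": float(fields["runtime_min"]),
        # Estimate only: assumes ascending prefix order.
        "percent_complete": 100 * (int(fields["prefix"], 16) + 1) / TOTAL_PREFIXES,
    }


if __name__ == "__main__":
    parsed = parse_line(LINE)
    print(parsed["prefix"], parsed["source"]["rc"], round(parsed["percent_complete"], 1))
```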
- 918 requests per second to api.pwnedpasswords.com
- Log sources are shorthand:
  - lc: 13616 from local-cache (lc) - request-responses handled locally without hitting the network.
  - et: 3 etag-matched (et) - request-responses that confirmed our local data was up-to-date and did not require a new download.
  - rc: 1002958 from remote-cache (rc) - request-responses that were downloaded to local, but came from the remote-server cache.
  - ro: 25 from remote-origin (ro) - request-responses that were downloaded to local, and the download needed to be fetched from the remote origin source.
  - xx: 1 failed response (xx) - request-responses that failed (and were successfully retried).
- ~17GB downloaded in ~36 minutes (full dataset)
- ~414k hash values received per second
- Processing in this example appears to be CPU bound, measured traffic around ~160 Mbit/s.
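The etag-matched (et) source above corresponds to a standard HTTP conditional request: the client sends the ETag it stored earlier in an `If-None-Match` header, and a `304 Not Modified` response means the local copy is still current. The sketch below shows only that mechanism, using the standard library's `urllib` rather than whatever HTTP client the tool uses internally. The helper names `build_request` and `needs_write` are hypothetical; the range URL is the public pwned passwords endpoint.

```python
from typing import Optional
from urllib.request import Request

API_RANGE_URL = "https://api.pwnedpasswords.com/range/{prefix}"


def build_request(prefix: str, cached_etag: Optional[str] = None) -> Request:
    """Build a range request; add If-None-Match when an ETag is already cached."""
    headers = {}
    if cached_etag:
        headers["If-None-Match"] = cached_etag
    return Request(API_RANGE_URL.format(prefix=prefix), headers=headers)


def needs_write(status: int) -> bool:
    """Decide whether a response body should replace the local prefix file."""
    if status == 304:
        return False  # ETag matched: local data is up to date
    if status == 200:
        return True  # new or changed content: store body and the new ETag
    raise ValueError(f"unexpected HTTP status {status}")


if __name__ == "__main__":
    request = build_request("f833f", cached_etag='"example-etag"')  # made-up ETag value
    print(request.full_url)
    print(request.header_items())
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch stays runnable offline; note that `urlopen` reports a 304 by raising `urllib.error.HTTPError`, so a real caller would need to catch that and read its `code`.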
- Github - github.com/threatpatrols/hibp-downloader
- PyPI - pypi.org/project/hibp-downloader/
- ReadTheDocs - hibp-downloader.readthedocs.io
- Copyright © 2023-2024 Threat Patrols Pty Ltd
- Copyright © 2023-2024 Nicholas de Jong
All rights reserved.
- BSD-3-Clause - see LICENSE file for details.