A modular, async, headless-browser crawler that parallelizes with browser tabs and scales horizontally with a shared task queue. Choose between:
• Redis-backed global queue (recommended for many bots / machines)
• Pure-Python local queue (no external services; great for dev & tests)
• At-least-once delivery with leases & requeue on timeout
• Global de-dupe via seen set (Redis or local)
• Robust workers per browser tab (Playwright)
• Clean modular layout (queues, crawler logic, CLI, storage, extractors)
• Drop-in queue swap via BaseQueue interface
src/
  queues/
    __init__.py
    base.py       # BaseQueue interface (push, push_many, reserve, ack)
    local.py      # LocalQueue: asyncio.Queue + sets, pure Python
    redisq.py     # RedisQueue: LIST + SET + ZSET (leases), optional
  crawler/
    __init__.py
    model.py      # (optional) types/dataclasses for results
    extract.py    # HTML parsing & link/text extraction helpers
    storage.py    # persistence helpers (e.g., save_docs)
    worker.py     # Crawler class (logic only, no CLI)
  cli/
    __init__.py
    crawl.py      # Typer CLI: wires queue + crawler + storage
  utils/
    __init__.py
    settings.py   # configuration (env defaults: REDIS_URL, timeouts)
• Python 3.11+ (recommended 3.12)
• Playwright
• (Optional) Redis 7+
Using uv:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Or with venv/pip:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Playwright browsers
python -m playwright install

Redis (optional):
macOS:
brew install redis
brew services start redis

Linux:
sudo apt-get install redis-server
sudo systemctl enable --now redis-server

Environment variables (with defaults):
export REDIS_URL="redis://localhost:6379"
export HEADLESS="1" # set to "0" for headed browser
export CRAWL_TIMEOUT_MS="50000"
export LEASE_SEC="90"
Values are read in utils/settings.py and can be overridden via CLI flags.
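For reference, a minimal sketch of what utils/settings.py could look like, assuming it simply wraps os.getenv with the defaults above (the real module may expose these differently):

import os

# Defaults mirror the environment variables documented above.
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
HEADLESS = os.getenv("HEADLESS", "1") == "1"            # "0" → headed browser
CRAWL_TIMEOUT_MS = int(os.getenv("CRAWL_TIMEOUT_MS", "50000"))
LEASE_SEC = int(os.getenv("LEASE_SEC", "90"))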
We expose a small Typer CLI in src/cli/crawl.py.
A short command is defined in pyproject.toml:
[project.scripts]
strive-crawl = "src.cli.crawl:app"

Then:
strive-crawl run --start-url https://example.com --use-redis false

• Process level: run multiple crawler processes (even across machines).
• Within a process: N workers (async tasks), each owns one browser tab.
• Work acquisition: workers call queue.reserve(lease_sec) to lease a URL.
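To make that pattern concrete, here is a simplified sketch of a per-process worker pool. It is not the actual Crawler class; the names worker, run_pool, and process details are illustrative, and it assumes Playwright's async API plus the BaseQueue contract shown below.

import asyncio
from playwright.async_api import async_playwright

async def worker(browser, queue, lease_sec: int = 90):
    page = await browser.new_page()           # one tab per worker
    while True:
        url = await queue.reserve(lease_sec)   # lease a URL
        if url is None:
            break                              # simplified: a real worker would poll/retry
        await page.goto(url)
        # ... extract text/links, push new links, persist results ...
        await queue.ack(url)                   # remove from in-flight on success
    await page.close()

async def run_pool(queue, max_tabs: int = 8):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        await asyncio.gather(*(worker(browser, queue) for _ in range(max_tabs)))
        await browser.close()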
BaseQueue defines a minimal contract:
class BaseQueue(ABC):
    async def push(self, item: str) -> bool: ...
    async def push_many(self, items: list[str]) -> int: ...
    async def reserve(self, lease_sec: int = 90) -> str | None: ...
    async def ack(self, item: str) -> bool: ...

Two implementations:
- LocalQueue: in-process asyncio.Queue + Python set/dict. Great for local dev/tests; one process only.
- RedisQueue:
- {ns}:pending → LIST queue (LPUSH/RPOP)
- {ns}:seen → SET for global de-dupe (SADD)
- {ns}:inflight → ZSET for lease deadlines (ZADD with Unix ts)
Lease flow:
- Worker calls reserve(lease_sec).
- Queue requeues expired leases first (visibility timeout).
- Pop one item from pending, set deadline (now + lease_sec) in inflight.
- On success, worker calls ack(item) → remove from inflight.
- If worker crashes or times out → item becomes visible again after deadline.
This provides at-least-once delivery (idempotent page processing recommended).
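As an illustration, a reserve()/ack() pair following the key layout above might look roughly like this. This is a sketch assuming redis.asyncio and the {ns}:pending / {ns}:inflight keys; the real redisq.py may differ (for example, by using Lua scripts to make the requeue-and-pop step atomic).

import time
import redis.asyncio as redis

class RedisQueueSketch:
    def __init__(self, redis_url: str, namespace: str):
        self.r = redis.from_url(redis_url, decode_responses=True)
        self.ns = namespace

    async def reserve(self, lease_sec: int = 90) -> str | None:
        now = time.time()
        # 1) Requeue items whose lease deadline has passed (visibility timeout).
        expired = await self.r.zrangebyscore(f"{self.ns}:inflight", 0, now)
        for item in expired:
            await self.r.zrem(f"{self.ns}:inflight", item)
            await self.r.lpush(f"{self.ns}:pending", item)
        # 2) Pop one item and record its lease deadline.
        item = await self.r.rpop(f"{self.ns}:pending")
        if item is None:
            return None
        await self.r.zadd(f"{self.ns}:inflight", {item: now + lease_sec})
        return item

    async def ack(self, item: str) -> bool:
        # Remove from in-flight; returns False if the lease had already expired.
        return await self.r.zrem(f"{self.ns}:inflight", item) > 0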
• Navigate with Playwright → page.content()
• clean_html() to normalize HTML (your util)
• Parse with BeautifulSoup → text + links
• Same-domain filter → relative resolved with urljoin
• De-dupe locally (avoid repeated work within a single bot run)
• Enqueue newly discovered links with push_many
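A minimal sketch of the parse-and-filter step, assuming BeautifulSoup and a hypothetical helper name extract_links (the actual helpers live in crawler/extract.py and may be structured differently):

from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def extract_links(html: str, base_url: str) -> tuple[str, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    domain = urlparse(base_url).netloc
    links = []
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])        # resolve relative links
        if urlparse(url).netloc == domain:        # same-domain filter
            links.append(url)
    return text, list(dict.fromkeys(links))       # de-dupe, keep order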
• crawler.storage.save_docs(contents, out_dir) writes text to page_*.html
• Swap in other writers (JSONL, CSV, Parquet, vector DB) without touching the crawler core.
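For example, a drop-in JSONL writer could keep the same call shape as save_docs. This is a sketch: it assumes contents maps URL → extracted text, which may not match the exact signature in crawler/storage.py.

import json
from pathlib import Path

def save_docs_jsonl(contents: dict[str, str], out_dir: str) -> None:
    # Write one JSON object per crawled page to pages.jsonl.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with (out / "pages.jsonl").open("w", encoding="utf-8") as f:
        for url, text in contents.items():
            f.write(json.dumps({"url": url, "text": text}, ensure_ascii=False) + "\n")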
import asyncio
from src.crawler.worker import Crawler
from src.queues.local import LocalQueue
# from src.queues.redisq import RedisQueue
async def main():
# Pure-Python queue:
queue = LocalQueue()
# Or Redis:
# from urllib.parse import urlparse
# ns = f"crawl:{urlparse('https://example.com').netloc}"
# queue = RedisQueue(redis_url="redis://localhost:6379", namespace=ns)
crawler = Crawler(
start_url="https://example.com",
queue=queue,
max_tabs=8,
max_urls=300,
lease_sec=90,
seed_queue=True,
)
await crawler.start()
asyncio.run(main())

1. Start Redis (optional but recommended for multi-bot).
2. Use the same namespace (e.g., crawl:example.com) across all bots.
3. Launch N processes (and/or machines). Each bot:
• reserve → crawl → ack
• Pushes newly discovered links → global seen prevents re-adds
Tip: Only one bot needs to seed the start URL; safe to call on all bots (SET prevents duplicates).
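Seeding from every bot is harmless because the seen SET acts as a global gate before anything reaches pending. A standalone sketch of that idea (the function name seed is illustrative; the real push()/push_many() in redisq.py presumably does the equivalent):

import redis.asyncio as redis

async def seed(redis_url: str, namespace: str, start_url: str) -> bool:
    r = redis.from_url(redis_url, decode_responses=True)
    # SADD returns 1 only the first time the URL is seen across all bots,
    # so every bot can call this; duplicate seeds are silently dropped.
    if await r.sadd(f"{namespace}:seen", start_url) == 0:
        return False
    await r.lpush(f"{namespace}:pending", start_url)
    return True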
⸻
• lease_sec: set to (p95 crawl time) + small buffer.
• Max tabs: start with 6–10; raise cautiously (sites throttle).
• Headed mode: set HEADLESS=0 while debugging.
• FIFO vs LIFO: We use LPUSH/RPOP, which yields FIFO order (breadth-first-ish). Switch to LPUSH/LPOP if you want LIFO (depth-first-ish) crawling.
• Namespaces: namespace=f"crawl:{domain}" isolates per-domain queues.
⸻
• “redis.asyncio not available” → install redis>=5 or run --use-redis false.
• Playwright navigation timeouts → increase CRAWL_TIMEOUT_MS or reduce tabs.
• Duplicate pages → expected in at-least-once systems if a worker dies mid-page. De-dupe downstream or make processing idempotent.
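One low-effort way to make page processing idempotent is to derive output names from the URL, so a redelivered page overwrites its earlier copy instead of creating a duplicate. A sketch; hash-based naming is an assumption, not what save_docs currently does.

import hashlib
from pathlib import Path

def save_page_idempotent(url: str, text: str, out_dir: str) -> Path:
    # Same URL → same filename, so a retried page overwrites rather than duplicates.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    path = Path(out_dir) / f"page_{name}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path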
⸻
• Robots.txt & rate limiting
• Domain sharding / frontier prioritization (e.g., BFS/DFS, score-based)
• Structured content extraction (schema-driven)
• Persistent storage adapters (S3, SQLite, Postgres, Parquet)
• Metrics & Prometheus exporter
• Retry policy with DLQ (Redis streams or RMQ dead letters)
⸻
License
MIT (or your preferred license).