Strive Miner — Concurrent Web Crawler

A modular, async, headless-browser crawler that parallelizes with browser tabs and scales horizontally with a shared task queue. Choose between:

•	Redis-backed global queue (recommended for many bots / machines)
•	Pure-Python local queue (no external services; great for dev & tests)

Key features

•	At-least-once delivery with leases & requeue on timeout
•	Global de-dupe via seen set (Redis or local)
•	Robust workers per browser tab (Playwright)
•	Clean modular layout (queues, crawler logic, CLI, storage, extractors)
•	Drop-in queue swap via BaseQueue interface

Project structure

src/
  queues/
    __init__.py
    base.py          # BaseQueue interface (push, push_many, reserve, ack)
    local.py         # LocalQueue: asyncio.Queue + sets, pure Python
    redisq.py        # RedisQueue: LIST + SET + ZSET (leases), optional
  crawler/
    __init__.py
    model.py         # (optional) types/dataclasses for results
    extract.py       # HTML parsing & link/text extraction helpers
    storage.py       # persistence helpers (e.g., save_docs)
    worker.py        # Crawler class (logic only, no CLI)
  cli/
    __init__.py
    crawl.py         # Typer CLI: wires queue + crawler + storage
  utils/
    __init__.py
    settings.py      # configuration (env defaults: REDIS_URL, timeouts)

Installation

Prereqs

•	Python 3.11+ (recommended 3.12)
•	Playwright
•	(Optional) Redis 7+

Using uv (recommended)

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
# Playwright browsers
python -m playwright install

Or with pip

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Playwright browsers
python -m playwright install

Redis (optional)

macOS:

brew install redis
brew services start redis

Linux:

sudo apt-get install redis-server
sudo systemctl enable --now redis-server

Configuration

Environment variables (with defaults):

export REDIS_URL="redis://localhost:6379"
export HEADLESS="1"            # set to "0" for headed browser
export CRAWL_TIMEOUT_MS="50000"
export LEASE_SEC="90"

Values are read in utils/settings.py and can be overridden via CLI flags.
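
A minimal sketch of how utils/settings.py could read these values (the variable names here are assumptions, not the actual contents of settings.py):

import os

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
HEADLESS = os.getenv("HEADLESS", "1") == "1"              # "0" → headed browser
CRAWL_TIMEOUT_MS = int(os.getenv("CRAWL_TIMEOUT_MS", "50000"))
LEASE_SEC = int(os.getenv("LEASE_SEC", "90"))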

CLI

We expose a small Typer CLI in src/cli/crawl.py.

Script entry point (optional)

A short command is defined in pyproject.toml:

[project.scripts]
strive-crawl = "src.cli.crawl:app"

Then:

strive-crawl run --start-url https://example.com --use-redis false
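
For orientation, a minimal sketch of what the Typer app in src/cli/crawl.py could look like (option handling and constructor defaults are assumptions; see the Programmatic usage section below for the Crawler arguments):

import asyncio
import typer

from src.crawler.worker import Crawler
from src.queues.local import LocalQueue

app = typer.Typer()

@app.command()
def run(
    start_url: str = typer.Option(..., help="Seed URL to crawl"),
    max_tabs: int = typer.Option(8, help="Concurrent browser tabs"),
    max_urls: int = typer.Option(300, help="Stop after this many pages"),
):
    # Wire queue + crawler together; the real CLI also switches to RedisQueue via --use-redis.
    queue = LocalQueue()
    crawler = Crawler(
        start_url=start_url,
        queue=queue,
        max_tabs=max_tabs,
        max_urls=max_urls,
        lease_sec=90,
        seed_queue=True,
    )
    asyncio.run(crawler.start())

if __name__ == "__main__":
    app()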

How it works (architecture)

1) Concurrency model

•	Process level: run multiple crawler processes (even across machines).
•	Within a process: N workers (async tasks), each owns one browser tab.
•	Work acquisition: workers call queue.reserve(lease_sec) to lease a URL.
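
Roughly, the per-process worker loop looks like this (a simplified sketch, not the exact code in worker.py):

import asyncio
from playwright.async_api import async_playwright

async def worker(queue, browser, lease_sec: int = 90):
    page = await browser.new_page()           # one tab per worker
    while True:
        url = await queue.reserve(lease_sec)  # lease a URL
        if url is None:
            break                             # queue drained (simplified stop condition)
        await page.goto(url)
        # ... extract text + links, push_many() new links, persist ...
        await queue.ack(url)                  # success: release the lease
    await page.close()

async def crawl(queue, max_tabs: int = 8):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        await asyncio.gather(*(worker(queue, browser) for _ in range(max_tabs)))
        await browser.close()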

2) Queue interface

BaseQueue defines a minimal contract:

from abc import ABC

class BaseQueue(ABC):
    async def push(self, item: str) -> bool: ...
    async def push_many(self, items: list[str]) -> int: ...
    async def reserve(self, lease_sec: int = 90) -> str | None: ...
    async def ack(self, item: str) -> bool: ...

Two implementations:

  • LocalQueue: in-process asyncio.Queue + Python set/dict. Great for local dev/tests; one process only (a simplified sketch follows below).
  • RedisQueue: Redis-backed, built on three keys per namespace:
      • {ns}:pending → LIST used as the work queue (LPUSH/RPOP)
      • {ns}:seen → SET for global de-dupe (SADD)
      • {ns}:inflight → ZSET of lease deadlines (ZADD with Unix timestamps)
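
A LocalQueue along these lines fits in a few lines (a simplified sketch; the real local.py may track in-flight items differently), before we get to the Redis lease flow:

import asyncio

class LocalQueueSketch:
    """In-process queue: if the process dies, everything dies, so leases are moot."""

    def __init__(self):
        self._q: asyncio.Queue[str] = asyncio.Queue()
        self._seen: set[str] = set()

    async def push(self, item: str) -> bool:
        if item in self._seen:
            return False          # per-process de-dupe
        self._seen.add(item)
        await self._q.put(item)
        return True

    async def push_many(self, items: list[str]) -> int:
        return sum([await self.push(i) for i in items])

    async def reserve(self, lease_sec: int = 90) -> str | None:
        try:
            return self._q.get_nowait()
        except asyncio.QueueEmpty:
            return None

    async def ack(self, item: str) -> bool:
        return True               # nothing leased to release locally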

Lease flow:

  1. Worker calls reserve(lease_sec).
  2. Queue requeues expired leases first (visibility timeout).
  3. Pop one item from pending, set deadline (now + lease_sec) in inflight.
  4. On success, worker calls ack(item) → remove from inflight.
  5. If worker crashes or times out → item becomes visible again after deadline.

This provides at-least-once delivery (idempotent page processing recommended).
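
A sketch of how reserve/ack could be implemented with redis.asyncio (key names follow the layout above; the real redisq.py may differ in detail):

import time
import redis.asyncio as redis

class RedisQueueSketch:
    def __init__(self, redis_url: str, namespace: str):
        self.r = redis.from_url(redis_url, decode_responses=True)
        self.ns = namespace

    async def reserve(self, lease_sec: int = 90) -> str | None:
        now = time.time()
        # Steps 1-2: requeue items whose lease deadline has passed (visibility timeout).
        for item in await self.r.zrangebyscore(f"{self.ns}:inflight", "-inf", now):
            await self.r.zrem(f"{self.ns}:inflight", item)
            await self.r.lpush(f"{self.ns}:pending", item)
        # Step 3: pop one item and record its new lease deadline.
        item = await self.r.rpop(f"{self.ns}:pending")
        if item is not None:
            await self.r.zadd(f"{self.ns}:inflight", {item: now + lease_sec})
        return item

    async def ack(self, item: str) -> bool:
        # Step 4: successful processing, drop the lease so the item is never requeued.
        return await self.r.zrem(f"{self.ns}:inflight", item) > 0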

3) Extraction pipeline

•	Navigate with Playwright → page.content()
•	clean_html() to normalize the raw HTML (project helper)
•	Parse with BeautifulSoup → text + links
•	Same-domain filter → relative resolved with urljoin
•	De-dupe locally (avoid repeated work within a single bot run)
•	Enqueue newly discovered links with push_many
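
The extraction step could look roughly like this (function name and return shape are assumptions; see crawler/extract.py for the real helpers):

from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def extract_text_and_links(html: str, base_url: str) -> tuple[str, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)

    base_domain = urlparse(base_url).netloc
    links = []
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])        # resolve relative links
        if urlparse(url).netloc == base_domain:   # same-domain filter
            links.append(url.split("#")[0])       # drop fragments
    return text, list(dict.fromkeys(links))       # local de-dupe, keep order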

4) Persistence

•	crawler.storage.save_docs(contents, out_dir) writes text to page_*.html
•	Swap in other writers (JSONL, CSV, Parquet, vector DB) without touching the crawler core.
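
For example, a drop-in JSONL writer might look like this (a sketch; it assumes contents maps URL → extracted text, which may not match the real save_docs signature):

import json
from pathlib import Path

def save_docs_jsonl(contents: dict[str, str], out_path: str) -> None:
    # One JSON object per line: {"url": ..., "text": ...}
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for url, text in contents.items():
            f.write(json.dumps({"url": url, "text": text}, ensure_ascii=False) + "\n")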

Programmatic usage

import asyncio
from src.crawler.worker import Crawler
from src.queues.local import LocalQueue
# from src.queues.redisq import RedisQueue

async def main():
    # Pure-Python queue:
    queue = LocalQueue()

    # Or Redis:
    # from urllib.parse import urlparse
    # ns = f"crawl:{urlparse('https://example.com').netloc}"
    # queue = RedisQueue(redis_url="redis://localhost:6379", namespace=ns)

    crawler = Crawler(
        start_url="https://example.com",
        queue=queue,
        max_tabs=8,
        max_urls=300,
        lease_sec=90,
        seed_queue=True,
    )
    await crawler.start()

asyncio.run(main())

Running many bots in parallel

1.	Start Redis (optional but recommended for multi-bot).
2.	Use the same namespace (e.g., crawl:example.com) across all bots.
3.	Launch N processes (and/or machines). Each bot:
•	reserves → crawls → acks
•	pushes newly discovered links (the global seen set prevents re-adds)

Tip: Only one bot needs to seed the start URL, but it is safe to seed from all bots (the seen SET prevents duplicates).

Tuning & tips

•	lease_sec: set to (p95 crawl time) + small buffer.
•	max_tabs: start with 6–10; raise cautiously (sites throttle).
•	Headed mode: set HEADLESS=0 while debugging.
•	FIFO vs LIFO: We use LPUSH/RPOP, which is FIFO (breadth-first-ish). Switch to LPUSH/LPOP if you want LIFO (depth-first-ish) ordering.
•	Namespaces: namespace=f"crawl:{domain}" isolates per-domain queues.

Troubleshooting

•	“redis.asyncio not available” → install redis>=5 or run --use-redis false.
•	Playwright navigation timeouts → increase CRAWL_TIMEOUT_MS or reduce tabs.
•	Duplicate pages → expected in at-least-once systems if a worker dies mid-page. De-dupe downstream or make processing idempotent.

Roadmap

•	Robots.txt & rate limiting
•	Domain sharding / frontier prioritization (e.g., BFS/DFS, score-based)
•	Structured content extraction (schema-driven)
•	Persistent storage adapters (S3, SQLite, Postgres, Parquet)
•	Metrics & Prometheus exporter
•	Retry policy with DLQ (Redis streams or RMQ dead letters)

License

MIT (or your preferred license).
