Supacrawler's ultralight engine for scraping and crawling the web. Written in Go for maximum performance and concurrency. The open-source engine powering Supacrawler.com.
A standalone HTTP service for scraping, mapping, crawling, and screenshots. It runs a web API with a background worker (Redis + Asynq). Routes match the existing Supacrawler SDKs under /v1.
Why open source? We believe powerful web scraping technology should be accessible to everyone. Whether you're a solo developer, a startup, or an enterprise, you shouldn't have to choose between quality and affordability. Read our open source announcement →
Option A: Docker Compose
curl -O https://raw.githubusercontent.com/supacrawler/supacrawler/main/docker-compose.yml
docker compose up
Option B: Manual Docker
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run --rm -p 8081:8081 \
  -e REDIS_ADDR=host.docker.internal:6379 \
  ghcr.io/supacrawler/supacrawler:latest
For advanced users who prefer native binaries:
- Download from releases page
- Install dependencies: Redis + Node.js + Playwright v1.49.1
- Run: ./supacrawler --redis-addr=127.0.0.1:6379
Note: Docker is recommended for easier setup. See complete local development guide →
Dependencies:
- Redis - for job queuing and background processing
- Playwright - for JavaScript rendering and screenshots
# 1. Make sure Redis is running
brew services start redis
# OR: docker run -d --name redis -p 6379:6379 redis:7-alpine
# 2. Start Supacrawler
supacrawler --redis-addr=127.0.0.1:6379
What you'll see:
🕷️ Supacrawler Engine
├─ Server: http://127.0.0.1:8081
├─ Health: http://127.0.0.1:8081/v1/health  
└─ API Docs: http://127.0.0.1:8081/docs
# Health check
curl http://localhost:8081/v1/health
# Scrape a webpage
curl "http://localhost:8081/v1/scrape?url=https://example.com&format=markdown"
# Take a screenshot
curl -X POST http://localhost:8081/v1/screenshots \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://example.com","full_page":true}'This is supacrawler's core functionality - modern web scraping requires JS rendering.
One-line install handles this automatically. For manual installs:
# Install Playwright (requires Node.js)
npm install -g playwright
playwright install chromium --with-deps
Without Playwright:
- ❌ Screenshots fail completely
- ❌ SPAs return empty content
With Docker: Everything works out of the box (Playwright included).
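For manual installs, a quick sanity check that the toolchain is in place (the playwright CLI is available after the global npm install above):
# Verify Node.js and the Playwright CLI are on your PATH
node --version
playwright --version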
Learn more about JavaScript rendering →
You can configure Supacrawler using environment variables or a .env file. Copy .env.example to .env and modify as needed.
- HTTP_ADDR - Server address (default: :8081)
- REDIS_ADDR - Redis address (default: 127.0.0.1:6379)
- DATA_DIR - Data directory (default: ./data)
- REDIS_PASSWORD - Redis password (if required)
- SUPABASE_URL - Supabase project URL (for cloud storage)
- SUPABASE_SERVICE_KEY - Supabase service key
- SUPABASE_STORAGE_BUCKET - Storage bucket name (default: screenshots)
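As an example, a minimal .env might look like this (values are illustrative; the Supabase entries are only needed for cloud storage):
# .env (example values)
HTTP_ADDR=:8081
REDIS_ADDR=127.0.0.1:6379
DATA_DIR=./data
# Optional: upload screenshots to Supabase storage
# SUPABASE_URL=http://127.0.0.1:64321
# SUPABASE_SERVICE_KEY=<service_key>
# SUPABASE_STORAGE_BUCKET=screenshots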
New to Supacrawler? Read our comprehensive development guide → or browse tutorials →
git clone https://github.com/supacrawler/supacrawler.git
cd supacrawler
# Copy environment template
cp .env.example .env
# Edit .env with your configuration
# Set environment variables (or use .env file)
export REDIS_ADDR=127.0.0.1:6379
export HTTP_ADDR=:8081
export DATA_DIR=./data
# Optional: enable Supabase storage upload/sign
export SUPABASE_URL=http://127.0.0.1:64321
export SUPABASE_SERVICE_KEY=<service_key>
export SUPABASE_STORAGE_BUCKET=screenshots
# Ensure Redis is running
brew services start redis
# OR: docker run -d --name redis -p 6379:6379 redis:7-alpine
# Run the server
go mod tidy
go run ./cmd/main.go
# Install Air for hot reloading
go install github.com/air-verse/air@latest
# Set environment variables (same as above)
export REDIS_ADDR=127.0.0.1:6379
export HTTP_ADDR=:8081
export DATA_DIR=./data
# Run with hot reload
air
For the best development experience with automatic code reloading:
# Start all services with hot reload enabled
docker compose -f docker-compose.dev.yml up --build
# Or run in detached mode
docker compose -f docker-compose.dev.yml up --build -d
# View logs
docker compose -f docker-compose.dev.yml logs -f supacrawler-dev
# Stop services
docker compose -f docker-compose.dev.yml down
What you get:
- ✅ Automatic code reloading on file changes (via Air)
- ✅ Source code mounted as volumes
- ✅ Redis included and configured
- ✅ No need to rebuild on code changes
How it works: The docker-compose.dev.yml uses Dockerfile.dev which includes Air for hot reloading. Your local source code is mounted into the container, so any changes you make are immediately detected and the server automatically restarts.
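To see it in action, tail the dev service logs and then save a change to any Go file (cmd/main.go is just an example; Air watches the whole mounted source tree):
# Tail the dev service logs in one terminal
docker compose -f docker-compose.dev.yml logs -f supacrawler-dev
# In another terminal, edit and save any Go file (e.g. cmd/main.go);
# Air rebuilds and restarts the server. Confirm it came back up:
curl http://localhost:8081/v1/health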
For manual Docker builds without hot reload:
# Start Redis
docker run -d --name redis -p 6379:6379 redis:7-alpine
# Build and run scraper
docker build -t supacrawler:dev .
docker run --rm \
  -p 8081:8081 \
  -e REDIS_ADDR=host.docker.internal:6379 \
  -e HTTP_ADDR=":8081" \
  -e DATA_DIR="/app/data" \
  -e SUPABASE_URL="http://host.docker.internal:64321" \
  -e SUPABASE_SERVICE_KEY="<service_key>" \
  -e SUPABASE_STORAGE_BUCKET="screenshots" \
  -v "$(pwd)/data:/app/data" \
  --name supacrawler \
  supacrawler:dev
# Docker setup
./scripts/run.sh
# Hot reload setup
./scripts/run.sh --reload
Base URL: http://localhost:8081/v1
Complete API documentation: docs.supacrawler.com
curl -s http://localhost:8081/internal/health
# Scrape page (markdown format, links always included)
curl -s "http://localhost:8081/v1/scrape?url=https://supacrawler.com"# Create crawl job
curl -s -X POST http://localhost:8081/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://supacrawler.com",
    "type": "crawl",
    "format": "markdown",
    "depth": 2,
    "link_limit": 20,
    "include_subdomains": true,
    "include_html": false
  }'
# Get job status
curl -s http://localhost:8081/v1/crawl/<job_id>
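Crawl jobs run in the background, so in practice you poll the status endpoint until the job finishes. A minimal polling sketch, assuming the job response includes a status field with a completed value (check docs.supacrawler.com for the exact schema) and that jq is installed:
JOB_ID=<job_id>  # replace with the id returned when the job was created
while true; do
  # NOTE: the "status" field and "completed" value are assumptions; see the API docs
  STATUS=$(curl -s "http://localhost:8081/v1/crawl/$JOB_ID" | jq -r '.status')
  echo "status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  sleep 2
done
# Create screenshot job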
curl -s -X POST http://localhost:8081/v1/screenshots \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://supacrawler.com",
    "full_page": true,
    "format": "png",
    "width": 1366,
    "height": 768
  }'
# Get screenshot
curl -s "http://localhost:8081/v1/screenshots?job_id=<job_id>"
# Synchronous screenshot (stream to file)
curl -s -X POST http://localhost:8081/v1/screenshots \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://supacrawler.com","full_page":true,"format":"png","stream":true}' \
  --output example.png
- If SUPABASE_URL and SUPABASE_SERVICE_KEY are set, images are uploaded to SUPABASE_STORAGE_BUCKET and a signed URL is returned.
- Otherwise, files are written under DATA_DIR/screenshots and served via /files/screenshots/<name>.
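With local storage, for example, you can fetch a saved screenshot straight from the files route (substitute the actual filename reported by the job):
# Download a locally stored screenshot
curl -s "http://localhost:8081/files/screenshots/<name>" --output screenshot.png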
Use the official SDKs to integrate with your applications:
import { SupacrawlerClient } from '@supacrawler/js'
const client = new SupacrawlerClient({ 
  apiKey: 'anything', 
  baseUrl: 'http://localhost:8081/v1' 
})
const result = await client.scrape({ 
  url: 'https://supacrawler.com', 
  format: 'markdown' 
})
from supacrawler import SupacrawlerClient
client = SupacrawlerClient(
  api_key='anything', 
  base_url='http://localhost:8081/v1'
)
result = client.scrape({ 
  'url': 'https://supacrawler.com', 
  'format': 'markdown' 
})
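To install the SDKs first (package names inferred from the imports above; verify against the SDK repos):
# Install the JS and Python SDKs (names assumed from the imports above)
npm install @supacrawler/js
pip install supacrawler
Tutorials & Guides: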
- HTTP_ADDR - Server address (default: :8081)
- REDIS_ADDR - Redis address (default: 127.0.0.1:6379)
- REDIS_PASSWORD - Redis password (optional)
- DATA_DIR - Data directory (default: ./data)
- SUPABASE_URL - Supabase project URL (optional)
- SUPABASE_SERVICE_KEY - Supabase service key (optional)
- SUPABASE_STORAGE_BUCKET - Supabase storage bucket name (optional)
We welcome contributions! Please see our development setup above to get started.
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and test locally
- Submit a pull request
Community Resources:
- Contributing guidelines
- Development blog posts with technical deep dives
- Issue tracker for bugs and features
- Discussions for questions and ideas
Licensed under the Apache License 2.0. See LICENSE for details.