A proxy load-balancing server that helps web crawlers use a proxy pool more effectively, trying to solve the following problems:
- Free proxies usually have a low success rate
- Paid proxies have uncertain expiration times and are hard to make full use of
- Invalid proxies keep being used over and over
Note: proxy_tower itself does not hunt for proxies; you supply your own proxy sources (see below)
- Multiple forwarding
- Forward a request to multiple proxies
- Return the fastest valid response

Multiple forwarding increases the success rate when using free or unstable proxies, as the sketch below illustrates.
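To make the idea concrete, here is a minimal sketch (not proxy_tower's actual code; `fetch_via_proxy` and `multi_forward` are hypothetical names) of racing one request across several proxies with asyncio and aiohttp:

```python
import asyncio
import aiohttp

async def fetch_via_proxy(session, url, proxy):
    # Fetch url through one proxy; any error marks this attempt as failed.
    async with session.get(url, proxy=proxy,
                           timeout=aiohttp.ClientTimeout(total=10)) as resp:
        return await resp.text()

async def multi_forward(url, proxies):
    # Send the same request through every proxy concurrently and
    # return the first response that arrives without an error.
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch_via_proxy(session, url, p))
                 for p in proxies]
        try:
            for future in asyncio.as_completed(tasks):
                try:
                    return await future   # fastest valid response wins
                except Exception:
                    continue              # this proxy failed, try the next
            raise RuntimeError('all proxies failed')
        finally:
            for task in tasks:
                task.cancel()             # drop the slower attempts
```

proxy_tower additionally verifies each response (see below) before treating it as valid.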
- Response verification
- A pattern identifies pages of the target site that share a URL prefix and a similar HTML structure, e.g. `movie.douban.com/subject/` for https://movie.douban.com/subject/6981153/
- Patterns and their verification rules are stored in a prefix tree, which makes verifying responses from different sites easy and efficient (see the sketch after this list)
- Separate proxy pools for different patterns
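To illustrate the prefix-tree lookup (a hypothetical sketch, not proxy_tower's internal data structure), a character-level trie can map any URL to the rule of its longest matching pattern:

```python
class PatternTrie:
    """Character-level prefix tree mapping URL patterns to verification rules."""

    def __init__(self):
        self.children = {}
        self.rule = None  # a rule is attached where a pattern ends

    def insert(self, pattern, rule):
        node = self
        for ch in pattern:
            node = node.children.setdefault(ch, PatternTrie())
        node.rule = rule

    def longest_match(self, url):
        # Walk the URL and remember the rule of the longest pattern passed.
        node, best = self, None
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                break
            if node.rule is not None:
                best = node.rule
        return best

trie = PatternTrie()
trie.insert('movie.douban.com/subject/', {'rule': 'whitelist', 'value': 'ratingValue'})
print(trie.longest_match('movie.douban.com/subject/6981153/'))  # prints the rule above
```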
- Python >= 3.6
- Redis server
```bash
pip install -r requirements.txt
python proxy_entrance.py
curl -x "http://0.0.0.0:8893" "http://www.httpbin.org/ip"
```
```python
global_blacklist = [
    'antispider',
    'forbidden'
]  # global keyword blacklist: a response containing any of these words is considered invalid
```
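In effect the check amounts to something like this hypothetical helper (`hits_blacklist` is not a proxy_tower function, just an illustration of the rule above):

```python
def hits_blacklist(body, blacklist=tuple(global_blacklist)):
    # A response is invalid if its body contains any blacklisted keyword.
    return any(word in body for word in blacklist)
```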
```python
# proxy_tower relies heavily on Redis, which stores the proxies and verification rules
from os import getenv

redis_host = getenv('redis_host', 'redis')
redis_port = getenv('redis_port', 6379)
redis_db = getenv('redis_db', 0)
redis_password = getenv('redis_password', '')
redis_addr = 'redis://{}:{}/{}'.format(redis_host, redis_port, redis_db)
```
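For reference, a URL in this form can be consumed by a standard Redis client (an illustrative snippet; the password is passed separately because the URL above does not embed it):

```python
import redis

r = redis.StrictRedis.from_url(redis_addr, password=redis_password or None)
r.ping()  # raises an exception if the Redis server is unreachable
```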
```bash
docker pull worldwonderer/proxy_tower
docker run --env redis_host=<redis-ip> --env redis_port=<6379> --env redis_password=<foobared> -p 8893:8893 worldwonderer/proxy_tower
```
Currently, 2 kinds of verification rules are supported:

- whitelist: if the response contains the specified keywords, it is considered valid
- xpath: if the xpath expression can extract the specified value from the response, it is considered valid
```python
import json
import redis

r = redis.StrictRedis()
# whitelist
r.hset("response_check_pattern", "movie.douban.com/subject/",
       json.dumps({'rule': 'whitelist', 'value': 'ratingValue'}))
# xpath
r.hset("response_check_pattern", "movie.douban.com/subject/",
       json.dumps({'rule': '//*[@id="recommendations"]/h2/i', 'value': '喜欢这部电影的人也喜欢'}))
```
After configuring the verification rule for the pattern `movie.douban.com/subject/`, whenever you crawl pages like https://movie.douban.com/subject/27119724/, proxy_tower verifies the response content and scores the proxy accordingly.
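A rough sketch of how such a rule might be applied (hypothetical code, assuming the xpath variant stores the expression under 'rule' and the expected text under 'value', as in the example above):

```python
from lxml import html

def check_response(body, rule):
    # whitelist: the expected keyword must appear anywhere in the body
    if rule['rule'] == 'whitelist':
        return rule['value'] in body
    # xpath: the expression must select elements whose text contains
    # the expected value (assumes the expression selects elements)
    tree = html.fromstring(body)
    return any(rule['value'] in (node.text or '')
               for node in tree.xpath(rule['rule']))
```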
You can add proxy sources in models/proxy.py, loading proxies either from a file or from an API:
```python
# File
class ProxyFile(ProxySource):

    def __init__(self, tag, file_path):
        self.file_path = file_path
        self.tag = tag

    async def fetch_proxies(self):
        with open(self.file_path, 'r') as f:
            proxy_candidates = re.findall(self.proxy_pattern, f.read())
            for proxy in proxy_candidates:
                yield Proxy.parse(proxy, tag=self.tag, support_https=True, paid=False)


# API
class ProxyApi(ProxySource):

    def __init__(self, tag, api, valid_time):
        self.api = api
        self.tag = tag
        self.valid_time = valid_time

    async def fetch_proxies(self):
        r = await crawl("GET", self.api)
        text = await r.text()
        proxy_candidates = re.findall(self.proxy_pattern, text)
        for proxy in proxy_candidates:
            yield Proxy.parse(proxy, tag=self.tag, valid_time=self.valid_time, paid=False)
```
Proxies from different sources have their own properties; you can tag them and initialize these properties at the very beginning:
- valid_time
- support_https
- paid
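For example, a paid API source might initialize its proxies like this (a hypothetical call; the candidate string and property values are made up for illustration):

```python
proxy = Proxy.parse('1.2.3.4:8080', tag='paid_api',
                    valid_time=60,       # seconds until the proxy expires
                    support_https=True,  # eligible for Need-Https requests
                    paid=True)
```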
- Display all proxies and their info
- View, modify or add patterns
- A line chart of each pattern's success rate
For sites that require HTTPS, put `'Need-Https': 'yes'` in the request headers, and proxy_tower will pick proxies with the support_https tag.
Note: do not use https in the URL, e.g. use http://www.bilibili.com instead of https://www.bilibili.com.
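For instance, with Python's requests library (an illustrative snippet, assuming proxy_tower is listening on the quick-start address):

```python
import requests

resp = requests.get(
    'http://www.bilibili.com',                # keep plain http in the URL
    proxies={'http': 'http://0.0.0.0:8893'},  # route through proxy_tower
    headers={'Need-Https': 'yes'},            # ask for an https-capable proxy
)
print(resp.status_code)
```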
- Test
- Support conditional expressions in verification rules