The Dysdera Web Crawler is a Python-based asynchronous web crawler designed for extensibility and adaptability. Engineered to handle the often non-standard nature of the web, it provides fine-grained control over crawling policies.
This is not a finished project; feel free to collaborate.
- Python version >= 3.9
- MongoDB for saving data (Community Edition is free), or alternatively a JSON file
- some Python packages, to be installed with pip via the command `pip install -r requirements.txt`:
  - motor to interact with MongoDB,
  - json and aiofiles for saving to a JSON file,
  - asyncio for the asynchronous logic,
  - aiohttp for HTTP handling,
  - lxml for HTML parsing,
  - brotli for HTTP response decompression,
  - pytz and python-dateutil for more precise datetime management,
  - chardet for encoding detection
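The stack above supports the crawler's core asynchronous fetch-and-parse loop. As a rough illustration of that pattern only (not the project's actual API: the `fetch` function below is a stand-in for an aiohttp request, and the stdlib `html.parser` stands in for lxml), a minimal standard-library sketch:

```python
import asyncio
from html.parser import HTMLParser

# Stand-in for HTTP responses; a real crawler would fetch these with aiohttp.
FAKE_PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a><a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a>',
    "http://example.com/b": "",
}

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags (lxml plays this role in Dysdera)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

async def fetch(url):
    await asyncio.sleep(0)  # yield control, as a real network await would
    return FAKE_PAGES.get(url, "")

async def crawl(seed, max_pages=10):
    seen, frontier = set(), [seed]
    while frontier and len(seen) < max_pages:
        batch = [u for u in frontier if u not in seen]
        frontier = []
        seen.update(batch)
        # Fetch the whole batch concurrently: the core asyncio pattern.
        for html in await asyncio.gather(*(fetch(u) for u in batch)):
            parser = LinkExtractor()
            parser.feed(html)
            frontier.extend(parser.links)
    return seen

pages = asyncio.run(crawl("http://example.com/"))
```

Running this crawls the three fake pages once each, however many times they link to one another, because the `seen` set deduplicates the frontier.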
You will find the documentation here.
If you have something to add, do it!
- dysderacrawler.py contains the core logic of the crawler,
- extractors.py contains the logic of the extractors,
- policy.py contains the structure of the crawler policy,
- selectionpolicy.py provides some selection policies,
- web.py contains the logic for managing webpages and more,
- parser.py contains the necessary parsers,
- logger.py contains the logging logic
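policy.py and selectionpolicy.py govern which discovered URLs the crawler actually follows. The project's real interfaces may differ; purely as a sketch of how composable selection policies can be structured (all class names below are hypothetical, not the modules' API):

```python
from urllib.parse import urlparse

class SelectionPolicy:
    """Base interface: decide whether a discovered URL should be crawled."""
    def select(self, url: str) -> bool:
        return True

class SameDomainPolicy(SelectionPolicy):
    """Only follow links that stay on the seed's domain."""
    def __init__(self, seed: str):
        self.domain = urlparse(seed).netloc

    def select(self, url: str) -> bool:
        return urlparse(url).netloc == self.domain

class SchemePolicy(SelectionPolicy):
    """Only follow http(s) links, skipping mailto:, ftp:, etc."""
    def select(self, url: str) -> bool:
        return urlparse(url).scheme in ("http", "https")

class AndPolicy(SelectionPolicy):
    """Combine policies: a URL is selected only if every sub-policy agrees."""
    def __init__(self, *policies: SelectionPolicy):
        self.policies = policies

    def select(self, url: str) -> bool:
        return all(p.select(url) for p in self.policies)

policy = AndPolicy(SameDomainPolicy("https://example.com/"), SchemePolicy())
keep = policy.select("https://example.com/docs")  # same domain, https
skip = policy.select("https://other.org/page")    # different domain
```

Composing small single-purpose policies this way keeps each rule independently testable and lets the crawler swap selection strategies without touching the fetch loop.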