P4o1o/Dysdera

Dysdera Web Crawler

An asynchronous web crawler written in Python

The Dysdera Web Crawler is a Python-based asynchronous web crawler designed for extensibility and adaptability. Engineered to handle the often non-standard nature of the web, it provides fine-grained control over crawling policies.

This is not a finished project; feel free to collaborate.

Dependencies:

  • Python version >= 3.9
  • MongoDB for saving data (the Community edition is free), or alternatively a JSON file
  • some Python packages, installed with pip via the command `pip install -r requirements.txt`:
    • motor to interact with MongoDB
    • json and aiofiles for saving to a JSON file
    • asyncio for the asynchronous logic
    • aiohttp for HTTP handling
    • lxml for HTML parsing
    • brotli for HTTP response compression
    • pytz and python-dateutil for more precise datetime management
    • chardet for encoding detection
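As a rough illustration of the asynchronous crawling pattern these packages support (an asyncio frontier queue drained by concurrent workers), here is a minimal self-contained sketch. The fetch step is stubbed out with a fake link graph so no network or third-party package is needed; a real crawler would fetch pages with aiohttp instead.

```python
import asyncio

# Fake link graph standing in for the web, so the sketch runs offline.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}

async def fetch_links(url: str) -> list[str]:
    await asyncio.sleep(0)  # stand-in for a real aiohttp request
    return FAKE_WEB.get(url, [])

async def crawl(seed: str, workers: int = 3) -> set[str]:
    frontier: asyncio.Queue[str] = asyncio.Queue()
    visited: set[str] = set()
    await frontier.put(seed)
    visited.add(seed)

    async def worker() -> None:
        while True:
            url = await frontier.get()
            try:
                # Enqueue every newly discovered link exactly once.
                for link in await fetch_links(url):
                    if link not in visited:
                        visited.add(link)
                        await frontier.put(link)
            finally:
                frontier.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await frontier.join()  # wait until the frontier is fully drained
    for task in tasks:
        task.cancel()
    return visited

if __name__ == "__main__":
    print(sorted(asyncio.run(crawl("https://example.com/"))))
```

This is only a sketch of the general technique, not Dysdera's actual implementation.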

How to use it?

You will find the documentation here

Disclaimer

This is not a finished project. If you have something to add, do it!

File structure:

  • dysderacrawler.py contains the logic of the crawler,

  • extractors.py contains the logic of the extractors,

  • policy.py contains the structure of the crawler policy,

  • selectionpolicy.py contains some selection policies,

  • web.py contains the logic for managing webpages and more,

  • parser.py contains the necessary parsers,

  • logger.py contains the logic for the logs
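To give a flavor of what a selection policy (in the spirit of policy.py and selectionpolicy.py) might look like, here is a hypothetical example using only the standard library. The class name and `select` interface are invented for this illustration and are not the project's actual API:

```python
from urllib.parse import urlparse

class SameDomainPolicy:
    """Hypothetical selection policy: keep only URLs on the seed's domain."""

    def __init__(self, seed: str):
        self.domain = urlparse(seed).netloc

    def select(self, url: str) -> bool:
        # Accept a candidate URL only if its host matches the seed's host.
        return urlparse(url).netloc == self.domain

policy = SameDomainPolicy("https://example.com/start")
print(policy.select("https://example.com/page"))  # True
print(policy.select("https://other.org/page"))    # False
```

A policy object like this lets the crawler decide, link by link, whether a discovered URL should enter the frontier, which is the kind of fine-grained control over crawling described above.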