that crawls, scrapes, and indexes data and stores it in a database.
The program is written in Python. It uses regular expressions to parse HTML and multithreading to speed up crawling.
The database layer is provided by MongoDB.
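As an illustration of regex-based HTML parsing, here is a minimal sketch of link extraction. The pattern `HREF_RE` and the function `extract_links` are hypothetical names chosen for this example, not taken from the project, and the project's actual expressions may differ.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the quoted href value of an <a> tag.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute http(s) URLs from anchor tags, in order, without duplicates."""
    links = []
    for href in HREF_RE.findall(html):
        # Resolve relative links against the URL of the page being parsed.
        url = urljoin(base_url, href)
        # Skip non-web schemes such as mailto: and links already collected.
        if url.startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links
```

A regular expression like this is fast and has no dependencies, but it can miss unquoted attributes or malformed markup. A full HTML parser would be more robust at the cost of speed.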
The project contains 4 files:
PersonnalParser.py:
- Contains the PersonnalParser class, which fetches the HTML content of a page, parses it, stores it, and starts a new PersonnalParser thread for each link found in the page.
DBManager.py:
- Contains the DBManager class, which handles the connection to the database and the insert and find operations.
fill_database.py:
- Contains the general settings, such as the start URL, the proxy settings, and the search depth. The first crawl thread starts here.
main.py:
- Contains the code that reads the user's search query, retrieves the database content, and sorts the results by relevance.
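The crawl flow described above (one thread per link, bounded by a search depth) can be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: `CrawlThread`, `crawl`, and the injected `fetch` callable are hypothetical names, and a lock-protected dictionary stands in for the MongoDB storage.

```python
import re
import threading

# Simplified, hypothetical link pattern for this sketch.
LINK_RE = re.compile(r'href="([^"]+)"')


class CrawlThread(threading.Thread):
    """Hypothetical stand-in for a per-page parser thread."""

    def __init__(self, url, depth, fetch, store, lock):
        super().__init__()
        self.url, self.depth = url, depth
        self.fetch, self.store, self.lock = fetch, store, lock

    def run(self):
        # Claim the URL under the lock so no page is crawled twice.
        with self.lock:
            if self.url in self.store:
                return
            self.store[self.url] = None
        html = self.fetch(self.url)
        with self.lock:
            self.store[self.url] = html
        # Stop following links once the depth budget is used up.
        if self.depth == 0:
            return
        children = [
            CrawlThread(link, self.depth - 1, self.fetch, self.store, self.lock)
            for link in LINK_RE.findall(html)
        ]
        for child in children:
            child.start()
        for child in children:
            child.join()


def crawl(start_url, max_depth, fetch):
    """Crawl from start_url and return a dict mapping each visited URL to its HTML."""
    store, lock = {}, threading.Lock()
    root = CrawlThread(start_url, max_depth, fetch, store, lock)
    root.start()
    root.join()
    return store
```

Passing `fetch` as a parameter keeps the sketch runnable without network access. In a real crawl it would wrap an HTTP request, possibly through a proxy. Starting one thread per link can create a very large number of threads on link-heavy pages, so a bounded thread pool may be a safer choice.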