Welcome to the Colorado Crawler Wikipedia project, a tool for extracting and analyzing web data from Wikipedia pages related to Colorado. The scraped text can be used to build corpora for training NLP models on the vocabulary you need.
- **Web Scraping:** Use `wikipedia_scraper.py` to crawl web pages and gather the data you need.
- **Easy Setup:** Quick installation with a simple `pip install seleniumbase`.
- Clone the repository and set up the environment:

  ```shell
  git clone https://github.com/LukeFarch/COCrawlerWiki.git
  cd COCrawlerWiki
  ```
- Adjust the paths in the code as necessary to match your environment.
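The paths to adjust usually sit near the top of the scripts as simple constants. A hypothetical sketch of what such a configuration block might look like (the names `BASE_DIR` and `OUTPUT_DIR` are illustrative; the project's scripts may use different variables):

```python
from pathlib import Path

# Hypothetical configuration block; adjust these to your environment.
# The real scripts may name or locate these differently.
BASE_DIR = Path(__file__).resolve().parent
OUTPUT_DIR = BASE_DIR / "scraped_pages"  # where crawled page text is written

# Create the output directory up front so later writes cannot fail.
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
```

Using `pathlib` keeps the paths portable between Windows and Unix machines.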
- To start crawling Wikipedia for Colorado-related pages, run:

  ```shell
  python wikipedia_crawler.py
  ```
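A crawler like this typically starts from a seed page and follows only links that look Colorado-related. A minimal, standard-library-only sketch of that filtering step (the real script drives a browser via seleniumbase and may filter differently; `extract_colorado_links` is an illustrative name, not the project's API):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_colorado_links(html: str) -> list[str]:
    """Keep only internal wiki links mentioning Colorado (illustrative heuristic)."""
    parser = LinkCollector()
    parser.feed(html)
    return [h for h in parser.links if h.startswith("/wiki/") and "Colorado" in h]


sample = '<a href="/wiki/Denver,_Colorado">Denver</a> <a href="/wiki/Kansas">Kansas</a>'
print(extract_colorado_links(sample))  # ['/wiki/Denver,_Colorado']
```

The heuristic is deliberately simple; a real crawl would also deduplicate visited URLs and respect rate limits.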
- Want a word count? Run `word_count.py` to see how many scraped files contain fewer than 10 words (i.e., failed scrapes). Adjust the threshold as needed.
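The failed-scrape check can be sketched with the standard library alone. This is a hedged illustration, not the project's actual `word_count.py`; the 10-word threshold mirrors the one mentioned above:

```python
import tempfile
from pathlib import Path

WORD_THRESHOLD = 10  # files below this word count are treated as failed scrapes


def count_short_files(directory: Path, threshold: int = WORD_THRESHOLD) -> int:
    """Count .txt files in `directory` whose word count falls below `threshold`."""
    short = 0
    for path in directory.glob("*.txt"):
        words = path.read_text(encoding="utf-8").split()
        if len(words) < threshold:
            short += 1
    return short


# Small self-contained demo on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "ok.txt").write_text("word " * 20, encoding="utf-8")      # 20 words: fine
    (d / "failed.txt").write_text("too short", encoding="utf-8")   # 2 words: failed
    print(count_short_files(d))  # 1
```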
- Follow the on-screen prompts to choose whether to crawl cities or counties.
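Handling those prompts might normalize answers along these lines (a hypothetical helper; the actual script's prompts and accepted inputs may differ):

```python
def parse_crawl_choice(answer: str) -> str:
    """Map a prompt answer to a crawl target; illustrative only."""
    normalized = answer.strip().lower()
    if normalized in ("city", "cities", "1"):
        return "cities"
    if normalized in ("county", "counties", "2"):
        return "counties"
    raise ValueError(f"unrecognized choice: {answer!r}")


print(parse_crawl_choice("Cities"))  # cities
print(parse_crawl_choice(" 2 "))     # counties
```

Normalizing once in a small function keeps the interactive loop simple and easy to test.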