
Colorado Crawler Wikipedia

Welcome to the Colorado Crawler Wikipedia project, a tool for extracting and analyzing web data from Wikipedia pages related to Colorado. The collected text can be used to train NLP models on vocabulary relevant to your needs.

Features

  • Web Scraping: Use wikipedia_scraper.py to crawl Wikipedia pages and gather the data you need.
  • Easy Setup: Install the single dependency with pip install seleniumbase.
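The real scraper drives a browser with seleniumbase, which needs a browser installed. As a rough, dependency-free sketch of the text-extraction step it performs, here is a stdlib-only version that strips tags from fetched page HTML (the class and function names here are illustrative, not from the repository):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

In the actual project, seleniumbase handles fetching and JavaScript rendering; this sketch only covers turning the resulting HTML into plain text.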

Installing

  • Clone the repository and enter the project directory:
git clone https://github.com/LukeFarch/COCrawlerWiki.git
cd COCrawlerWiki
  • Change the file paths in the scripts as necessary to match your environment and needs.
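One low-friction way to handle the path changes is to read the output location from an environment variable with a sensible default. The variable name and directory below are hypothetical; match them to the path variables actually used in the scripts:

```python
import os
from pathlib import Path

# COCRAWLER_OUT and "scraped_pages" are placeholder names; adjust them
# to whatever paths the crawler scripts expect on your machine.
OUTPUT_DIR = Path(os.environ.get("COCRAWLER_OUT", "scraped_pages"))

def ensure_output_dir(base: Path = OUTPUT_DIR) -> Path:
    """Create the output directory if it does not exist and return it."""
    base.mkdir(parents=True, exist_ok=True)
    return base
```

This keeps the repository's code unedited between machines: set COCRAWLER_OUT in your shell instead of hand-editing paths.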

Executing the Program

  • To start crawling Wikipedia for Colorado-related pages, run:
python wikipedia_crawler.py
  • To get a word count, run word_count.py. It reports how many output files contain fewer than 10 words (failed crawls); adjust the threshold as needed:
python word_count.py
  • Follow the on-screen prompts to start crawling cities or counties.
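The word-count check can be sketched in a few lines: count whitespace-separated words per output file and flag anything under the threshold as a failed crawl. The function names and the .txt extension are assumptions, not the repository's actual code:

```python
from pathlib import Path

def count_words(path: Path) -> int:
    """Number of whitespace-separated words in a text file."""
    return len(path.read_text(encoding="utf-8").split())

def find_failed_crawls(directory, threshold: int = 10):
    """Return .txt files in `directory` with fewer than `threshold` words.

    Files this short usually mean the crawl of that page failed.
    """
    return [p for p in sorted(Path(directory).glob("*.txt"))
            if count_words(p) < threshold]
```

Raising or lowering `threshold` is the "adjust as needed" knob: 10 words is a heuristic for an empty or error page, not a fixed rule.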