Colorado Crawler Wikipedia

Welcome to the Colorado Crawler Wikipedia project, a tool for extracting and analyzing web data from Wikipedia pages related to Colorado. The resulting text files can be used to train NLP models on domain-specific vocabulary.

Features

  • Web Scraping: Use wikipedia_scraper.py to crawl Wikipedia pages and gather the data you need.
  • Easy Setup: Quick installation with a single command: pip install seleniumbase

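The scraper saves each crawled page as a plain .txt file. The save step might look roughly like this (a minimal sketch; `save_page_text` and the `output` directory are illustrative names, not the repository's actual code):

```python
import re
from pathlib import Path

def save_page_text(title: str, text: str, out_dir: str = "output") -> Path:
    """Write scraped page text to <out_dir>/<safe-title>.txt and return the path."""
    # Sanitize the page title so it can double as a filename.
    safe_title = re.sub(r"[^A-Za-z0-9_-]+", "_", title).strip("_")
    out_path = Path(out_dir) / f"{safe_title}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Sanitizing the title keeps filenames portable: "Denver, Colorado" would be written as Denver_Colorado.txt.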
Installing

  • Clone the repository and enter its directory:
git clone https://github.com/LukeFarch/COCrawlerWiki.git
cd COCrawlerWiki
  • Adjust the file paths in the scripts to match your environment and needs.
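The paths to change are typically hard-coded constants near the top of each script. A hedged sketch of what such a configuration block might look like (all names here are assumptions, not the repository's actual variables):

```python
from pathlib import Path

# Hypothetical configuration -- edit these to match your environment.
BASE_DIR = Path.home() / "COCrawlerWiki"              # where you cloned the repo
OUTPUT_DIR = BASE_DIR / "output"                      # where scraped .txt files go
START_URL = "https://en.wikipedia.org/wiki/Colorado"  # crawl entry point
```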

Executing the Program

  • To start crawling Wikipedia for Colorado-related pages, run:
python wikipedia_crawler.py
  • To check the results, run word_count.py; it reports how many output files contain fewer than 10 words (likely failed crawls). Adjust the threshold as needed:
python word_count.py
  • Follow the on-screen prompts to start crawling cities or counties.
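The under-10-words check can be sketched as follows (a re-implementation for illustration, assuming the crawler writes one .txt file per page; not the repository's exact code):

```python
from pathlib import Path

def count_failed_files(out_dir: str, min_words: int = 10) -> int:
    """Count .txt files whose word count is below min_words (likely failed crawls)."""
    failed = 0
    for txt_file in Path(out_dir).glob("*.txt"):
        words = txt_file.read_text(encoding="utf-8").split()
        if len(words) < min_words:
            failed += 1
    return failed
```

Files shorter than the threshold usually indicate a page that failed to load or had no extractable text.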

About

This project crawls Wikipedia for Colorado-related keywords and saves the scraped text as .txt files.
