Welcome to the Colorado Crawler Wikipedia project, a tool for extracting and analyzing web data from Wikipedia pages related to Colorado. The scraped text can be used to build corpora for training NLP models on the vocabulary you need.
- **Web Scraping:** Use `wikipedia_scraper.py` to crawl web pages and gather the data you need.
- **Easy Setup:** Quick installation with a simple `pip install seleniumbase`.
- Clone the repository and set up the environment:

  ```shell
  git clone https://github.com/LukeFarch/COCrawlerWiki.git
  cd COCrawlerWiki
  ```
- Adjust the paths in the code as necessary to match your environment.
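The paths to adjust usually sit near the top of the scripts as simple constants. A hypothetical sketch of what such a configuration block might look like (the names `BASE_DIR` and `OUTPUT_DIR` are illustrative; the project's scripts may use different variables):

```python
from pathlib import Path

# Hypothetical configuration block; adjust these to your environment.
# The real scripts may name or locate these differently.
BASE_DIR = Path(__file__).resolve().parent
OUTPUT_DIR = BASE_DIR / "scraped_pages"  # where crawled page text is written

# Create the output directory up front so later writes cannot fail.
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
```

Using `pathlib` keeps the paths portable between Windows and Unix machines.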
- To start crawling Wikipedia for Colorado-related pages, run:

  ```shell
  python wikipedia_crawler.py
  ```
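A crawler like this typically starts from a seed page and follows only links that look Colorado-related. A minimal, standard-library-only sketch of that filtering step (the real script drives a browser via seleniumbase and may filter differently; `extract_colorado_links` is an illustrative name, not the project's API):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_colorado_links(html: str) -> list[str]:
    """Keep only internal wiki links mentioning Colorado (illustrative heuristic)."""
    parser = LinkCollector()
    parser.feed(html)
    return [h for h in parser.links if h.startswith("/wiki/") and "Colorado" in h]


sample = '<a href="/wiki/Denver,_Colorado">Denver</a> <a href="/wiki/Kansas">Kansas</a>'
print(extract_colorado_links(sample))  # ['/wiki/Denver,_Colorado']
```

The heuristic is deliberately simple; a real crawl would also deduplicate visited URLs and respect rate limits.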
- Want a word count? Run `word_count.py` to see how many scraped files contain fewer than 10 words (i.e., failed scrapes). Adjust the threshold as needed.
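The failed-scrape check can be sketched with the standard library alone. This is a hedged illustration, not the project's actual `word_count.py`; the 10-word threshold mirrors the one mentioned above:

```python
import tempfile
from pathlib import Path

WORD_THRESHOLD = 10  # files below this word count are treated as failed scrapes


def count_short_files(directory: Path, threshold: int = WORD_THRESHOLD) -> int:
    """Count .txt files in `directory` whose word count falls below `threshold`."""
    short = 0
    for path in directory.glob("*.txt"):
        words = path.read_text(encoding="utf-8").split()
        if len(words) < threshold:
            short += 1
    return short


# Small self-contained demo on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "ok.txt").write_text("word " * 20, encoding="utf-8")      # 20 words: fine
    (d / "failed.txt").write_text("too short", encoding="utf-8")   # 2 words: failed
    print(count_short_files(d))  # 1
```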
- Follow the on-screen prompts to choose whether to crawl cities or counties.
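Handling those prompts might normalize answers along these lines (a hypothetical helper; the actual script's prompts and accepted inputs may differ):

```python
def parse_crawl_choice(answer: str) -> str:
    """Map a prompt answer to a crawl target; illustrative only."""
    normalized = answer.strip().lower()
    if normalized in ("city", "cities", "1"):
        return "cities"
    if normalized in ("county", "counties", "2"):
        return "counties"
    raise ValueError(f"unrecognized choice: {answer!r}")


print(parse_crawl_choice("Cities"))  # cities
print(parse_crawl_choice(" 2 "))     # counties
```

Normalizing once in a small function keeps the interactive loop simple and easy to test.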