Master_Scraper

A command-line scraper that extracts images and text from web pages.

This program iterates through a list of URLs copied from a spreadsheet and uses the html2text module (https://pypi.org/project/html2text/) to extract all the text on every webpage. It can also extract all the images from every webpage, skipping duplicates.
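As a rough illustration of the duplicate-avoiding image step, the sketch below collects image addresses from several pages and keeps each one only once. It is not the code in scraper_functions.py: the names `ImgSrcParser` and `unique_image_urls` are hypothetical, it uses only the standard library rather than html2text, and it assumes that "duplicate" means the same absolute image URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collect the raw src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def unique_image_urls(pages):
    """Return each image URL once, in first-seen order.

    `pages` maps a page URL to that page's HTML (hypothetical input shape).
    """
    seen = set()
    unique = []
    for page_url, html in pages.items():
        parser = ImgSrcParser()
        parser.feed(html)
        for src in parser.sources:
            # Resolve relative paths so the same file is recognised across pages.
            absolute = urljoin(page_url, src)
            if absolute not in seen:
                seen.add(absolute)
                unique.append(absolute)
    return unique
```

Because pages on one domain tend to share logos and icons, comparing absolute URLs is usually enough to stop the same file from being downloaded for every page.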

N.B. This program is intended for scraping content from a single domain, and works best when used that way.

Usage

To use this program:

First, create a spreadsheet with a single column listing every URL you want to scrape text or images from. (No header row is necessary, but each URL must be in a clean format, i.e. beginning with http:// or https:// and ending with / or white space.)
Copy all the URLs to your clipboard.
Run the program in your shell or IDE.
The program will ask you if you want to scrape images, text or both.
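The steps above can be sketched roughly as follows. This is a sketch under stated assumptions, not the program's actual code: `parse_urls` and `parse_choice` are hypothetical helper names, the clipboard contents are passed in as a plain string (reading the clipboard itself would need an extra library), and the accepted answers "images", "text" and "both" are assumed rather than taken from main.py.

```python
def parse_urls(clipboard_text):
    """Split pasted spreadsheet cells into a list of clean URLs."""
    urls = []
    for line in clipboard_text.splitlines():
        candidate = line.strip()  # drop trailing white space from the cell
        # Keep only entries in the clean format described above.
        if candidate.startswith(("http://", "https://")):
            urls.append(candidate)
    return urls


def parse_choice(answer):
    """Map the user's answer to (scrape_images, scrape_text)."""
    choices = {
        "images": (True, False),
        "text": (False, True),
        "both": (True, True),
    }
    key = answer.strip().lower()
    if key not in choices:
        raise ValueError(f"Expected images, text or both, got: {answer!r}")
    return choices[key]
```

A spreadsheet column copied to the clipboard normally arrives as one cell per line, which is why splitting on line breaks is enough here.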

Output

The program saves the files it creates in the directory you run it from. For example, if the file is located in and run from c:\Users\name\Documents\scraper, all the .jpg and .txt files it creates are saved there too.
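One way to build such output paths is sketched below. It assumes that "the directory you run the file from" means the current working directory, and the naming scheme (a file name derived from the URL) is an illustration only; the real program may name its files differently. `output_path` is a hypothetical helper.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def output_path(url, extension, directory=None):
    """Build a file path for a scraped URL, defaulting to the working directory."""
    target_dir = Path(directory) if directory is not None else Path.cwd()
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).strip("/")
    # Replace anything that is not a letter or digit so the name is safe on disk.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_") or "page"
    return target_dir / f"{stem}{extension}"
```

For instance, `output_path("https://example.com/about/team/", ".txt")` yields a file named example_com_about_team.txt in the current directory.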

ben-meyer/Master_Scraper

Folders and files

Latest commit

History

Repository files navigation

Master_Scraper

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages