Image scraper for Google Images that collects images from their original websites (Python 3 and Selenium)
This code scrapes images found through Google Images. Typical image scrapers query a keyword in Google Images and download the result images directly. However, that technique returns small images without their original file names.
This crawler instead collects the images from their original websites. It follows these steps:
- Query Google Images for a particular keyword, e.g. "cat"
- Using the Selenium library, open the Google Images results page and scroll down to load as many images as possible
- Save the resulting HTML of the Google Images page locally
- Using the Beautiful Soup library, parse that HTML to find the original website of each image and visit each website individually
- For each website, collect all the images and save them locally in a new folder named after the query
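The scrolling step and the per-website image collection step can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function names `scroll_to_end` and `image_urls`, the `pause` and `max_rounds` parameters, and the choice to skip inline `data:` images are all assumptions made for illustration. Finding the original website of each image inside the Google Images HTML depends on Google's current markup, so that part is left out.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scroll_to_end(driver, pause=1.0, max_rounds=50):
    """Scroll a Selenium-driven page down until its height stops growing.

    `driver` is any Selenium WebDriver, for example webdriver.Firefox().
    Returns the page HTML after the last scroll.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded results time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we assume the end was reached
        last_height = new_height
    return driver.page_source


def image_urls(page_url, html):
    """Return the absolute URLs of all <img> tags in `html`, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src or src.startswith("data:"):
            continue  # skip tags without a source and inline base64 images
        absolute = urljoin(page_url, src)  # resolve relative paths
        if absolute not in urls:
            urls.append(absolute)
    return urls
```

With a real browser, you would create the driver with Selenium, call `driver.get(...)` on the Google Images results page, and write the string returned by `scroll_to_end(driver)` to a file. `image_urls` is then called once per visited website with that site's URL and HTML.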
For this version, I hardcoded the following parameters, so you need to change them before you use the script:
- Selenium driver path (for Firefox the driver is geckodriver; you can download it online, just google it)
- The query keywords
- The output folder name
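If you prefer not to edit values scattered through the notebook, one option is to gather the three parameters in a single object. The sketch below is only a suggestion: `ScraperConfig`, its field names, and the `problems()` helper are hypothetical and do not exist in the notebook.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ScraperConfig:
    driver_path: Path                    # where the Selenium driver executable lives
    queries: list                        # keywords to search for, e.g. ["cat"]
    output_dir: Path = Path("Dataset")   # top-level output folder

    def problems(self):
        """Return a list of human-readable problems; an empty list means usable."""
        found = []
        if not Path(self.driver_path).is_file():
            found.append(f"driver not found: {self.driver_path}")
        if not [q for q in self.queries if q.strip()]:
            found.append("no query keywords given")
        return found
```

Calling `problems()` before starting the browser lets the script stop early with a readable message instead of failing halfway through a crawl.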
When you run the Jupyter notebook, you will get a folder called "Dataset". Inside Dataset, you will have two sub-folders:
- images: contains one folder per query, holding the images collected for that query
- soups: contains the HTML dump of the Google Images page for each query
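The folder layout above, together with the goal of keeping original file names, could be reproduced with something like the sketch below. The helper names `prepare_folders` and `original_name` and the `fallback` value are assumptions for illustration, not names from the notebook.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def prepare_folders(root, query):
    """Create <root>/images/<query> and <root>/soups, and return both paths."""
    root = Path(root)
    images_dir = root / "images" / query
    soups_dir = root / "soups"
    images_dir.mkdir(parents=True, exist_ok=True)
    soups_dir.mkdir(parents=True, exist_ok=True)
    return images_dir, soups_dir


def original_name(image_url, fallback="image"):
    """Take the file name from the URL path so the image keeps its original name."""
    name = PurePosixPath(unquote(urlparse(image_url).path)).name
    return name or fallback  # URLs that end in "/" have no file name
```

For example, `prepare_folders("Dataset", "cat")` creates `Dataset/images/cat` and `Dataset/soups`, and each downloaded image can then be written to `images_dir / original_name(url)`.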
The requirements for this project are:
- BeautifulSoup
- Selenium (I'm using Firefox here)
- Python3, of course :)
- tqdm
- requests
Just install the Python packages with pip, and you are ready to go!
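To check that the packages are in place before running the notebook, a small helper along these lines may help. It is a sketch under assumptions: `REQUIRED` and `missing_packages` are hypothetical names, and the mapping lists the import name of each library next to the name it is published under on pip.

```python
from importlib.util import find_spec

# import name -> package name to pass to pip
REQUIRED = {
    "bs4": "beautifulsoup4",
    "selenium": "selenium",
    "tqdm": "tqdm",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the required packages that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if find_spec(module) is None
    ]
```

An empty list from `missing_packages()` means all four libraries can be imported. The Selenium driver itself is not a Python package, so it is not covered by this check.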