This program extracts the URL of a website's logo, given the website's URL.
This program works with Python 3. Clone this repository with Git or download it. The program has been tested on Windows 10, Linux (Ubuntu 16) and Mac OS X, in every case with Python 3.5 or higher.
Firefox needs to be installed on your machine.
This program requires the selenium package. Installation instructions can be found here. (Note: There are specific instructions for Windows users.)
In general, running the following command on the command line should install selenium successfully (assuming pip is already installed):
pip install selenium
A headless browser is used to fetch webpage content. The program uses the Firefox driver, which comes by default with the Selenium package. The headless browser also needs geckodriver, which can be found here. Download the driver build that matches your machine.
The config file is in the Logo_Extraction_master directory. Set the path to geckodriver in it.
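As a rough illustration of how a configured geckodriver path might reach the browser, here is a minimal sketch. It assumes an INI-style config file with a `[driver]` section and a `geckodriver_path` key; those names, and both helper functions, are hypothetical and may not match the config file shipped in this repository. The driver setup uses the Selenium 4 style API; older Selenium releases take the driver path differently.

```python
import configparser


def read_geckodriver_path(config_file):
    """Return the geckodriver path stored in an INI-style config file.

    The [driver] section and geckodriver_path key are hypothetical names
    chosen for this sketch.
    """
    parser = configparser.ConfigParser()
    # parser.read() silently skips missing files, so check its result.
    if not parser.read(config_file):
        raise FileNotFoundError("Config file not found: {}".format(config_file))
    return parser["driver"]["geckodriver_path"]


def make_headless_firefox(geckodriver_path):
    """Start Firefox without a visible window (Selenium 4 style API)."""
    # Imported lazily so the config helper works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    options.add_argument("-headless")
    service = Service(executable_path=geckodriver_path)
    return webdriver.Firefox(service=service, options=options)
```

A caller would pass the result of `read_geckodriver_path(...)` to `make_headless_firefox(...)` and call `quit()` on the returned driver when finished.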
This file performs the logo extraction task. It accepts either an input file or a URL from the command line.
To run the file:
- Open a command prompt (the application was tested in the Anaconda command prompt) and cd to the Logo_Extraction_master directory.
- Run the file as follows:
- For an input file use:
python logo_extraction.py /path/to/your/input_file/your_input_file.txt
- For a URL use:
python logo_extraction.py http://python.org
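The two invocation forms above imply that the script must tell a URL apart from a file path. One way this could be done is sketched below; `looks_like_url` and `collect_urls` are hypothetical names, and the sketch assumes the input file lists one URL per line, which the source does not state.

```python
from urllib.parse import urlparse


def looks_like_url(argument):
    """Return True if the argument has an http(s) scheme and a host part."""
    parsed = urlparse(argument)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def collect_urls(argument):
    """Return the list of URLs named by one command-line argument.

    Assumption: an input file contains one URL per line; blank lines
    are skipped.
    """
    if looks_like_url(argument):
        return [argument]
    with open(argument, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Checking the scheme explicitly means a Windows path such as `C:\urls.txt` is treated as a file rather than as a URL with scheme `c`.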
The output is written to a file named output.txt in the same directory. Each line has the format website url, logo_url. For some websites the logo is just stylized text; in such cases the logo url is left blank.
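For readers who want to post-process output.txt, the following sketch parses lines in the format described above and lists the websites whose logo url is blank. It assumes the first comma on a line separates the two fields and that no quoting is used; both function names are hypothetical.

```python
def parse_output_line(line):
    """Split one 'website url, logo_url' line into a (website, logo) pair.

    A blank logo field is returned as None.
    Assumption: the first comma separates the two fields.
    """
    website, _, logo = line.strip().partition(",")
    logo = logo.strip()
    return website.strip(), (logo or None)


def sites_without_logo(lines):
    """Return the website URLs whose logo field is blank."""
    pairs = (parse_output_line(line) for line in lines if line.strip())
    return [website for website, logo in pairs if logo is None]
```

Typical use would be `sites_without_logo(open("output.txt"))`, which gives the websites to inspect manually.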
- Open a command prompt (the application was tested in the Anaconda command prompt) and cd to the Logo_Extraction_master directory.
- Use the command:
python -m unittest -v logo_extraction_test.py
The script writes its log to logo_extraction.log in the same directory. The log records the number of urls processed, the tags from which each logo url was sourced, and errors such as invalid urls.
In addition to the script's log, geckodriver writes its own log, geckodriver.log, to the same directory. Consult it for additional information.
The Experimental directory contains a logo extraction implementation that attempts to use multiprocessing through the Pool class of Python's multiprocessing library. It is not included in the final implementation because the WebDriver in selenium is not thread-safe.
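For orientation, the general shape of a Pool-based approach is sketched below. This is not the code in the Experimental directory: `extract_logo` is a hypothetical stand-in that returns a blank logo url so the sketch runs without a browser, and `extract_all` is likewise a made-up name.

```python
from multiprocessing import Pool


def extract_logo(url):
    """Hypothetical stand-in for the per-URL extraction step.

    A real worker would load the page in a browser; this one returns a
    blank logo url so the sketch has no external dependencies.
    """
    return url, ""


def extract_all(urls, processes=2):
    """Map extract_logo over urls using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # pool.map preserves the order of the input list.
        return pool.map(extract_logo, urls)


if __name__ == "__main__":
    for website, logo in extract_all(["http://python.org"]):
        print("{}, {}".format(website, logo))
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the main module, such as Windows.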