This is a logo extraction program implemented in Python.

ssb10/Logo_Extraction

Logo Extraction

This program extracts the URL of a website's logo, given the website's URL.

Getting started

This program requires Python 3. It was tested on Windows 10, Linux (Ubuntu 16) and Mac OS X, in each case with Python 3.5 or higher. To get the code, git clone or download this repository.

Prerequisites

Firefox needs to be installed on your machine.

This program requires the selenium package. Installation instructions can be found here. (Note: There are specific instructions for Windows users.)

In general, running the following command on the command line should install selenium successfully (assuming pip is already installed):

pip install selenium

A headless browser is used to fetch webpage content. The program uses the Firefox driver, which comes by default with the Selenium package. The headless browser also needs geckodriver, which can be found here. Download the geckodriver build that matches your machine.
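As a rough illustration of how a headless Firefox session can be driven through geckodriver, here is a minimal sketch. It assumes the Selenium 4 API (older 3.x releases pass the driver path differently), and the function name fetch_page_source and its parameters are hypothetical, not taken from logo_extraction.py.

```python
def fetch_page_source(url, geckodriver_path):
    """Load url in headless Firefox and return the rendered HTML."""
    # Imported here so that this module still loads when Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window

    # Assumption: geckodriver_path is the location of the downloaded driver.
    driver = webdriver.Firefox(
        service=Service(executable_path=geckodriver_path),
        options=options,
    )
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Calling fetch_page_source("http://python.org", "/path/to/geckodriver") would need Firefox, geckodriver and selenium to be installed as described above.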

Config.file

The config file is located in the Logo_Extraction_master directory. Set the path to geckodriver in it.
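The layout of the config file is not described here, so the following is only a sketch of one way such a setting could be read. It assumes an INI-style file; the section name geckodriver, the key path, and the helper read_geckodriver_path are all hypothetical and may not match the real file.

```python
import configparser


def read_geckodriver_path(config_text):
    """Return the geckodriver path from INI-style config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Hypothetical section and key names, chosen for illustration only.
    return parser.get("geckodriver", "path")


# Example config content (assumed layout, placeholder path).
example = """
[geckodriver]
path = /path/to/geckodriver
"""
print(read_geckodriver_path(example))
```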

Running the script

logo_extraction.py

This file performs the logo extraction task. It accepts either an input file or a URL from the command line.

To run the file:

  1. Open a command prompt (the application was tested on the Anaconda command prompt) and cd to the Logo_Extraction_master directory.
  2. Run the file as follows:
  • For an input file, use:
python logo_extraction.py "/path/to/your/input file/your_input_file.txt"
  • For a URL, use:
python logo_extraction.py http://python.org
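The two invocation forms above imply that the script has to decide whether its argument is a URL or a file. A minimal sketch of that decision follows. The helper names is_url and load_urls are hypothetical, and the one-URL-per-line input file format is an assumption that this README does not confirm.

```python
from urllib.parse import urlparse


def is_url(argument):
    """Treat the argument as a URL only if it has an http(s) scheme and a host."""
    parts = urlparse(argument)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def load_urls(argument):
    """Return the list of URLs to process for one command-line argument."""
    if is_url(argument):
        return [argument]
    # Assumption: the input file lists one URL per line; blank lines are skipped.
    with open(argument, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


print(load_urls("http://python.org"))  # → ['http://python.org']
```

Checking the scheme rather than merely testing for a colon keeps Windows paths such as C:\data\urls.txt from being mistaken for URLs.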

The output is written to a file named output.txt in the same directory. Each line has the format website url, logo_url. For some websites the logo might be just stylized text; in such cases, the logo URL will be blank.
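To make the output format concrete, here is a small sketch of how such lines could be produced, including the blank logo URL case. The helper names format_result and write_results are hypothetical, the example.com URLs are placeholders, and the exact separator (a comma followed by a space) is an assumption based on the format given above.

```python
def format_result(website_url, logo_url=None):
    """Build one output line; a missing logo yields a blank second field."""
    return "{}, {}".format(website_url, logo_url or "")


def write_results(results, output_path="output.txt"):
    """Write (website_url, logo_url) pairs to the output file, one per line."""
    with open(output_path, "w", encoding="utf-8") as handle:
        for website_url, logo_url in results:
            handle.write(format_result(website_url, logo_url) + "\n")


# Placeholder URLs, for illustration only.
print(format_result("http://example.com", "http://example.com/logo.png"))
print(format_result("http://example.com"))  # stylized-text logo: blank logo URL
```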

Running tests

logo_extraction_test.py

  1. Open a command prompt (the application was tested on the Anaconda command prompt) and cd to the Logo_Extraction_master directory.
  2. Use the command:
python -m unittest -v logo_extraction_test.py

Interpreting logs

The script writes its log to logo_extraction.log in the same directory. The log records the number of URLs processed, the tags from which each logo URL was sourced, and errors such as invalid URLs.

Apart from the log generated by the script, geckodriver writes its own log, geckodriver.log, to the same directory. Refer to it for additional information.
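For readers who want to see how a log file like this is typically produced, the sketch below sets up a file logger with Python's standard logging module. Only the file name logo_extraction.log comes from this README; the helper build_logger, the logger name, and the message format are assumptions.

```python
import logging


def build_logger(log_path="logo_extraction.log"):
    """Return a logger that appends timestamped records to log_path."""
    logger = logging.getLogger("logo_extraction")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With this in place, a call such as build_logger().error("invalid url: %s", url) would add one ERROR line to the log file; the message wording is illustrative only.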

Experimental Files

The Experimental directory contains a logo extraction implementation that attempts to parallelize the work with the Pool class from Python's multiprocessing library. It is not included in the final implementation because the WebDriver in selenium is not thread-safe.
