Logo Extraction

This program extracts the URL of a website's logo, given the website's URL.

Getting started

This program requires Python 3. Git clone or download this repository. The program has been tested on Windows 10, Linux (Ubuntu 16) and Mac OS X, in each case with Python 3.5 or higher.

Prerequisites

Firefox needs to be installed on your machine.

This program requires the selenium package. Installation instructions can be found here. (Note: There are specific instructions for Windows users.)

In general, running the following command on the command line should install selenium successfully (assuming pip is already installed):

pip install selenium

A headless browser is used to fetch webpage content. The program uses the Firefox driver, which comes by default with the Selenium package. The headless browser needs geckodriver, which can be found here. Download the driver that matches your machine.
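
As a rough illustration of how a headless Firefox session is typically started through geckodriver, here is a minimal sketch. It assumes a Selenium 4 style API (`Service` and `Options`); older Selenium releases, such as those this program was tested with, accepted the driver path differently. The function names `make_headless_firefox` and `fetch_page_source` are hypothetical and are not taken from logo_extraction.py.

```python
def make_headless_firefox(geckodriver_path):
    """Start a headless Firefox session using the given geckodriver binary."""
    # Imported inside the function so this sketch can be loaded even on a
    # machine where selenium has not been installed yet.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    service = Service(executable_path=geckodriver_path)
    return webdriver.Firefox(service=service, options=options)


def fetch_page_source(url, geckodriver_path):
    """Load a page in headless Firefox and return the rendered HTML."""
    driver = make_headless_firefox(geckodriver_path)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Calling `driver.quit()` in a `finally` block keeps stray Firefox processes from piling up when a page fails to load.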

Config.file

The config file is in the Logo_Extraction_master directory. Set the path to geckodriver in this file.

Running the script

logo_extraction.py

This file performs the logo extraction task. It accepts either an input file or a URL from the command line.

To run the file:

Open a command prompt (the application is tested on the Anaconda command prompt) and cd to the Logo_Extraction_master directory.
Run the file as follows:

For an input file, use:

python logo_extraction.py /path/to/your/input_file/your_input_file.txt

For a URL, use:

python logo_extraction.py http://python.org
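
One way a script could tell the two kinds of argument apart is sketched below. This is not the code in logo_extraction.py: the helper names `looks_like_url` and `collect_urls` are hypothetical, and the sketch assumes the input file lists one URL per line.

```python
import os
from urllib.parse import urlparse


def looks_like_url(argument):
    """Return True when the argument has an http(s) scheme and a host."""
    parsed = urlparse(argument)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def collect_urls(argument):
    """Turn a command-line argument into a list of URLs to process."""
    if looks_like_url(argument):
        return [argument]
    if os.path.isfile(argument):
        # Assumption: the input file holds one URL per line.
        with open(argument, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    raise ValueError("not a URL or an existing file: {!r}".format(argument))
```

Checking the scheme explicitly means a Windows path such as `C:\data\urls.txt` is not mistaken for a URL, even though `urlparse` reads `c` as its scheme.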

The output is written to a file named output.txt in the same directory. The output format is website url, logo_url. For some websites the logo may be just stylized text; in such cases, the logo url will be blank.
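
To consume the results from another script, the output file could be read along these lines. This is a sketch only: it assumes one comma-separated pair per line, as the format above suggests, and the helper names `parse_output_line` and `read_results` are hypothetical. The example.com URLs in the usage note are placeholders, not real output.

```python
def parse_output_line(line):
    """Split one output line into (website_url, logo_url or None)."""
    # Split on the first comma only, so commas inside the logo URL survive.
    website_url, _, logo_url = line.strip().partition(",")
    return website_url.strip(), logo_url.strip() or None


def read_results(path="output.txt"):
    """Read the output file into a dict mapping website URL to logo URL."""
    results = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip empty lines
                website_url, logo_url = parse_output_line(line)
                results[website_url] = logo_url
    return results
```

For instance, `parse_output_line("http://example.com, http://example.com/logo.png")` yields the pair of URLs, while a line with a blank logo url yields `None` in the second position. A website URL that itself contains a comma would be split in the wrong place by this sketch.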

Running tests

logo_extraction_test.py

Open a command prompt (the application is tested on the Anaconda command prompt) and cd to the Logo_Extraction_master directory.
Use the command:

python -m unittest -v logo_extraction_test.py

Interpreting logs

The script's log is written to logo_extraction.log in the same directory. It records the number of URLs processed, the logo URL sources broken down by tag, and errors such as invalid URLs.

Apart from the log generated by the script, geckodriver writes its own log, geckodriver.log, to the same directory. Refer to it for additional information.

Experimental Files

The Experimental directory contains a logo extraction implementation that attempts to use multiprocessing through the Pool class in Python's multiprocessing library. It is not included in the final implementation because the WebDriver in selenium is not thread-safe.

Repository: ssb10/Logo_Extraction