Digikala Product Crawler

A crawler for scraping essential information of products from the Digikala website

Create a virtual environment using the venv module and activate it:

python3 -m venv env
source env/bin/activate

Install the dependencies:

pip install -r requirements.txt

Crawling Process

The spider starts at the homepage and navigates through each main category. It then crawls the first 5 pages of each primary sub-category, ordered by the most-visited filter (which fits our advertising purpose), and extracts concise information about each product.

The output can be saved to a .jl (JSON Lines) file, one product per line, with the following command:

cd Crawler && scrapy runspider productSpider.py -o products.jl
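For orientation, here is a minimal sketch of how such a spider could be structured with Scrapy; the CSS selectors, query parameters, and field names are illustrative assumptions, not the actual contents of productSpider.py:

import scrapy


class ProductSpider(scrapy.Spider):
    """Sketch of a spider that walks categories and the first 5 result pages."""

    name = "products"
    start_urls = ["https://www.digikala.com/"]

    def parse(self, response):
        # Follow each main-category link found on the homepage (selector is an assumption).
        for href in response.css("a.main-category::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Request the first 5 pages of the category, ordered by the
        # "most visited" filter (query parameters are assumptions).
        for page in range(1, 6):
            yield response.follow(
                f"{response.url}?sortby=most-visited&pageno={page}",
                callback=self.parse_products,
            )

    def parse_products(self, response):
        # Yield concise per-product information (selectors and fields are assumptions).
        for product in response.css("div.product-box"):
            yield {
                "id": product.attrib.get("data-product-id"),
                "category": response.url,
                "title": product.css("a.title::text").get(default="").strip(),
                "price": product.css("span.price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get(default="")),
            }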

Cleaning the Data

The crawler's output contains faulty and redundant data. The cleanData.py script produces a comma-separated values file with the columns ['index', 'id', 'category', 'title', 'price', 'url']. Scrapy emits strings as Unicode rather than ASCII, so the built-in ast module is used to parse each line back into a Python dictionary before writing the CSV.

python cleanData.py

The CSV file is generated as products.csv.
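For reference, below is a minimal sketch of what such a cleaning script could look like, following the ast-based approach described above; the exact filtering logic and column handling of cleanData.py are assumptions:

import ast
import csv

COLUMNS = ['index', 'id', 'category', 'title', 'price', 'url']

with open('products.jl', encoding='utf-8') as src, \
        open('products.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.DictWriter(dst, fieldnames=COLUMNS)
    writer.writeheader()
    for index, line in enumerate(src):
        line = line.strip()
        if not line:
            continue
        try:
            # ast.literal_eval turns the dict-like Unicode string into a Python dict.
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            # Skip faulty rows that cannot be parsed.
            continue
        writer.writerow({
            'index': index,
            'id': record.get('id'),
            'category': record.get('category'),
            'title': record.get('title'),
            'price': record.get('price'),
            'url': record.get('url'),
        })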
