Scrapy-Kafka Demo


  • Chinese introduction: Chinese README.md here

A few words

  • If you run into a problem while running this demo, you can email me or open an issue.
  • Email: [email protected]
  • QQ: 379978424

Preparation

  • ZooKeeper environment (zookeeper-3.4.10) -> Install Step
  • Kafka environment (kafka-1.0.0) -> Install
  • System environment (Win10 x64)
  • Python environment (Python 3.4.4)

Dependency

  • Environment
    • Python 3.4.4 (Python 2 is untested. If you have tested it, please tell me or open an issue.)
  • Library
    • Scrapy
    • pykafka
  • How to install:
Windows: pip install -r requirements.txt
Linux: pip3 install -r requirements.txt

Project Structure

  • consumer --- pykafka consumer module (tests reception and receives the Scrapy data)
  • producer --- pykafka producer module (tests sending messages)
  • scrapy_kafka --- Scrapy + pykafka spider (crawls every HTML a tag from the main page of my campus website)

Pay attention

  • These notes cover the Kafka-specific points rather than Scrapy itself.
  1. Kafka transfers bytes, so the pipeline must call encode on the data it receives before sending it. The encoding argument passed to encode must match the one the consumer passes to decode.
  2. Implement close_spider(self, spider) in the pipeline to shut down the producer. Otherwise Scrapy hangs while waiting for the producer to close.
  3. The pipeline contains a method that interprets the KAFKA_IP_PORT setting:
    • A single deployment can use a list or a str.
    • A pseudo-distributed or fully distributed deployment can use a list, or a str with multiple IP:port pairs separated by commas, e.g. "192.168.1.101:9092,192.168.1.102:9092"
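
The three points above can be sketched as one pipeline. This is a minimal illustration and not the project's actual code: the class name KafkaPipeline, the helper normalize_hosts, the KAFKA_TOPIC setting, the default topic name, and the JSON serialization are all assumptions made for the example. Only KAFKA_IP_PORT and close_spider come from the notes above. pykafka is imported lazily, so the helper can be tried without a broker.

```python
import json


def normalize_hosts(setting):
    """Turn a str or list KAFKA_IP_PORT value into "ip:port,ip:port"."""
    if isinstance(setting, str):
        parts = setting.split(",")
    elif isinstance(setting, (list, tuple)):
        parts = list(setting)
    else:
        raise TypeError("KAFKA_IP_PORT must be a str or a list of str")
    hosts = [p.strip() for p in parts if p.strip()]
    if not hosts:
        raise ValueError("KAFKA_IP_PORT is empty")
    return ",".join(hosts)


class KafkaPipeline:
    # The consumer must decode with this same encoding.
    ENCODING = "utf-8"

    def __init__(self, hosts, topic):
        self.hosts = normalize_hosts(hosts)
        self.topic = topic
        self.producer = None

    @classmethod
    def from_crawler(cls, crawler):
        # KAFKA_TOPIC is a hypothetical setting name used only in this sketch.
        settings = crawler.settings
        return cls(settings.get("KAFKA_IP_PORT"),
                   settings.get("KAFKA_TOPIC", "newtest"))

    def open_spider(self, spider):
        from pykafka import KafkaClient  # lazy import: needs pykafka installed
        client = KafkaClient(hosts=self.hosts)
        topic = client.topics[self.topic.encode(self.ENCODING)]
        self.producer = topic.get_producer()

    def process_item(self, item, spider):
        # Point 1: Kafka carries bytes, so serialize and encode before sending.
        payload = json.dumps(dict(item), ensure_ascii=False)
        self.producer.produce(payload.encode(self.ENCODING))
        return item

    def close_spider(self, spider):
        # Point 2: stop the producer so Scrapy does not hang on shutdown.
        if self.producer is not None:
            self.producer.stop()
            self.producer = None
```

On the consumer side, each message value would be turned back into text with value.decode("utf-8"), matching ENCODING here.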

Zookeeper Install

  • Single Deployment
  1. Download ZooKeeper: Download link
  2. Unzip the package, go into conf, and copy (or rename) zoo_sample.cfg to zoo.cfg.
  3. Set the system environment variable ZOOKEEPER_HOME to the ZooKeeper root path.
  4. Add the ZooKeeper bin path to PATH.
  5. Run zkServer in cmd.
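
To confirm that the server actually started, a plain TCP check is usually enough. The sketch below assumes ZooKeeper listens on localhost:2181, the address used by the Kafka commands in this README. port_is_open is a hypothetical helper name, not part of this project.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling port_is_open("localhost", 2181) after starting zkServer should return True. The same check works for a Kafka broker on localhost:9092.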

Kafka Install

  • Single Deployment
  1. Download Kafka and unzip it: Download link
  2. Go into config and edit server.properties: set log.dirs to a folder you created for storing the logs.
  3. Set the system environment variable KAFKA_HOME.
  4. Add the Kafka bin/windows path to PATH (on Linux, add bin to PATH instead).
  5. Run Kafka with kafka-server-start, passing it the path to server.properties.
  • Test Kafka
  1. Create a topic:
kafka-topics --create --topic newtest --partitions 1 --replication-factor 1 --zookeeper localhost:2181
  2. Create a producer:
kafka-console-producer --broker-list localhost:9092 --topic newtest
  • The window now waits for input. Do not close it; start the consumer next.
  3. Create a consumer:
kafka-console-consumer --zookeeper localhost:2181 --topic newtest
  • Once the consumer has started successfully, go back to the producer window and type something. The consumer window should then show what you typed in the producer window.
  4. For other Kafka operations, please check the official documentation.
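
The same round trip can also be scripted with pykafka, which is the role of the producer and consumer modules in this project. What follows is a sketch under assumptions rather than the project's code: the function names, the 5-second consumer timeout, and the UTF-8 encoding are choices made for the example, and the broker address and topic are taken from the commands above. pykafka is imported inside the functions, so the encode and decode helpers can be tried without a running broker.

```python
ENCODING = "utf-8"  # producer and consumer must agree on this


def encode_text(text):
    """Encode a str to the bytes that Kafka transfers."""
    return text.encode(ENCODING)


def decode_value(value):
    """Decode a received message value back to a str."""
    return value.decode(ENCODING)


def send_lines(lines, hosts="localhost:9092", topic=b"newtest"):
    """Send each line to the topic and wait for delivery."""
    from pykafka import KafkaClient  # needs pykafka and a running broker
    client = KafkaClient(hosts=hosts)
    # The context manager stops the producer on exit.
    with client.topics[topic].get_sync_producer() as producer:
        for line in lines:
            producer.produce(encode_text(line))


def print_messages(hosts="localhost:9092", topic=b"newtest"):
    """Print messages until none arrive for 5 seconds."""
    from pykafka import KafkaClient  # needs pykafka and a running broker
    client = KafkaClient(hosts=hosts)
    consumer = client.topics[topic].get_simple_consumer(
        consumer_timeout_ms=5000)
    for message in consumer:
        if message is not None:
            print(message.offset, decode_value(message.value))
```

With ZooKeeper and Kafka running, calling send_lines(["hello"]) in one interpreter and print_messages() in another mirrors the two console windows.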