Chrome-based web crawler

Web crawler using Chrome headless mode and Chrome remote debugging. Logs all requests to stdout.

Prerequisites

  • Node.js and npm
  • Chrome, version 60 or later
  • A Redis server running locally on the standard port, without a password

Set up

  • Clone the repository with git clone https://bitbucket.org/gatorek/chrome-crawler.git ./ or go to the Downloads section of this repo
  • Run npm install, which installs all necessary Node dependencies

Run

  • node crawler.js -d 100 -g 4 -u http://url.to.scan > log.txt
  • node crawler.js --help prints short usage instructions
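As a rough sketch of what a crawler like this does under the hood, the following logs every network request of a page over the Chrome DevTools Protocol. This is an illustration, not the repository's actual code: it assumes the chrome-remote-interface npm package (not named in this README), and the formatRequest and logRequests names are hypothetical.

```javascript
// Sketch only: log every network request of a page to stdout via the
// Chrome DevTools Protocol. Assumes Chrome is running headless with
// remote debugging enabled (e.g. --headless --remote-debugging-port=9222)
// and that the chrome-remote-interface npm package is installed.

// Hypothetical helper: format one request as a log line.
function formatRequest(request) {
  return `${request.method} ${request.url}`;
}

async function logRequests(url) {
  const CDP = require('chrome-remote-interface');
  const client = await CDP(); // connects to localhost:9222 by default
  const { Network, Page } = client;

  // Print each outgoing request as Chrome announces it.
  Network.requestWillBeSent(({ request }) => {
    console.log(formatRequest(request));
  });

  await Network.enable();
  await Page.enable();
  await Page.navigate({ url });
  await Page.loadEventFired(); // wait until the page has finished loading
  await client.close();
}

module.exports = { formatRequest, logRequests };
```

Calling logRequests('http://url.to.scan') would then print one line per request made while loading the page.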

TODO

  • add timestamps
  • add a timeout option for Chrome debugging
  • automatically set the Chrome concurrency level based on the CPU core count
  • separate Redis namespaces for every instance of the program
  • clean the queue after an interrupt (e.g. Ctrl+C)
