Chrome-based web crawler

Web crawler using Chrome headless mode and Chrome remote debugging. Logs all requests to stdout.

Prerequisites

  • Node.js and npm
  • Chrome, version 60 or later
  • A Redis server running locally on the standard port, without a password

Set up

  • Clone the repository with git clone https://bitbucket.org/gatorek/chrome-crawler.git ./ or go to the Downloads section of this repo
  • Run npm install, which installs all necessary Node dependencies

Run

  • node crawler.js -d 100 -g 4 -u http://url.to.scan > log.txt
  • node crawler.js --help prints short usage instructions
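As a rough sketch of what a crawler like this does under the hood, the following logs every network request of a page over the Chrome DevTools Protocol. This is an illustration, not the repository's actual code: it assumes the chrome-remote-interface npm package (not named in this README), and the formatRequest and logRequests names are hypothetical.

```javascript
// Sketch only: log every network request of a page to stdout via the
// Chrome DevTools Protocol. Assumes Chrome is running headless with
// remote debugging enabled (e.g. --headless --remote-debugging-port=9222)
// and that the chrome-remote-interface npm package is installed.

// Hypothetical helper: format one request as a log line.
function formatRequest(request) {
  return `${request.method} ${request.url}`;
}

async function logRequests(url) {
  const CDP = require('chrome-remote-interface');
  const client = await CDP(); // connects to localhost:9222 by default
  const { Network, Page } = client;

  // Print each outgoing request as Chrome announces it.
  Network.requestWillBeSent(({ request }) => {
    console.log(formatRequest(request));
  });

  await Network.enable();
  await Page.enable();
  await Page.navigate({ url });
  await Page.loadEventFired(); // wait until the page has finished loading
  await client.close();
}

module.exports = { formatRequest, logRequests };
```

Calling logRequests('http://url.to.scan') would then print one line per request made while loading the page.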

TODO

  • add timestamps
  • add a timeout option for Chrome debugging
  • automatically set the Chrome concurrency level based on the CPU core count
  • separate Redis namespaces for every instance of the program
  • clean the queue after an interrupt (e.g. Ctrl+C)
