Floodesh is a middleware-based web spider written in Node.js. The name "Floodesh" combines two words, flood and mesh.
Make sure g++, make, libboost-all-dev, gperf, libevent-dev and uuid-dev have been installed.
wget https://launchpad.net/gearmand/1.2/1.1.12/+download/gearmand-1.1.12.tar.gz
tar xzf gearmand-1.1.12.tar.gz
cd gearmand-1.1.12
./configure
make
make install
$ npm install -g floodesh-cli
Generate a new app from templates with a single command.
$ mkdir floodesh_demo
$ cd floodesh_demo
$ floodesh-cli init  # all necessary files will be generated in the current directory
Before use, make sure that /data/tests and /var/log/bda/tests exist and that you have write access to them. You can customize the path by modifying logBaseDir in config/[env]/index.js.
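A quick way to verify the directory requirement before the first run is a small check script. This is a minimal sketch using only Node's built-in `fs` module; the helper name `isWritableDir` is hypothetical and not part of floodesh.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of floodesh): returns true when `dir`
// exists, is a directory, and the current user may write to it.
function isWritableDir(dir) {
  try {
    if (!fs.statSync(dir).isDirectory()) return false;
    fs.accessSync(dir, fs.constants.W_OK);
    return true;
  } catch (err) {
    return false; // missing path or insufficient permissions
  }
}

// The default directories named above.
for (const dir of ['/data/tests', '/var/log/bda/tests']) {
  console.log(`${dir}: ${isWritableDir(dir) ? 'ok' : 'missing or not writable'}`);
}
```

If you changed logBaseDir, check that path instead of the defaults.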
A context instance is a kind of finite-state machine implemented with generators, an ECMAScript 6 feature. Through the context, we can access almost all fields of the response and request, like:
worker.use((ctx, next) => {
  // ctx.body is a Buffer; convert it to a string for later middleware
  ctx.content = ctx.body.toString();
  return next();
});
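To see how a `(ctx, next)` middleware chain passes one context along, here is a minimal Koa-style composition sketch. It is an illustration only: `compose`, the fake `ctx` object, and the `contentLength` field are assumptions made for this example, and floodesh's own dispatcher may work differently.

```javascript
// Minimal Koa-style composition, for illustration only.
// Each middleware receives (ctx, next); calling next() runs the following one.
function compose(middlewares) {
  return function run(ctx) {
    function dispatch(i) {
      if (i === middlewares.length) return Promise.resolve();
      try {
        return Promise.resolve(middlewares[i](ctx, () => dispatch(i + 1)));
      } catch (err) {
        return Promise.reject(err);
      }
    }
    return dispatch(0);
  };
}

// A fake context standing in for the one floodesh would provide.
const ctx = { body: Buffer.from('<html>hello</html>') };

const run = compose([
  (ctx, next) => {
    ctx.content = ctx.body.toString(); // Buffer -> string
    return next();
  },
  (ctx, next) => {
    ctx.contentLength = ctx.content.length; // sees what the first middleware set
    return next();
  },
]);

const done = run(ctx).then(() => console.log(ctx.content, ctx.contentLength));
```

Each middleware reads what earlier ones attached to the context, which is why the order of `worker.use` calls matters.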
- String
Get querystring.
- Boolean
Check if the request is idempotent.
- String
Get the search string. Unlike querystring, it includes the leading "?".
- String
Get request method.
- Object
Get parsed query-string.
- String
Get the request pathname.
- String
Return request url, the same as ctx.href.
- String
Get the origin of URL, for instance, "https://www.google.com".
- String
Return the protocol string "http:" or "https:".
- String, hostname:port
Get the host (hostname:port) parsed from the "Host" header field. X-Forwarded-Host is supported when a proxy is enabled.
- String
Get the hostname parsed from the "Host" header field. X-Forwarded-Host is supported when a proxy is enabled.
- Boolean
Check if protocol is https.
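The URL-derived request values described above can be illustrated with Node's built-in WHATWG `URL` class. This is only a sketch of what each value looks like for a made-up URL. It is not how floodesh computes them, and the object keys are labels chosen for this example.

```javascript
// Illustration with Node's built-in URL class; the URL is a made-up example.
const url = new URL('https://www.google.com/search?q=floodesh&page=2');

const request = {
  querystring: url.search.slice(1),            // without the leading "?"
  search: url.search,                          // with the leading "?"
  query: Object.fromEntries(url.searchParams), // parsed query-string
  path: url.pathname,
  origin: url.origin,
  protocol: url.protocol,                      // "http:" or "https:"
  host: url.host,                              // port is omitted when it is the default
  hostname: url.hostname,
  secure: url.protocol === 'https:',
};

console.log(request);
```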
- Number
Get status code from response.
- String
Get status message from response.
- Buffer
Get the response body in Buffer.
- Number
Get length of response body.
- String
Get the response mime type, for instance, "text/html".
- Date
Get the Last-Modified date in Date form, if it exists.
- String
Get the ETag of a response.
- Object
Return the response header.
- String
- Parameter: String
- Return: String
Get a value by key from the response headers.
- Parameter: String|Array (one or more mime types)
- Return: String|false|null
Check if the incoming response contains the "Content-Type" header field and whether it matches any of the given mime types. If there is no response body, null is returned. If there is no content type, false is returned. Otherwise, it returns the first type that matches.
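The three return cases of the mime-type check can be sketched as a standalone function. `matchType` is a hypothetical helper written for this example, not floodesh's implementation. It compares mime types exactly, and returning false when nothing matches is an assumption, since the text above does not cover that case.

```javascript
// Hypothetical helper mirroring the rules described above.
function matchType(response, types) {
  if (!response.body || response.body.length === 0) return null; // no body
  const header = response.headers['content-type'];
  if (!header) return false; // no content type
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mime = header.split(';')[0].trim().toLowerCase();
  const candidates = Array.isArray(types) ? types : [types];
  const hit = candidates.find((t) => t.toLowerCase() === mime);
  return hit === undefined ? false : hit; // assumption: no match gives false
}

const htmlResponse = {
  headers: { 'content-type': 'text/html; charset=utf-8' },
  body: Buffer.from('<html></html>'),
};
const noTypeResponse = { headers: {}, body: Buffer.from('x') };
const emptyResponse = { headers: { 'content-type': 'text/html' }, body: Buffer.alloc(0) };

console.log(matchType(htmlResponse, ['application/json', 'text/html']));
console.log(matchType(noTypeResponse, 'text/html'));
console.log(matchType(emptyResponse, 'text/html'));
```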
- Array
Array of pending crawling tasks. A task is an object consisting of Options and next, where next is the name of the function in your spider to call for the next task. Supported format:
- Map
A map for storing results, which floodesh will parse and save.
- mof-cheerio: A simple wrapper of cheerio.
- mof-charsetparser: Parse the charset in response headers.
- mof-iconv: Encoding converter middleware (iconv-based).
- mof-request: A wrapper of request, with some default options.
- mof-bottleneck: A wrapper of bottleneck, which is an asynchronous rate limiter with priority.
- mof-proxy: Acquires proxies from a proxy service.
- mof-whacko: A wrapper of whacko, which is a fork of cheerio that uses parse5 as an underlying platform.
- mof-statsd: A wrapper of a statsd client, which enables you to send metrics to a statsd daemon.
- mof-uarotate: Rotate the User-Agent header automatically from a local file.
- mof-seenreq: Only makes sense in floodesh; a simple wrapper of seenreq.
- mof-validbody: Check if a response body matches a pattern, for instance, an HTML body should start with < and a JSON body with {.
- mof-statuscode: Status code detector.
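The kind of check that mof-validbody describes can be sketched in a few lines. The function `looksValid` and its `kind` argument are hypothetical, written only to illustrate the idea. The real middleware's options and behavior may differ.

```javascript
// Hypothetical check in the spirit of mof-validbody.
function looksValid(body, kind) {
  const text = body.toString().trimStart(); // ignore leading whitespace
  if (kind === 'html') return text.startsWith('<');
  if (kind === 'json') return text.startsWith('{');
  return true; // unknown kinds are not checked in this sketch
}

console.log(looksValid(Buffer.from('<!DOCTYPE html><html></html>'), 'html'));
console.log(looksValid(Buffer.from('Access denied'), 'html'));
console.log(looksValid(Buffer.from(' {"ok": true}'), 'json'));
```

A check like this catches responses such as block pages or truncated bodies that arrive with a successful status code.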