Medusa is a framework for the ruby language to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
-
Choose the links to follow on each page with
focus_crawl
-
Multi-threaded design for high performance
-
Tracks
301
HTTP redirects -
Allows exclusion of URLs based on regular expressions
-
Records response time for each page
-
Obey robots.txt directives (optional, but recommended)
-
In-memory or persistent storage of pages during crawl, provided by Moneta
-
Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]
Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
require 'medusa' Medusa.crawl('https://www.example.com', depth_limit: 2)
Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
require 'medusa' Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler| crawler.discard_page_bodies = some_flag # Persist all the pages state across crawl-runs. crawler.clear_on_startup = false crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0') crawler.skip_links_like(/private/) crawler.on_pages_like(/public/) do |page| logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}" end # Use an arbitrary logic, page by page, to continue customize the crawling. crawler.focus_crawl(/public/) do |page| page.links.first end end
- moneta
-
for the key/value storage adapters
- nokogiri
-
for parsing the HTML of webpages
- robotex
-
for support of the robots.txt directives
To test and develop this gem, additional requirements are:
-
rspec
-
webmock
Medusa is a revamped version of the defunk anemone gem.
Copyright © 2009 Vertive, Inc.
Copyright © 2020 Mauro Asprea
Released under the MIT License