Added only_links_like Feature #50

Open
wants to merge 1 commit into
base: next
22 changes: 20 additions & 2 deletions lib/anemone/core.rb
@@ -77,6 +77,7 @@ def initialize(urls, opts = {})
@on_every_page_blocks = []
@on_pages_like_blocks = Hash.new { |hash,key| hash[key] = [] }
@skip_link_patterns = []
@only_link_patterns = []
@after_crawl_blocks = []
@opts = opts

@@ -111,6 +112,15 @@ def skip_links_like(*patterns)
self
end

#
# Add one or more Regex patterns; only links whose URLs match
# at least one of them are followed
#
def only_links_like(*patterns)
@only_link_patterns.concat [patterns].flatten.compact
self
end

#
# Add a block to be executed on every Page as they are encountered
# during the crawl
@@ -292,10 +302,18 @@ def skip_query_string?(link)

#
# Returns +true+ if *link* should not be visited because
# its URL matches a skip_link pattern.
# its URL matches a skip_link pattern or does not match any only_link pattern.
#
def skip_link?(link)
@only_link_patterns.empty? ?
  @skip_link_patterns.any? { |pattern| link.path =~ pattern } :
  !@only_link_patterns.any? { |pattern| link.path =~ pattern }

@hartator would you care to resubmit this in the Medusa fork?

@skip_link_patterns.any? { |pattern| link.path =~ pattern }
unless @only_link_patterns.empty?
if @only_link_patterns.any? { |pattern| link.path =~ pattern }
return false
else
return true
end
else
@skip_link_patterns.any? { |pattern| link.path =~ pattern }
end
end

end
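
The decision logic in the `skip_link?` change above can be exercised outside Anemone. The following is a minimal sketch, assuming a hypothetical helper `skip_path?` that takes a path string and the two pattern lists as arguments; it is not part of the PR or of Anemone. It mirrors the behavior as written in the diff, including the fact that skip patterns are not consulted once any only pattern is set.

```ruby
# Hypothetical standalone helper mirroring skip_link? from the diff.
# It works on a plain path string instead of a URI and is not Anemone API.
def skip_path?(path, skip_patterns, only_patterns)
  if only_patterns.empty?
    # No only patterns: skip when any skip pattern matches.
    skip_patterns.any? { |pattern| path =~ pattern }
  else
    # Only patterns present: skip unless at least one of them matches.
    # Skip patterns are ignored in this branch, as in the diff.
    only_patterns.none? { |pattern| path =~ pattern }
  end
end

puts skip_path?('/1', [], [/1/, /3/])   # matches an only pattern
puts skip_path?('/2', [], [/1/, /3/])   # matches no only pattern
puts skip_path?('/2', [/2/], [])        # matches a skip pattern
puts skip_path?('/1', [/1/], [/1/])     # only pattern wins over skip pattern
```

The last call shows the precedence: with both lists set, a path matching an only pattern is followed even if a skip pattern also matches it. Whether that precedence is intended is a question for the PR author.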
16 changes: 16 additions & 0 deletions spec/core_spec.rb
@@ -109,6 +109,22 @@ module Anemone
core.pages.keys.should_not include(pages[1].url)
core.pages.keys.should_not include(pages[3].url)
end

it "should be able to follow only links based on a RegEx" do
pages = []
pages << FakePage.new('0', :links => ['1', '2'])
pages << FakePage.new('1')
pages << FakePage.new('2')
pages << FakePage.new('3')

core = Anemone.crawl(pages[0].url, @opts) do |a|
a.only_links_like /1/, /3/
end

core.should have(2).pages
core.pages.keys.should include(pages[1].url)
core.pages.keys.should_not include(pages[3].url)
end

it "should be able to call a block on every page" do
pages = []