Could not run the test crawl #21

Open
madankb opened this issue May 11, 2014 · 10 comments

Comments

@madankb

madankb commented May 11, 2014

  1. Started redis-server

  2. Ran bundle exec ./test/test_crawl.rb -u http://calculatedcontent.com, which gives the error below.
    /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser/scanner.rb:19:in `process': Sourcify::NoMatchingProcError (Sourcify::NoMatchingProcError)
        from cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:40:in `extracted_source'
        from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:22:in `sexp'
        from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:17:in `source'
        from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/methods/to_source.rb:39:in `to_source'
        from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:234:in `crawl'
        from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:49:in `standalone_crawl'
        from ./test/test_crawl.rb:27:in `<main>'

I am using ruby version 2.1.1.

@charlesmartin14
Member

it looks like sourcify is not working properly under ruby 2.1.1

we need to check that this works properly

https://github.com/CalculatedContent/sourcify

or see if we need to migrate to a newer version

https://github.com/ngty/sourcify

the basic design pattern for the crawler is described here

http://charlesmartin14.wordpress.com/2013/08/10/a-ruby-dsl-design-pattern-for-distributed-computing/

@charlesmartin14
Member

the first step is to write some small tests and verify that sourcify is working

@madankb
Author

madankb commented May 12, 2014

I updated my sourcify gem version from 0.5 to 0.6. Then I ran the below mentioned test programs

#1:-
require 'sourcify'

def block_to_s(&blk)
  blk.to_source(:strip_enclosure => true)
end

puts block_to_s {
  str = "Hello"
  str.reverse!
  print str
}

Output:-

str = "Hello"
str.reverse!
print(str)

#2:-

require 'rubygems'
require 'bundler/setup'
require 'cloud-crawler'
require 'trollop'

opts = Trollop::options do
  opt :urls, "urls to crawl", :short => "-u", :multi => true, :default => "http://www.ehow.com"
end

urls = ["http://www.crossfit.com"]
CloudCrawler::crawl(urls, opts) do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if do |lnk|
      text_for(lnk) =~ /Level 1/i
    end
  end
  cc.on_every_page do |page|
    puts page.url.to_s
  end
end

Output :-

/.rvm/gems/ruby-2.1.1@global/gems/bundler-1.5.3/lib/bundler/runtime.rb:220: warning: Insecure world writable dir /usr/local in PATH, mode 040777
/.rvm/gems/ruby-2.1.1@global/gems/bundler-1.5.3/lib/bundler/runtime.rb:220: warning: Insecure world writable dir /usr/local in PATH, mode 040777
I, [2014-05-12T22:26:56.313418 #3636] INFO -- : crawl ["http://www.crossfit.com"] with proc do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if { |lnk| text_for(lnk) =~ /Level 1/i }
  end
  cc.on_every_page { |page| puts(page.url.to_s) }
end
I, [2014-05-12T22:26:56.319176 #3636] INFO -- : initialzing driver for cc
I, [2014-05-12T22:26:56.319305 #3636] INFO -- : loading crawl job = {:url=>"http://www.crossfit.com"}
I, [2014-05-12T22:26:56.327747 #3636] INFO -- : keys on ccmq ["dsl_blocks:2", "auto_dsl_id", "dsl_blocks:1"]
I, [2014-05-12T22:26:56.327813 #3636] INFO -- : submitting CloudCrawler::CrawlJob single (non recurring) job

Previously I was getting the error with sourcify version 0.5. I am still getting the same error with test_crawl.rb.

@charlesmartin14
Member

The sourcify gems probably don't work. We used our own forked version of sourcify because of this, although it might not be working properly in ruby 2.1.

I'll see if I can reproduce the error.

@madankb madankb closed this as completed May 13, 2014
@madankb madankb reopened this May 13, 2014
@charlesmartin14
Member

this is the forked version with the bug fixes

https://github.com/CalculatedContent/sourcify

this should be what bundler installs
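
One way to confirm which sourcify actually gets loaded is to ask RubyGems from inside the process. This is only a sketch: `sourcify_report` is a made-up helper name, not part of cloud-crawler or sourcify, and it assumes the script is started with `bundle exec` so that the Gemfile decides which copy is on the load path.

```ruby
# Try to load sourcify; stay quiet if it is not installed so the
# report below can still say so.
begin
  require 'sourcify'
rescue LoadError
  # sourcify is not available in this environment
end

# Hypothetical helper: describe the ruby version and the sourcify
# gem (version and install path) loaded in the current process.
def sourcify_report
  spec = Gem.loaded_specs['sourcify']
  return "ruby #{RUBY_VERSION}: sourcify is not loaded" if spec.nil?
  "ruby #{RUBY_VERSION}: sourcify #{spec.version} from #{spec.full_gem_path}"
end

puts sourcify_report
```

If the printed path sits under bundler/gems/, the copy came from a git source in the Gemfile rather than from a released gem.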

@madankb
Author

madankb commented May 13, 2014

I tried sourcify from both https://github.com/CalculatedContent/sourcify and https://github.com/ngty/sourcify (Changing the Gemfile). But I am getting the same error. I may need to try installing ruby 1.9.3.

@charlesmartin14
Member

that said, we do need to move to ruby 2.1, so it is useful to look carefully at what is working and what is not

we need to isolate where the bug is
is the bug in sourcify itself?
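
A small probe might help narrow this down. The sketch below records where a block was defined (via the standard `Proc#source_location`) and whether `to_source` succeeds on it. `probe_block` is a hypothetical helper, not something in cloud-crawler, and the broad rescue is there because I am not sure which base class sourcify's errors inherit from.

```ruby
# Load sourcify if it is installed; the probe still runs without it.
begin
  require 'sourcify'
rescue LoadError
  # sourcify is not available in this environment
end

# Hypothetical helper: report the file and line a block came from,
# plus either its recovered source or the class of the error raised.
def probe_block(&blk)
  file, line = blk.source_location
  result = { :file => file, :line => line }
  if blk.respond_to?(:to_source)
    begin
      result[:source] = blk.to_source
    rescue Exception => e
      # Deliberately broad (assumption): sourcify's error classes may
      # not descend from StandardError.
      result[:error] = e.class.name
    end
  else
    result[:error] = 'sourcify not loaded'
  end
  result
end

report = probe_block do |page|
  page.to_s
end
puts report.inspect
```

Running the same probe on a block written the way test_crawl.rb writes its block, and on one that is known to work, would show whether the failure depends on how or where the block is defined.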

@charlesmartin14
Member

but generally yes...the requirements are ruby 1.9.7

@charlesmartin14
Member

to install 1.9.7, i suggest using rvm
this makes it very easy

@lucaswxp

lucaswxp commented Jul 8, 2015

Same problem here, and I'm using ruby 1.9.7 with rvm.
