Falling into Crawl Traps #53

Open
fuzzygroup opened this issue Apr 8, 2017 · 10 comments

@fuzzygroup

Hi,

To start, thank you for an excellent piece of work. Appreciated.

I'm trying to use this to crawl the site http://www.udemy.com/ . I added it to my Gemfile, ran bundle install, and started it up, and everything looked great. What I found was that it fell victim to crawl traps, generating URLs like this:

https://www.udemy.com/courses/photography/mobile-photography/all-courses/?p=324

The actual number of pages on the base URL:

https://www.udemy.com/courses/photography/mobile-photography/all-courses/

is only 3, so it's spidering far, far deeper than needed.

Any suggestions for how to go about addressing this?

What I'm trying to do is build a page_archiver, and my core loop looks like this (it's being executed from a Rake task):

# Counters used inside the block (assumed to be set up earlier in the Rake task).
start_time = Time.now
page_ctr = 0

statistics = CobwebCrawler.new(:cache => 600, :thread_count => 10, :valid_mime_types => ["text/html"]).crawl("http://www.udemy.com") do |page|
  puts "Just crawled #{page[:url]} and got a status of #{page[:status_code]}."
  if page[:mime_type] == "text/html"
    page_ctr += 1
    #puts page.title
    #debugger
    # Archive the page body, keyed by its URL.
    page_archive = PageArchive.find_or_create(page[:body], page[:url])
    total_time = Time.now - start_time
    puts "  Total time: #{total_time}"
    puts "  Total pages: #{page_ctr}"
    puts "  Time per page: #{total_time.to_f / page_ctr}"
  else
    puts "  Not text/html for: #{page[:url]}"
  end
end

After running it for about 20 minutes, it got 10,000 "pages" deep, almost all of which were just "pseudo pages" like the ?p=324 URL.

I didn't see any kind of configuration option that would limit progress so this feels like something internal to the guts of the crawler but if I've missed something, my bad.

Thanks
Scott

@stewartmckee
Owner

As a quick solution, you could exclude the page 4 URL? I'll look at it in more detail when I get back home.

@stewartmckee
Owner

Yeah, if you add the URL for page 4 to your external_urls config option, it should exclude that page. Assuming the subsequent pages are only crawled because it's gone into page 4, the crawl will stop at that point. A better solution would be the ability to prevent the crawl of links within the current page as you are processing it. That way you could detect this issue, maybe because there are no results in the page, and mark the page as not to be processed.

@fuzzygroup
Author

Hi Stewart,

I'm more than willing to take a stab at fixing this. Mind giving me a pointer as to where to best start so I don't make a hash of your nice work?

@stewartmckee
Owner

I think that possibly the best solution would be to include the internal_links data in the hash passed to the block, that way in your code you can add and remove items to and from the list, giving you control over what next steps the crawler will take. That would mean moving the yield above the internal_links.each call on line 121, and changing "internal_links.each" to "content[:internal_links].each" so that updates within the block are now used.

So in your scenario, you would detect there are no items returned in the page body and remove the "?p=" link.

That's my thinking just now; I haven't had a chance to try it out yet, though.
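
A minimal sketch of what that block-side pruning might look like, assuming the proposed change lands and the hash passed to the block carries a mutable :internal_links array of well-formed URL strings. The helper name prune_pagination_links! and the "No results" marker text are hypothetical, chosen only for illustration; neither is part of Cobweb.

```ruby
require "uri"

# Hypothetical helper: when the page just crawled looks empty, drop every
# link carrying a "p" query parameter so the crawler does not follow it.
# Assumes content[:internal_links] is a mutable Array of URL strings, which
# is the behaviour proposed above, not something Cobweb is confirmed to do.
def prune_pagination_links!(content, empty_marker: "No results")
  # Leave the link list untouched when the page has real content.
  return content unless content[:body].to_s.include?(empty_marker)

  content[:internal_links].reject! do |link|
    query = URI.parse(link).query.to_s
    URI.decode_www_form(query).any? { |key, _| key == "p" }
  end
  content
end

# Made-up sample data standing in for the hash yielded to the crawl block.
page = {
  :body => "<html><body>No results</body></html>",
  :internal_links => [
    "https://www.udemy.com/courses/photography/",
    "https://www.udemy.com/courses/photography/mobile-photography/all-courses/?p=5"
  ]
}
prune_pagination_links!(page)
puts page[:internal_links]
```

Inside the crawl block this would be called on the yielded hash before the crawler queues the links. How to detect an "empty" page is site-specific, so the marker text would need to match whatever the target site actually renders.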

@fuzzygroup
Author

So I've been thinking about the solution you propose and I feel like quite a jerk. I was about to tell you that this was a Cobweb level issue -- but it's not. The site itself is buggy:

https://www.udemy.com/courses/photography/mobile-photography/all-courses/

and it has a link to:

So you go to https://www.udemy.com/courses/photography/mobile-photography/all-courses/?p=2

and it has a link to:

So you go to https://www.udemy.com/courses/photography/mobile-photography/all-courses/?p=3

and it has a link to:

which doesn't exist -- YEP -- this is a site level bug and it HAS to be handled in my application-specific logic. The site is generating an infinite succession of pagination offsets even though there are only 3 pages. Sheesh. Your code is doing things exactly right; you probably knew that and I should have dug deeper before even raising the issue. Apologies.

The only question becomes how this gets handled in an extensible way without my having to maintain my own fork -- and I don't have a great answer for this. The only thing that comes to mind is some kind of is_link_valid? method, but given that this is JavaScript pagination, even knowing that the link is invalid is hard.

Kudos by the way for supporting . I don't think I've ever put that into any crawlers I wrote from scratch.

Another possibility is maybe not following <link> tags at all -- but then I see no way to navigate this particular succession of content.

I was able to confirm that one other gem, spidr, has this exact same problem, which doubly confirms that the site is actually buggy. Interestingly, Google doesn't have this issue - perhaps because they simply aren't indexing the pagination, but that's odd for Google.

This is really a problem of identifying duplicate content. An easy way to solve this might be to look at a signature on just the HTML from the <body> tag forward. I took two of these pages that were technically invalid -- the ?p=4 and ?p=5 -- and did a wget on them. Then I diff'ed them and the only difference was the single <link> tag that was invalid. Then I noticed that the <link> tag was in the <head> element. So I removed the HTML up to the <body> tag and diff'ed them again, and at that point they were the same page, i.e. duplicate content.

So one possible approach might be to keep a SHA hash of the content from <body> forward and compare to see if this has already been processed -- or matched the last thing processed (much, much harder in a threaded crawler; been there; fought that battle).
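
A rough sketch of that idea, assuming a plain in-memory set of digests guarded by a mutex so that several crawler threads can share it. It covers the "already processed" check, not the "matched the last thing processed" variant. BodyFingerprints and its methods are hypothetical names, not part of Cobweb; the <body> lookup is a simple case-insensitive search rather than a real HTML parse, and the two sample pages are made up.

```ruby
require "digest"
require "set"

# Hypothetical duplicate-content detector: hashes everything from the <body
# tag onward, so differences confined to <head> (such as a <link> tag) are
# ignored when comparing pages.
class BodyFingerprints
  def initialize
    @seen = Set.new
    @lock = Mutex.new
  end

  # SHA-256 hex digest of the HTML from "<body" to the end of the document.
  # Falls back to hashing the whole document when no <body tag is found.
  def fingerprint(html)
    start = html.index(/<body/i) || 0
    Digest::SHA256.hexdigest(html[start..-1])
  end

  # Returns true if an identical body was seen before; records it otherwise.
  # The mutex keeps the check-and-add step atomic across threads.
  def duplicate?(html)
    digest = fingerprint(html)
    @lock.synchronize { !@seen.add?(digest) }
  end
end

seen = BodyFingerprints.new
page4 = "<html><head><link href=\"?p=5\"></head><body>Courses</body></html>"
page5 = "<html><head><link href=\"?p=6\"></head><body>Courses</body></html>"
puts seen.duplicate?(page4)  # first time this body is seen
puts seen.duplicate?(page5)  # same body, different <head>
```

An exact hash only catches byte-identical bodies, so any per-request noise in the body (timestamps, tokens) would defeat it. That limitation would need checking against the real pages before relying on this.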

Thoughts? I can certainly hack in a fix for my own needs, but it likely means maintaining a fork indefinitely, which always sucks, and identifying duplicate content is really a core crawler issue. Let me know if you're interested in this. I really like cobweb and it's the first open source crawler I've found that really meets my needs, so I'm willing to help get this addressed if you're interested in the help.

Thank you so much.

@fuzzygroup
Author

Hey Stewart -- any thoughts about what I wrote up above? I haven't forked anything yet and I'd really prefer not to because I truly think that this belongs in the core.

@stewartmckee
Owner

There isn't any issue with forking; it's the normal process for contributing to open source projects. If you fork, you can work on adding functionality, documentation or anything within the codebase. Make sure you work in a feature branch, and when you are ready, you can open a pull request against the core repository. We can then work out whether anything needs to change, and then it gets merged into core.

I'll have a quick go at it today and let you know how it goes. Otherwise, feel free to fork and make changes; as long as it is in a branch, it's easy to merge back into core.

@fuzzygroup
Author

I think there is properly done forking -- and you're correct about that -- and there is forking where your goal is to fix something for your own needs. My concern is that this feels to me like a core issue that needs to get addressed by enhancing the core structure and if someone unfamiliar with the code does it (me), they're likely to screw it up badly.

@fuzzygroup
Author

Just to let you know Stewart I am tackling this and I think I have a fairly elegant approach (at least in my head; implementation is where it gets icky).

@fuzzygroup
Author

On my fork I've placed a wiki entry with the start of my proposed change: https://github.com/fuzzygroup/cobweb/wiki/Duplicate-Content-Detection
