Camping script to strip dross from web pages I'm interested in

PeeDee/webstrip

WebStrip

A simple web app to filter unwanted crap out of web pages.

The idea is to take a URL, go there and pull the page's HTML. For pages that are the first of a series, I pull the link block. For pages with a "next" link, I follow the chain. Then for each page I pull the divs I actually want, remove extraneous fluff (banners, scripts) and serve them up with some crude CSS.
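The "follow the chain" step can be sketched roughly as below. This is only an illustration, not the code in webstrip.rb: `follow_chain`, `NEXT_LINK` and the injected `fetch` callable are hypothetical names, and a regex stands in for the Hpricot parsing the app really uses.

```ruby
require "uri"

# Hypothetical pattern: the href of the first <a> whose text is "next".
# A regex is a crude stand-in for real HTML parsing (the app uses Hpricot).
NEXT_LINK = %r{<a\s[^>]*href=["']([^"']+)["'][^>]*>\s*next\s*</a>}i

# Follow "next" links from start_url and return the HTML of each page in order.
# `fetch` is any callable mapping a URL string to HTML text; `limit` caps the
# number of pages so a looping chain cannot run forever.
def follow_chain(start_url, fetch, limit: 20)
  pages = []
  seen = {}
  url = start_url
  while url && !seen[url] && pages.size < limit
    seen[url] = true
    html = fetch.call(url)
    pages << html
    match = NEXT_LINK.match(html)
    # Resolve relative hrefs against the page they were found on.
    url = match && URI.join(url, match[1]).to_s
  end
  pages
end
```

Against live pages the fetcher could be something like `->(u) { URI.open(u).read }` after `require "open-uri"`. Passing it in keeps the chain logic easy to try without touching the network.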

PrintWhatYouLike.com does the same in a more interactive way.

I got this running as a single-page app for about a dozen sites using "camping webstrip.rb" and localhost:3301. But now, after a brief flirtation with Sinatra, I'm basing it directly on Rack, as that seems much less hassle to run on Dreamhost using Passenger. I'm trying to avoid all the dependencies that I can, but you must have the Hpricot gem, open-uri (which ships with Ruby) and, of course, Rack running on Passenger (which DH provides).

To install, just check out the whole tree and point a domain at webstrip/public.

A little better CSS in style_block() would be nice.

The setup is modular, so once you've cloned the repo you can add sites by just dropping their def files into the "sites" directory. The page router loads a Ruby file named "#{URI(url).hostname}.rb" (with '.'s replaced by '_'s), instantiates an object of the class with the same name, passing it the URL, and then asks this object for the filtered page content. Each def file defines a subclass of WebStrip (or LinkedPageSeries or ChainedPageSeries) and overrides clean_page(doc), which takes the Hpricot document parsed from the URL and returns a string of HTML.
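As a rough illustration of the naming rule, here is how the def file for a URL could be worked out. `site_key` and `site_file` are hypothetical helper names, not functions from page_router.rb. The "sites" directory comes from the layout in this README.

```ruby
require "uri"

# Derive the site-definition key from a URL: the hostname with dots
# replaced by underscores, e.g. "www.example.com" -> "www_example_com".
def site_key(url)
  host = URI(url).hostname
  raise ArgumentError, "no hostname in #{url.inspect}" if host.nil?
  host.tr(".", "_")
end

# Path of the def file the router would look for (hypothetical helper).
def site_file(url, sites_dir = "sites")
  File.join(sites_dir, "#{site_key(url)}.rb")
end
```

For example, `site_file("http://www.example.com/articles/1")` gives `"sites/www_example_com.rb"`, so supporting a new site comes down to adding one file with that name.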

Class Hierarchy

PageRouter - implements basic response to rack call(), routes page
WebStrip - class with generic routines, must override clean_page
  |-- LinkedPageSeries - handles articles with a link block
  |   '-- www_..._com - your sub-classes here
  |-- ChainedPageSeries - handles articles with a "next" link
  |   '-- www_..._com - your sub-classes here
  '-- www_..._com - your sub-classes here
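To make the shape of the hierarchy concrete, here is a stripped-down sketch of how the pieces could fit together. None of it is the real code: the `WebStrip` here is a stand-in stub, `ExampleSite` is a hypothetical site class, the `url` query parameter and the injected `fetch` callable are assumptions, and the document is a plain string rather than an Hpricot doc.

```ruby
require "uri"

# Stand-in base class: subclasses must override clean_page.
class WebStrip
  def initialize(url)
    @url = url
  end

  # Receives the parsed document, returns an HTML string.
  def clean_page(doc)
    raise NotImplementedError, "#{self.class} must override clean_page"
  end
end

# Hypothetical site class; the real ones live in sites/ and get an Hpricot doc.
class ExampleSite < WebStrip
  def clean_page(doc)
    "<div>#{doc}</div>"
  end
end

# Minimal Rack-style app: an object whose call(env) returns
# [status, headers, body], where body responds to each.
class PageRouter
  # site_class and fetch are injected here only to keep the sketch testable.
  def initialize(site_class, fetch)
    @site_class = site_class
    @fetch = fetch
  end

  def call(env)
    params = URI.decode_www_form(env["QUERY_STRING"].to_s).to_h
    url = params["url"]
    if url.nil? || url.empty?
      return [400, { "content-type" => "text/plain" }, ["missing url parameter"]]
    end

    body = @site_class.new(url).clean_page(@fetch.call(url))
    [200, { "content-type" => "text/html" }, [body]]
  end
end
```

Because a Rack app is just an object with `call(env)`, a config.ru can hand an instance to Rack's `run`, and the same object can be poked at from irb with a hand-built env hash.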

File Layout

webstrip
  |-- README.markdown
  |-- config.ru
  |-- page_router.rb
  |-- public
  |   |-- favicon.gif
  |   `-- favicon.ico
  |-- sites
  |   `-- www_..._com.rb
  |-- tmp
  |   `-- restart.txt
  `-- webstrip.rb
