Scraperry is a small API application that grabs headers and link URLs from a webpage.
Scraperry uses:
- ActiveModelSerializers for JSON generation
- Nokogiri for HTML parsing
- Rails::API for the backend
- RSpec for testing
- Sucker Punch for background jobs
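Sucker Punch runs jobs on in-process thread pools. The sketch below illustrates that lifecycle with a plain Thread and a hypothetical `PageParseJob`; it is not Scraperry's actual job class, and a real job would `include SuckerPunch::Job` instead of spawning a thread by hand.

```ruby
# Illustration of the async lifecycle only: a hypothetical job class that
# mimics Sucker Punch's perform_async with a plain Thread (the real gem
# uses `include SuckerPunch::Job` and a managed thread pool).
class PageParseJob
  def self.perform_async(page)
    Thread.new { new.perform(page) }
  end

  def perform(page)
    # In Scraperry this is where the page would be fetched and parsed with
    # Nokogiri; here we only flip the status to show the state change.
    page[:status] = "parsed"
  end
end

page = { url: "https://www.example.com", status: "requested" }
PageParseJob.perform_async(page).join
page[:status] # => "parsed"
```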
Add this line to your application's Gemfile:
gem 'active_model_serializers', '~> 0.10.0'
And then execute:
$ bundle
$ bundle exec rake db:setup
$ bundle exec rake
Make sure all tests are green.
GET /api/pages
curl -H 'Accept: application/vnd.scraperry.v1' \
http://localhost:3000/api/pages
[
  {
    "id": 1,
    "url": "https://www.google.com",
    "status": "requested",
    "updated_at": "2016-08-05T10:50:43.747Z",
    "headers": [],
    "links": []
  },
  {
    "id": 2,
    "url": "https://www.facebook.com",
    "status": "requested",
    "updated_at": "2016-08-05T13:24:51.460Z",
    "headers": [],
    "links": []
  },
  {
    "id": 4,
    "url": "https://www.facebook.com",
    "status": "parsed",
    "updated_at": "2016-08-06T04:43:34.734Z",
    "headers": [
      { "tag": "h1", "content": "Facebook" },
      { "tag": "h2", "content": "Javascript pada browser Anda tidak diaktifkan." },
      { "tag": "h2", "content": "Pemeriksaan Keamanan" }
    ],
    "links": [
      { "url": "https://ar-ar.facebook.com/" },
      { "url": "https://de-de.facebook.com/" },
      { "url": "https://developers.facebook.com/?ref=pf" },
      { "url": "https://en-gb.facebook.com/" },
      { "url": "https://es-la.facebook.com/" },
      { "url": "https://fr-fr.facebook.com/" },
      { "url": "https://ja-jp.facebook.com/" },
      { "url": "https://jv-id.facebook.com/" },
      { "url": "https://ko-kr.facebook.com/" },
      { "url": "https://messenger.com/" },
      { "url": "https://ms-my.facebook.com/" },
      { "url": "https://pt-br.facebook.com/" },
      { "url": "https://www.facebook.com/" },
      { "url": "https://www.facebook.com/help/568137493302217" }
    ]
  }
]
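The List response can be consumed with Ruby's stdlib JSON module. The snippet below works against a trimmed, hardcoded copy of the payload above (so it runs without a server) and pulls out the heading contents of pages that finished parsing:

```ruby
require 'json'

# Abridged copy of the GET /api/pages response shown above.
body = '[{"id":1,"url":"https://www.google.com","status":"requested","headers":[],"links":[]},' \
       '{"id":4,"url":"https://www.facebook.com","status":"parsed",' \
       '"headers":[{"tag":"h1","content":"Facebook"}],' \
       '"links":[{"url":"https://messenger.com/"}]}]'

pages  = JSON.parse(body)
parsed = pages.select { |p| p["status"] == "parsed" }

headings = parsed.flat_map { |p| p["headers"].map { |h| h["content"] } }
# => ["Facebook"]
```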
POST /api/pages
curl -H 'Accept: application/vnd.scraperry.v1' \
-H "Content-Type: application/json" \
-X POST \
-d '{ "page": { "url": "https://www.facebook.com" } }' \
http://localhost:3000/api/pages
{"id":5,"url":"https://www.facebook.com","status":"requested"}
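The same request can be issued from Ruby with stdlib Net::HTTP. The sketch below only builds the request (sending it is left commented out so it does not require a running server); `build_create_request` is a hypothetical helper, not part of Scraperry:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Builds (but does not send) the same request as the curl example above.
def build_create_request(url)
  uri = URI("http://localhost:3000/api/pages")
  req = Net::HTTP::Post.new(uri)
  req['Accept']       = 'application/vnd.scraperry.v1'
  req['Content-Type'] = 'application/json'
  req.body = JSON.generate(page: { url: url })
  [uri, req]
end

uri, req = build_create_request("https://www.facebook.com")
# To actually send it against a running server:
# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```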
Parsing is not done immediately; it is queued in the background. You can check the results via the List endpoint.
Only unique URLs from the requested page are scraped and saved.
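That deduplication can be pictured with `Array#uniq`; the anchor list below is a made-up example, not Scraperry's actual persistence code:

```ruby
# Hypothetical set of anchors found on one page, including a duplicate.
raw_links = [
  { url: "https://www.facebook.com/" },
  { url: "https://messenger.com/" },
  { url: "https://www.facebook.com/" }
]

# Keep only unique URLs before saving, mirroring the behaviour above.
unique_urls = raw_links.map { |l| l[:url] }.uniq
# => ["https://www.facebook.com/", "https://messenger.com/"]
```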
If you find a bug, feel free to open an Issue. I'd be happy to discuss it with you.
Things to be added:
- A limit argument for the List endpoint to cap the number of web pages returned
- Setup a live demo
Enjoy! - Taufiq -