Scraperry

About

Scraperry is a small API application that grabs headers (heading tags) and link URLs from a web page.

Scraperry uses:

  1. Active Model Serializers for JSON generation
  2. Nokogiri to parse HTML (see the sketch after this list)
  3. Rails::API for the backend
  4. RSpec for testing
  5. Sucker Punch for asynchronous (background) jobs

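The scraping itself boils down to a few Nokogiri calls. Below is a minimal sketch, not the app's actual code: `parse_page` is a hypothetical helper that pulls heading tags and unique link URLs out of a page fetched with open-uri.

require 'open-uri'
require 'nokogiri'

def parse_page(url)
  doc = Nokogiri::HTML(URI.open(url))

  # Collect every heading tag together with its text.
  headers = doc.css('h1, h2, h3, h4, h5, h6').map do |node|
    { tag: node.name, content: node.text.strip }
  end

  # Collect link URLs, keeping each one only once.
  links = doc.css('a[href]').map { |a| a['href'] }.uniq

  { headers: headers, links: links }
end

parse_page('https://www.google.com')
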
Installation

Add this line to your application's Gemfile:

gem 'active_model_serializers', '~> 0.10.0'

And then execute:

$ bundle
$ bundle exec rake db:setup
$ bundle exec rake

Make sure all tests are green.

List Parsed Web Pages

Definition:

GET /api/pages

Request Example:

curl -H 'Accept: application/vnd.scraperry.v1' \
http://localhost:3000/api/pages

Response Example:

[{"id":1,"url":"https://www.google.com","status":"requested","updated_at":"2016-08-05T10:50:43.747Z","headers":[],"links":[]},{"id":2,"url":"https://www.facebook.com","status":"requested","updated_at":"2016-08-05T13:24:51.460Z","headers":[],"links":[]},{"id":4,"url":"https://www.facebook.com","status":"parsed","updated_at":"2016-08-06T04:43:34.734Z","headers":[{"tag":"h1","content":"Facebook"},{"tag":"h2","content":"Javascript pada browser Anda tidak diaktifkan."},{"tag":"h2","content":"Pemeriksaan Keamanan"}],"links":[{"url":"https://ar-ar.facebook.com/"},{"url":"https://de-de.facebook.com/"},{"url":"https://developers.facebook.com/?ref=pf"},{"url":"https://en-gb.facebook.com/"},{"url":"https://es-la.facebook.com/"},{"url":"https://fr-fr.facebook.com/"},{"url":"https://ja-jp.facebook.com/"},{"url":"https://jv-id.facebook.com/"},{"url":"https://ko-kr.facebook.com/"},{"url":"https://messenger.com/"},{"url":"https://ms-my.facebook.com/"},{"url":"https://pt-br.facebook.com/"},{"url":"https://www.facebook.com/"},{"url":"https://www.facebook.com/help/568137493302217"}]}]

Parse a Web Page

Definition:

POST /api/pages

Request Example:

curl -H 'Accept: application/vnd.scraperry.v1' \
-H "Content-Type: application/json" \
-X POST \
-d '{ "page": { "url": "https://www.facebook.com" } }' \
http://localhost:3000/api/pages

Response Example:

{"id":5,"url":"https://www.facebook.com","status":"requested"}

Parsing is not done immediately; it is queued in the background. You can check the results via the List endpoint.

Only unique URLs from the requested page are scraped and saved.
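
A rough sketch of how that background step could look with Sucker Punch and Nokogiri; the class and model names (ParseJob, Page, Header, Link) are assumptions, not necessarily the app's actual code.

require 'open-uri'
require 'nokogiri'

class ParseJob
  include SuckerPunch::Job

  def perform(page_id)
    page = Page.find(page_id)
    doc  = Nokogiri::HTML(URI.open(page.url))

    # Store every heading tag with its text.
    doc.css('h1, h2, h3, h4, h5, h6').each do |node|
      page.headers.create(tag: node.name, content: node.text.strip)
    end

    # Store each link URL only once per page.
    doc.css('a[href]').map { |a| a['href'] }.uniq.each do |href|
      page.links.find_or_create_by(url: href)
    end

    page.update(status: 'parsed')
  end
end

In this sketch the controller would only create the record with status "requested" and enqueue the job, e.g. ParseJob.perform_async(page.id).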

Getting Help

If you find a bug, feel free to open an Issue. I'd be happy to discuss it with you.

To Do

Things to be added:

  • A limit parameter on the List endpoint to cap the number of web pages returned (a possible shape is sketched after this list)
  • Set up a live demo
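
The limit parameter is not implemented yet; one possible shape for it (names are illustrative only) would be an index action like:

def index
  pages = Page.all
  pages = pages.limit(params[:limit].to_i) if params[:limit].present?
  render json: pages
end

which a client could then call as:

curl -H 'Accept: application/vnd.scraperry.v1' \
'http://localhost:3000/api/pages?limit=10'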

Enjoy! - Taufiq -