Refactor to use ScrapedPage subclasses #14
base: master
Conversation
Force-pushed from f2feae0 to a2da5d1.
These pick up the other fields that the existing scraper was getting.
Force-pushed from c46019e to bf6b8a4.
Use the URI module to make it easier to replace just the unwanted part of the query string.
This means that, rather than having to override the url in every class in the system, we can do it once here and inherit from this class.
This should become part of ScrapedPage RSN.
This means these pages also have the session information stripped from the url.
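A minimal sketch of that approach, assuming `ScrapedPage` exposes the original page URL as `url` (the `SESSION_PARAM` pattern is a guess; use whatever congreso.es actually puts in the query string):

```ruby
require 'scraped_page'
require 'uri'

class SpanishCongressPage < ScrapedPage
  SESSION_PARAM = /^_piref/ # hypothetical session key

  # Strip the session information once here; every subclass inherits the clean URL.
  def url
    uri = URI.parse(super)
    params = URI.decode_www_form(uri.query.to_s).reject { |key, _| key =~ SESSION_PARAM }
    uri.query = params.empty? ? nil : URI.encode_www_form(params)
    uri.to_s
  end
end
```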
Rather than just scraping the first page, scrape all pages.
On the last page there is no next link, so trying to call `next[:href]` blows up. Instead, I've moved `next_page_link` into a separate method and check that it exists in `next_page_url`.
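A sketch of that guard (method names from the PR; the selector is a placeholder):

```ruby
def next_page_link
  noko.at_css('a.next') # placeholder selector; nil on the last page
end

def next_page_url
  next_page_link[:href] unless next_page_link.nil?
end
```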
Force-pushed from bf6b8a4 to 4fb737a.
Most gems we need now come as dependencies of scraped_page.
Some comments in place, but the bigger problem here is that this doesn't seem to be fetching the history for each person.
Each person page has a 'Todas las legislaturas' link that goes to a page that lists all the different term memberships they've had: e.g. http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm?idDiputado=174&idLegislatura=7
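A hypothetical way to pick that link up as a field on `MemberPage` (the field name and XPath are guesses):

```ruby
field :all_terms_url do
  link = noko.xpath('//a[contains(., "Todas las legislaturas")]').first
  link[:href] unless link.nil?
end
```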
gem "capybara" | ||
gem "poltergeist" | ||
gem 'scraped_page_archive', github: "everypolitician/scraped_page_archive", branch: "master" | ||
gem 'scraped_page', github: 'everypolitician/scraped_page', branch: 'master' |
Perhaps worth running a rubocop tidy against this file too, so we don't have mismatched quotes like this?
class MemberPage < SpanishCongressPage
  field :iddiputado do
    query['idDiputado']
`query` is a little more ambiguous in this context than when you're already explicitly dealing with a URL. Perhaps something like `query_string` would be a little more obvious?
Yeah good point, I think `query_string` is much clearer, will change it to that.
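A sketch of what that rename might look like, assuming the page URL is available as `url` and each parameter has a single value:

```ruby
require 'cgi'
require 'uri'

def query_string
  # Parse the query string into a Hash of single values,
  # so query_string['idDiputado'] reads the same way as above.
  CGI.parse(URI.parse(url).query.to_s).map { |key, values| [key, values.first] }.to_h
end
```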
  end
  def group
    @group ||= noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:last').text.tidy
Is it really worth memoizing all these things? I don't think they're particularly costly, and some of them aren't even called more than once anyway.
No, just habit, but you're quite right: this is a form of premature optimization.
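Concretely, that just means dropping the instance variable from the line above:

```ruby
def group
  noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:last').text.tidy
end
```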
# frozen_string_literal: true
require_relative 'spanish_congress_page'

class MembersListPage < SpanishCongressPage
A `ScrapedPage` with no fields is a bit of a red flag. The first two of these look like they should be fields to me (and the `next_page_link` a private method).
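A sketch of the suggested shape (the `member_urls` name and both selectors are guesses; it assumes `field` blocks run in instance context and so can call private methods):

```ruby
class MembersListPage < SpanishCongressPage
  field :member_urls do
    noko.css('div.listado_1 a').map { |a| a[:href] } # placeholder selector
  end

  field :next_page_url do
    next_page_link[:href] unless next_page_link.nil?
  end

  private

  def next_page_link
    noko.at_css('a.next') # placeholder selector; nil on the last page
  end
end
```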
This caches a copy of fetched webpages to make the scraper quicker to develop locally.
This changes the scraper to use `ScrapedPage` subclasses, which now handle stripping the session information from the url.
Notes to reviewer
You can see the scraper doing things by running it locally (e.g. the usual morph.io invocation, `bundle exec ruby scraper.rb`).
This is currently missing the archiver, because I've hit a couple of `scraped_page_archive` bugs relating to VCR (I think because the archive branch on this repo contains some non-standard cassettes, so it was a good test case!).
Notes to merger
Set https://morph.io/everypolitician-scrapers/spain_congreso_es to auto-run once this is merged.