WIP: Scrape all memberships using cookies #17
base: master
Conversation
This maps between a Spanish date and an ISO representation of that date.
Following links with session information doesn't work correctly with Ruby's open-uri, so rather than using Capybara this simply strips away the session information.
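For illustration only (this is not the decorator's actual code, and the parameter name is made up), the general idea of stripping session information from a followed link might look like:

```ruby
require 'uri'

# Hypothetical sketch: drop a session-carrying query parameter from a URL
# before it is requested, so open-uri doesn't trip over it.
def strip_session(url, param = 'session')
  uri = URI.parse(url)
  return url if uri.query.nil?
  kept = URI.decode_www_form(uri.query).reject { |key, _| key == param }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end
```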
This is the scraper class for the individual pages. It's a rough port of the old procedural version, but tidied up to take advantage of Scraped's class approach.
This is responsible for getting the lists of members and handling pagination.
This switches over to the new Scraped-based classes.
…factor: This branch contains a fix for the AbsoluteUrl decorator, which currently raises an exception when it encounters an invalid `mailto:` link. We've bumped into that issue, so we point to that branch to avoid it for now.
This enables a request to be sent with a cookie.
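The essential idea is just sending a Cookie header along with the request. A minimal sketch of that idea with open-uri (not the strategy class itself, whose name comes from the diff below):

```ruby
require 'open-uri'

# Sketch: open-uri passes string-keyed options through as request headers,
# so the session cookie rides along with the GET.
def fetch_with_cookie(url, cookie)
  URI.parse(url).read('Cookie' => cookie)
end
```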
Represents a list of memberships on a member’s profile page.
Represents a single membership in the membership list.
I know this is still WIP, but I thought it worth adding some comments at this point…
@@ -15,3 +15,4 @@ gem "colorize"
gem "capybara"
gem "poltergeist"
Presumably these two can be removed now?
(NB: the inconsistent formatting here implies that this Gemfile also needs to be brought up to current standards, perhaps as a separate PR?)
require 'scraped'
require_relative 'remove_session_from_url_decorator'
require_relative 'parsed_date_string'
require_relative 'strategies/live_request_with_cookie'
It seems odd that there's a 'strategies/' directory, but not, say, a 'decorators/'. I don't really care whether everything is in separate directories or all in one, but consistency would be good.
end_date = noko.xpath('.//div[@class="dip_rojo"][contains(.,"Causó baja")]')
               .text.match(/(\d+)\/(\d+)\/(\d+)\./)
return if end_date.nil?
end_date.captures.reverse.join('-')
As this code is repeated, and we've already factored out one Date parsing class, probably a good plan to factor this out too.
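For example (class and method names here are illustrative, not taken from the PR), the repeated end-date parsing could live in one small class:

```ruby
# Illustrative sketch of a factored-out parser: turns the first dd/mm/yyyy
# found in a piece of text into a yyyy-mm-dd string, as the inline code
# above does.
class SlashDate
  def initialize(text)
    @text = text.to_s
  end

  def to_iso
    match = @text.match(%r{(\d+)/(\d+)/(\d+)})
    return if match.nil?
    match.captures.reverse.join('-')
  end
end

# e.g. end_date = SlashDate.new(causo_baja_text).to_iso
```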
field :photo do
  foto = noko.at_css('#datos_diputado img[name="foto"]')
  return if foto.nil?
  foto[:src]
Our normal approach here is to fetch the `src` as part of the `css` call (i.e. `img[name="foto"]/@src`). If there's a reason why that doesn't work here, it's probably worth a comment as to why.
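If Nokogiri's `/@attr` extension to CSS selectors behaves on this page as it does elsewhere, that would look something like:

```ruby
field :photo do
  # Select the src attribute node directly; nil if the image is absent.
  src = noko.at_css('#datos_diputado img[name="foto"]/@src')
  src.text unless src.nil?
end
```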
lib/member_page.rb
Outdated
end

field :memberships_list do
  req = Scraped::Request.new(url: memberships_url, strategies: [{ strategy: LiveRequestWithCookie, cookie: cookie }])
This seems very out of place here. This conceptually doesn't seem like it's the responsibility of this class, and in practice also seems like it's largely an inversion of what we'd want the flow to be.
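One possible shape for that inversion, as a sketch only (the class and method names here are guesses, not the PR's code): the top-level scraper owns the cookie-aware request and hands the response to a page class that only parses.

```ruby
# Sketch: the scraper builds the request; the page class just parses it.
response = Scraped::Request.new(
  url: member.memberships_url,
  strategies: [{ strategy: LiveRequestWithCookie, cookie: cookie }]
).response

memberships = MembershipsListPage.new(response: response).memberships
```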
lib/membership_row.rb
Outdated
end

field :faction do
  constituency_and_faction_line.rpartition('(').last.split(')').first.tidy
I don't know what this line looks like (a comment would be handy), but immediately calling `.first` (or `.last`) on `.rpartition` is a bit of a warning sign that that's probably not the best approach. If you had a separate method that split the line up, which `:constituency` and `:faction` each used to get the part they cared about, that might make more sense. But without knowing what this line can look like, it's hard to know what the best approach is likely to be. In general, though, splitting on '(' and then again on ')' is unlikely to be a great approach, unless you're very deliberately needing to cope with odd layouts, potentially with oddly nested brackets (in which case you need more explanation). This is one where a regex is generally going to be a lot clearer, especially when moved to a method that gets reused across both places that have to parse this line.
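As a sketch of that regex suggestion (the assumed line format, something like 'Diputado por Madrid (G.P. Popular)', is a guess, as are the method names):

```ruby
# Hypothetical shared parser for the constituency/faction line; both
# fields read from the same named captures.
LINE_RE = /\A(?<constituency>.+)\s*\((?<faction>[^)]+)\)\s*\z/

def line_parts
  constituency_and_faction_line.tidy.match(LINE_RE)
end

field :constituency do
  line_parts && line_parts[:constituency].tidy
end

field :faction do
  line_parts && line_parts[:faction].tidy
end
```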
end

def month(str)
  ['', 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'].find_index(str.downcase) || raise("Unknown month #{str}".magenta)
I very much doubt we actually want that `.magenta` there; we should be getting rid of `colorize` where possible, as it doesn't really play well with Morph.
end
mem.delete(:memberships_list)
all_memberships.push mem
I'm finding it very difficult to follow this logic, but since I suspect the whole approach is off due to the misplaced `memberships_list` field, I'm not sure it's worth digging too deeply at this point.
unless next_page.nil?
  scrape_people(next_page['href'])

def person_data(data)
  unwanted_keys = %i(party faction faction_id start_date end_date constituency term memberships_list)
In general, if you're going to build a list solely for the purpose of looking things up in it, you should make it a Set instead. But the whole approach here seems slightly off to me, though I can't really tell what you're trying to do.
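For instance, assuming `person_data` is meant to return the hash with those keys dropped:

```ruby
require 'set'

# Keys copied from the diff above; a Set gives O(1) membership checks and
# signals that this is a lookup structure, not an ordered list.
UNWANTED_KEYS = Set.new(%i(party faction faction_id start_date end_date constituency term memberships_list))

def person_data(data)
  data.reject { |key, _| UNWANTED_KEYS.include?(key) }
end
```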
ScraperWiki.save_sqlite([:id, :term], data)
ScraperWiki.sqliteexecute('DELETE FROM data') rescue nil
merged_memberships.each do |mem|
  ScraperWiki.save_sqlite([:name, :term], mem)
Keying on `name` here seems wrong: can you guarantee that there have never been two members with the same name? (And can a Member only have a single membership per term?)
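If the merged membership hashes carry the member's id (the person data is already keyed on `[:id, :term]`), something along these lines would avoid name collisions; whether the extra `start_date` key is needed depends on whether a member can hold more than one membership per term:

```ruby
# Sketch only: key memberships on the member id (plus term and start date)
# rather than the display name.
ScraperWiki.save_sqlite(%i(id term start_date), mem)
```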
Note:
This scraper depends on the correct cookies being sent with each request to a memberships list page. The local cache will need clearing out if you're running this branch for the first time.
Step one:
The scraper visits each member page, collecting data from the main profile tab and additional membership information from the 'all memberships' tab.
Step two:
Additional memberships are merged with the person data from the member profile page to create separate membership hashes.
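A rough sketch of that two-step flow (the method and variable names are illustrative, and the cookie handling described above is glossed over):

```ruby
# Step one: scrape every member page into a hash of fields.
members = member_urls.map do |url|
  MemberPage.new(response: Scraped::Request.new(url: url).response).to_h
end

# Step two: expand each member's extra memberships into separate hashes,
# each merged with the person data from the profile page.
memberships = members.flat_map do |member|
  member[:memberships_list].map { |extra| person_data(member).merge(extra) }
end
```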