
WIP: Scrape all memberships using cookies #17

Open · wants to merge 20 commits into master

Conversation

@ondenman ondenman commented Mar 14, 2017

Note:
This scraper depends on the correct cookies being sent with each request to a memberships list page. The local cache will need clearing out if you're running this branch for the first time.

Step one:
The scraper scrapes each member page: data from the main profile tab and additional membership information from the 'all memberships' tab.

Step two:
Additional memberships are merged with the person data from the member profile page to create separate membership hashes.
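A rough sketch of that merge step in Ruby (method and field names here are illustrative, not the branch's actual code):

```ruby
# Hedged sketch of step two: each additional membership hash is merged
# with the person data from the profile tab, producing one hash per
# membership. Names are illustrative.
def merged_memberships(person, additional_memberships)
  additional_memberships.map { |membership| person.merge(membership) }
end
```

Each resulting hash carries both the person's fields and one membership's fields, ready to be saved as a separate row.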

chrismytton and others added 14 commits February 20, 2017 13:30
This maps between a Spanish date and an ISO representation of that date.
Following links with session information doesn't work correctly with
Ruby's open-uri, so rather than using Capybara this simply strips away
the session information.
This is the scraper class for the individual pages. It's a rough port of
the old procedural version, but tidied up to take advantage of Scraped's
class approach.
This is responsible for getting the lists of members and handling
pagination.
This switches over to the new Scraped-based classes.
…factor

This branch contains a fix for the AbsoluteUrl decorator where an
exception is raised when an invalid `mailto:` link is encountered. We've
bumped into that issue, so point to that branch to avoid it for now.
This enables a request to be sent with a cookie.
Represents a list of memberships on a member’s profile page
Represents a single membership in the membership list.
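A cookie-carrying request of the kind these commits describe could look roughly like this with Ruby's standard library (a sketch only; the branch implements this as a Scraped request strategy, and `fetch_with_cookie` is a hypothetical name):

```ruby
require 'net/http'
require 'uri'

# Hedged sketch: fetch a page with a Cookie header via Net::HTTP.
# The branch wraps this idea in a LiveRequestWithCookie strategy for
# Scraped; this standalone version is only illustrative.
def fetch_with_cookie(url, cookie)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['Cookie'] = cookie
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).body
  end
end
```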
@tmtmtmtm (Contributor) left a comment

I know this is still WIP, but I thought it worth adding some comments at this point…

@@ -15,3 +15,4 @@ gem "colorize"
gem "capybara"
gem "poltergeist"

Presumably these two can be removed now?

(NB: the inconsistent formatting here implies that this Gemfile also needs to be brought up to current standards, perhaps as a separate PR?)

require 'scraped'
require_relative 'remove_session_from_url_decorator'
require_relative 'parsed_date_string'
require_relative 'strategies/live_request_with_cookie'

it seems odd that there's a 'strategies/' directory, but not, say, a 'decorators/'. I don't really care whether everything is in separate directories or all in one, but consistency would be good.

end_date = noko.xpath('.//div[@class="dip_rojo"][contains(.,"Causó baja")]')
.text.match(/(\d+)\/(\d+)\/(\d+)\./)
return if end_date.nil?
end_date.captures.reverse.join('-')

As this code is repeated, and we've already factored out one Date parsing class, probably a good plan to factor this out too.
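The repeated `dd/mm/yyyy` → ISO conversion could be factored into something like this (a sketch; the class name is hypothetical):

```ruby
# Hedged sketch of a shared date helper for the repeated
# "14/3/2017." style strings. Class name is illustrative.
class SlashDate
  def initialize(str)
    @str = str
  end

  # "14/3/2017" -> "2017-03-14"; nil if no date is present
  def to_iso
    match = @str.to_s.match(%r{(\d+)/(\d+)/(\d+)})
    return if match.nil?
    match.captures.reverse.map { |part| part.rjust(2, '0') }.join('-')
  end
end
```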

field :photo do
foto = noko.at_css('#datos_diputado img[name="foto"]')
return if foto.nil?
foto[:src]

Our normal approach here is to fetch the src as part of the css call (i.e. img[name="foto"]/@src). If there's a reason why that doesn't work here, it's probably worth a comment as to why.

end

field :memberships_list do
req = Scraped::Request.new(url: memberships_url, strategies: [{ strategy: LiveRequestWithCookie, cookie: cookie }])

This seems very out of place here. This conceptually doesn't seem like it's the responsibility of this class, and in practice also seems like it's largely an inversion of what we'd want the flow to be.

end

field :faction do
constituency_and_faction_line.rpartition('(').last.split(')').first.tidy

I don't know what this line looks like (a comment would be handy), but immediately calling .first (or .last) on .rpartition is a bit of a warning sign that it's probably not the best approach. If you had a separate method that split the line up, which :constituency and :faction each used to get the part they cared about, that might make more sense. But without knowing what this line can look like, it's hard to know what the best approach is likely to be.

In general, though, splitting on '(' and then again on ')' is not likely to be a great approach, unless you're very deliberately needing to cope with odd layouts, potentially with oddly nested brackets (in which case you need more explanation). This is one where a regex is generally going to be a lot clearer, especially when moved to a method that gets reused across both places that have to parse this line.
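A regex version of the kind the comment suggests might look like this, assuming the line has the shape `…constituency… (…faction…)` (the real format may differ):

```ruby
# Hedged sketch: parse "<constituency> (<faction>)" with one regex,
# shared by both fields. The assumed line shape is hypothetical.
CONSTITUENCY_AND_FACTION = /\A(?<constituency>.+?)\s*\((?<faction>[^)]+)\)\s*\z/

def constituency_and_faction(line)
  match = line.match(CONSTITUENCY_AND_FACTION)
  return {} if match.nil?
  { constituency: match[:constituency], faction: match[:faction] }
end
```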

end

def month(str)
['', 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'].find_index(str.downcase) || raise("Unknown month #{str}".magenta)

I very much doubt we actually want that .magenta there — we should be getting rid of colorize where possible, as it doesn't really play well with Morph.
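Dropping colorize, the same lookup can raise a plain error (a sketch of the suggested change):

```ruby
# Hedged sketch of the month lookup without colorize: a plain raise
# instead of the .magenta-tinted message.
MONTHS = %w[enero febrero marzo abril mayo junio julio agosto
            septiembre octubre noviembre diciembre].freeze

def month(str)
  index = MONTHS.index(str.downcase)
  raise "Unknown month #{str}" if index.nil?
  index + 1
end
```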

end
mem.delete(:memberships_list)
all_memberships.push mem

I'm finding it very difficult to follow this logic, but as I suspect the whole approach is suspect due to the misplaced memberships_list field, I'm not sure it's worth digging too deeply at this point.

unless next_page.nil?
scrape_people(next_page['href'])
def person_data(data)
unwanted_keys = %i(party faction faction_id start_date end_date constituency term memberships_list)

In general, if you're going to build a list solely for the purpose of looking things up in it, you should make it a Set instead. But the whole approach here seems slightly off to me, though I can't really tell what you're trying to do.
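Using a Set as the comment suggests (key list copied from the snippet above; the surrounding method shape is illustrative):

```ruby
require 'set'

# Hedged sketch: a Set gives O(1) membership tests for the keys we
# want to strip from the person data.
UNWANTED_KEYS = Set[:party, :faction, :faction_id, :start_date,
                    :end_date, :constituency, :term, :memberships_list]

def person_data(data)
  data.reject { |key, _| UNWANTED_KEYS.include?(key) }
end
```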

ScraperWiki.save_sqlite([:id, :term], data)
ScraperWiki.sqliteexecute('DELETE FROM data') rescue nil
merged_memberships.each do |mem|
ScraperWiki.save_sqlite([:name, :term], mem)

Keying on name here seems wrong — can you guarantee that there have never been two members with the same name?

(And can a Member only have a single membership per term?)
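If a stable identifier is available, keying on it instead of name avoids that collision risk (a sketch; field names are illustrative):

```ruby
# Hedged sketch: deduplicate memberships on a stable id + term rather
# than on name, since two members could share a name.
def unique_memberships(memberships)
  memberships.uniq { |mem| mem.values_at(:id, :term) }
end
```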
