WIP: Scrape all memberships using cookies #17
base: master
Conversation
This maps between a Spanish date and an ISO representation of that date.
Following links with session information doesn't work correctly with Ruby's open-uri, so rather than using Capybara this simply strips away the session information.
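For illustration only (this is not the decorator's actual code, and the parameter name is made up), the general idea of stripping session information from a followed link might look like:

```ruby
require 'uri'

# Hypothetical sketch: drop a session-carrying query parameter from a URL
# before it is requested, so open-uri doesn't trip over it.
def strip_session(url, param = 'session')
  uri = URI.parse(url)
  return url if uri.query.nil?
  kept = URI.decode_www_form(uri.query).reject { |key, _| key == param }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end
```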
This is the scraper class for the individual pages. It's a rough port of the old procedural version, but tidied up to take advantage of Scraped's class approach.
This is responsible for getting the lists of members and handling pagination.
This switches over to the new Scraped-based classes.
…factor: This branch contains a fix for the AbsoluteUrl decorator, which currently raises an exception when it encounters an invalid `mailto:` link. We've bumped into that issue, so we point to that branch to avoid it for now.
This enables a request to be sent with a cookie.
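The essential idea is just sending a Cookie header along with the request. A minimal sketch of that idea with open-uri (not the strategy class itself, whose name comes from the diff below):

```ruby
require 'open-uri'

# Sketch: open-uri passes string-keyed options through as request headers,
# so the session cookie rides along with the GET.
def fetch_with_cookie(url, cookie)
  URI.parse(url).read('Cookie' => cookie)
end
```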
Represents a list of memberships on a member’s profile page.
Represents a single membership in the membership list.
I know this is still WIP, but I thought it worth adding some comments at this point…
@@ -15,3 +15,4 @@ gem "colorize"
gem "capybara"
gem "poltergeist"
Presumably these two can be removed now?
(NB: the inconsistent formatting here implies that this Gemfile also needs to be brought up to current standards, perhaps as a separate PR?)
require 'scraped'
require_relative 'remove_session_from_url_decorator'
require_relative 'parsed_date_string'
require_relative 'strategies/live_request_with_cookie'
It seems odd that there's a 'strategies/' directory, but not, say, a 'decorators/'. I don't really care whether everything is in separate directories or all in one, but consistency would be good.
end_date = noko.xpath('.//div[@class="dip_rojo"][contains(.,"Causó baja")]')
               .text.match(/(\d+)\/(\d+)\/(\d+)\./)
return if end_date.nil?
end_date.captures.reverse.join('-')
As this code is repeated, and we've already factored out one Date parsing class, probably a good plan to factor this out too.
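For example (class and method names here are illustrative, not taken from the PR), the repeated end-date parsing could live in one small class:

```ruby
# Illustrative sketch of a factored-out parser: turns the first dd/mm/yyyy
# found in a piece of text into a yyyy-mm-dd string, as the inline code
# above does.
class SlashDate
  def initialize(text)
    @text = text.to_s
  end

  def to_iso
    match = @text.match(%r{(\d+)/(\d+)/(\d+)})
    return if match.nil?
    match.captures.reverse.join('-')
  end
end

# e.g. end_date = SlashDate.new(causo_baja_text).to_iso
```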
field :photo do
  foto = noko.at_css('#datos_diputado img[name="foto"]')
  return if foto.nil?
  foto[:src]
Our normal approach here is to fetch the `src` as part of the `css` call (i.e. `img[name="foto"]/@src`). If there's a reason why that doesn't work here, it's probably worth a comment as to why.
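If Nokogiri's `/@attr` extension to CSS selectors behaves on this page as it does elsewhere, that would look something like:

```ruby
field :photo do
  # Select the src attribute node directly; nil if the image is absent.
  src = noko.at_css('#datos_diputado img[name="foto"]/@src')
  src.text unless src.nil?
end
```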
lib/member_page.rb
Outdated
end

field :memberships_list do
  req = Scraped::Request.new(url: memberships_url, strategies: [{ strategy: LiveRequestWithCookie, cookie: cookie }])
This seems very out of place here. This conceptually doesn't seem like it's the responsibility of this class, and in practice also seems like it's largely an inversion of what we'd want the flow to be.
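One possible shape for that inversion, as a sketch only (the class and method names here are guesses, not the PR's code): the top-level scraper owns the cookie-aware request and hands the response to a page class that only parses.

```ruby
# Sketch: the scraper builds the request; the page class just parses it.
response = Scraped::Request.new(
  url: member.memberships_url,
  strategies: [{ strategy: LiveRequestWithCookie, cookie: cookie }]
).response

memberships = MembershipsListPage.new(response: response).memberships
```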
lib/membership_row.rb
Outdated
end

field :faction do
  constituency_and_faction_line.rpartition('(').last.split(')').first.tidy
I don't know what this line looks like (a comment would be handy), but immediately calling `.first` (or `.last`) on `.rpartition` is a bit of a warning sign that that's probably not the best approach. If you had a separate method that split the line up, which `:constituency` and `:faction` each used to get the part they cared about, that might make more sense. But without knowing what this line can look like, it's hard to know what the best approach is likely to be. In general, though, splitting on '(' and then again on ')' is unlikely to be a great approach, unless you're very deliberately needing to cope with odd layouts, potentially with oddly nested brackets (in which case you need more explanation). This is one where a regex is generally going to be a lot clearer, especially when moved to a method that gets reused across both places that have to parse this line.
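As a sketch of that regex suggestion (the assumed line format, something like 'Diputado por Madrid (G.P. Popular)', is a guess, as are the method names):

```ruby
# Hypothetical shared parser for the constituency/faction line; both
# fields read from the same named captures.
LINE_RE = /\A(?<constituency>.+)\s*\((?<faction>[^)]+)\)\s*\z/

def line_parts
  constituency_and_faction_line.tidy.match(LINE_RE)
end

field :constituency do
  line_parts && line_parts[:constituency].tidy
end

field :faction do
  line_parts && line_parts[:faction].tidy
end
```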
end

def month(str)
  ['', 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'].find_index(str.downcase) || raise("Unknown month #{str}".magenta)
I very much doubt we actually want that `.magenta` there; we should be getting rid of `colorize` where possible, as it doesn't really play well with Morph.
end
mem.delete(:memberships_list)
all_memberships.push mem
I'm finding it very difficult to follow this logic, but since I suspect the whole approach is off due to the misplaced `memberships_list` field, I'm not sure it's worth digging too deeply at this point.
unless next_page.nil?
  scrape_people(next_page['href'])

def person_data(data)
  unwanted_keys = %i(party faction faction_id start_date end_date constituency term memberships_list)
In general, if you're going to build a list solely for the purpose of looking things up in it, you should make it a Set instead. But the whole approach here seems slightly off to me, though I can't really tell what you're trying to do.
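For instance, assuming `person_data` is meant to return the hash with those keys dropped:

```ruby
require 'set'

# Keys copied from the diff above; a Set gives O(1) membership checks and
# signals that this is a lookup structure, not an ordered list.
UNWANTED_KEYS = Set.new(%i(party faction faction_id start_date end_date constituency term memberships_list))

def person_data(data)
  data.reject { |key, _| UNWANTED_KEYS.include?(key) }
end
```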
ScraperWiki.save_sqlite([:id, :term], data)
ScraperWiki.sqliteexecute('DELETE FROM data') rescue nil
merged_memberships.each do |mem|
  ScraperWiki.save_sqlite([:name, :term], mem)
Keying on `name` here seems wrong: can you guarantee that there have never been two members with the same name? (And can a Member only have a single membership per term?)
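If the merged membership hashes carry the member's id (the person data is already keyed on `[:id, :term]`), something along these lines would avoid name collisions; whether the extra `start_date` key is needed depends on whether a member can hold more than one membership per term:

```ruby
# Sketch only: key memberships on the member id (plus term and start date)
# rather than the display name.
ScraperWiki.save_sqlite(%i(id term start_date), mem)
```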
Note:
This scraper depends on the correct cookies being sent with each request to a memberships list page. The local cache will need clearing out if you're running this branch for the first time.
Step one:
The scraper visits each member page, collecting data from the main profile tab and additional membership information from the 'all memberships' tab.
Step two:
Additional memberships are merged with the person data from the member profile page to create separate membership hashes.
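A rough sketch of that two-step flow (the method and variable names are illustrative, and the cookie handling described above is glossed over):

```ruby
# Step one: scrape every member page into a hash of fields.
members = member_urls.map do |url|
  MemberPage.new(response: Scraped::Request.new(url: url).response).to_h
end

# Step two: expand each member's extra memberships into separate hashes,
# each merged with the person data from the profile page.
memberships = members.flat_map do |member|
  member[:memberships_list].map { |extra| person_data(member).merge(extra) }
end
```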