Refactor to use ScrapedPage subclasses #14
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
# Ignore output of scraper | ||
data.sqlite | ||
.cache |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# frozen_string_literal: true | ||
class String | ||
def tidy | ||
gsub(/[[:space:]]+/, ' ').strip | ||
end | ||
end |
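As a rough illustration of what `String#tidy` is for, the sketch below repeats the monkey-patch from the diff and applies it to a made-up string containing a non-breaking space and a newline. The sample text is an assumption, not something taken from the scraped site.

```ruby
# frozen_string_literal: true

# Same core extension as in the diff above.
class String
  # Collapse every run of whitespace (including Unicode spaces such as
  # U+00A0) into a single ASCII space, then trim both ends.
  def tidy
    gsub(/[[:space:]]+/, ' ').strip
  end
end

# Hypothetical sample: non-breaking spaces and a line break around a name.
RAW_NAME = "\u00A0García  Pérez,\n  Ana\u00A0 "
TIDY_NAME = RAW_NAME.tidy
```

Using `[[:space:]]` rather than `\s` matters here because `\s` only covers ASCII whitespace, while scraped HTML frequently contains non-breaking spaces.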
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# frozen_string_literal: true | ||
class DateOfBirth | ||
DATE_REGEX = /(?<day>\d+) de (?<month>[^[:space:]]*) de (?<year>\d+)/ | ||
|
||
def initialize(date_string) | ||
@date_string = date_string | ||
end | ||
|
||
def to_s | ||
return '' if match.nil? | ||
'%d-%02d-%02d' % [match[:year].to_i, month(match[:month]), match[:day].to_i] | ||
end | ||
|
||
private | ||
|
||
attr_reader :date_string | ||
|
||
def match | ||
@match ||= date_string.match(DATE_REGEX) | ||
end | ||
|
||
def month(str) | ||
['', 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'].find_index(str.downcase) || raise("Unknown month #{str}".magenta) | ||
end | ||
end |
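To make the intended behaviour of `DateOfBirth` concrete, here is a minimal standalone sketch of the same conversion. The helper name `spanish_date_to_iso` and the sample sentences are assumptions for illustration; the regular expression is the one from the diff. It converts captures with `to_i` so that a zero-padded day such as `08` is not rejected by `%d`.

```ruby
# frozen_string_literal: true

# Spanish month names in calendar order (index 0 is January).
MONTHS = %w[enero febrero marzo abril mayo junio julio agosto
            septiembre octubre noviembre diciembre].freeze

# Same pattern as DateOfBirth::DATE_REGEX in the diff.
DATE_REGEX = /(?<day>\d+) de (?<month>[^[:space:]]*) de (?<year>\d+)/

# Hypothetical helper: returns 'YYYY-MM-DD', or '' when no date is found.
def spanish_date_to_iso(text)
  m = text.match(DATE_REGEX)
  return '' if m.nil?

  index = MONTHS.index(m[:month].downcase)
  raise ArgumentError, "Unknown month #{m[:month]}" if index.nil?

  format('%d-%02d-%02d', m[:year].to_i, index + 1, m[:day].to_i)
end

# Hypothetical sample text, not copied from the real site.
SAMPLE_DOB = spanish_date_to_iso('Nacido el 7 de agosto de 1975 en Madrid')
```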
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# frozen_string_literal: true | ||
require_relative 'spanish_congress_page' | ||
require_relative 'date_of_birth' | ||
require_relative 'core_ext' | ||
|
||
class MemberPage < SpanishCongressPage | ||
field :iddiputado do | ||
query['idDiputado'] | ||
Yeah good point, I think |
||
end | ||
|
||
field :term do | ||
query['idLegislatura'] | ||
end | ||
|
||
field :name do | ||
noko.css('div#curriculum div.nombre_dip').text | ||
end | ||
|
||
field :family_names do | ||
name.split(/,/).first.to_s.tidy | ||
end | ||
|
||
field :given_names do | ||
name.split(/,/).last.to_s.tidy | ||
end | ||
|
||
field :gender do | ||
return 'female' if seat.include? 'Diputada' | ||
return 'male' if seat.include? 'Diputado' | ||
end | ||
|
||
field :party do | ||
noko.at_css('#datos_diputado .nombre_grupo').text.tidy | ||
end | ||
|
||
field :source do | ||
url.to_s | ||
end | ||
|
||
field :dob do | ||
DateOfBirth.new( | ||
noko.xpath('.//div[@class="titular_historico"]/following::div/ul/li').first.text | ||
).to_s | ||
end | ||
|
||
field :faction do | ||
faction_information[:faction].to_s.tidy | ||
end | ||
|
||
field :faction_id do | ||
faction_information[:faction_id].to_s.tidy | ||
end | ||
|
||
field :start_date do | ||
start_date = noko.xpath('.//div[@class="dip_rojo"][contains(.,"Fecha alta")]') | ||
.text.match(/(\d+)\/(\d+)\/(\d+)\./) | ||
return if start_date.nil? | ||
start_date.captures.reverse.join('-') | ||
end | ||
|
||
field :end_date do | ||
end_date = noko.xpath('.//div[@class="dip_rojo"][contains(.,"Causó baja")]') | ||
.text.match(/(\d+)\/(\d+)\/(\d+)\./) | ||
return if end_date.nil? | ||
end_date.captures.reverse.join('-') | ||
end | ||
|
||
field :email do | ||
noko.css('.webperso_dip a[href*="mailto"]').text.tidy | ||
end | ||
|
||
field :twitter do | ||
noko.css('.webperso_dip a[href*="twitter.com"]').text.tidy | ||
end | ||
|
||
field :facebook do | ||
noko.css('.webperso_dip a[href*="facebook.com"]').text.tidy | ||
end | ||
|
||
field :phone do | ||
noko.css('.texto_dip').text.match(/Teléfono: (.*)$/).to_a.last.to_s.tidy | ||
end | ||
|
||
field :fax do | ||
noko.css('.texto_dip').text.match(/Fax: (.*)$/).to_a.last.to_s.tidy | ||
end | ||
|
||
field :constituency do | ||
seat[/Diputad. por (.*)\./, 1] | ||
end | ||
|
||
field :photo do | ||
foto = noko.at_css('#datos_diputado img[name="foto"]') | ||
return if foto.nil? | ||
URI.join(url, foto[:src]).to_s | ||
end | ||
|
||
private | ||
|
||
def seat | ||
@seat ||= noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:first').text.tidy | ||
end | ||
|
||
def group | ||
@group ||= noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:last').text.tidy | ||
Is it really worth memoizing all these things? I don't think they're particularly costly, and some of them aren't even called more than once anyway.
No, just habit, but you're quite right, this is a form of unnecessary premature optimization. |
||
end | ||
|
||
def query | ||
@query ||= URI.decode_www_form(URI.parse(url).query).to_h | ||
end | ||
|
||
def faction_information | ||
@faction_information ||= group.match(/(?<faction>.*?) \((?<faction_id>.*?)\)/) || {} | ||
end | ||
end |
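The `start_date`, `end_date` and `faction_information` logic in `MemberPage` is plain regular-expression work, so it can be sketched without Nokogiri. The helper names and sample strings below are assumptions for illustration; only the two patterns come from the diff.

```ruby
# frozen_string_literal: true

# Hypothetical helper mirroring the start_date/end_date fields:
# turns 'dd/mm/yyyy.' into 'yyyy-mm-dd', or returns nil if absent.
def iso_date(text)
  m = text.match(%r{(\d+)/(\d+)/(\d+)\.})
  return if m.nil?

  m.captures.reverse.join('-')
end

# Hypothetical helper mirroring faction_information: splits
# 'Name (ID)' into its two parts, or returns {} when it does not match.
def faction_parts(text)
  m = text.match(/(?<faction>.*?) \((?<faction_id>.*?)\)/)
  return {} if m.nil?

  { faction: m[:faction], faction_id: m[:faction_id] }
end

# Made-up sample strings; the real page text may differ.
START_DATE = iso_date('Fecha alta: 13/01/2016.')
FACTION = faction_parts('G.P. Socialista (GS)')
```

Because the date pattern requires a trailing full stop, text without one yields `nil` rather than a partial match.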
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# frozen_string_literal: true | ||
require_relative 'spanish_congress_page' | ||
|
||
class MembersListPage < SpanishCongressPage | ||
A 'ScrapedPage' with no fields is a bit of a red flag. The first two of these look like they should be fields to me (and the |
||
def member_urls | ||
@member_urls ||= noko.css('div#RESULTADOS_DIPUTADOS div.listado_1 ul li a').map { |p| p[:href] } | ||
end | ||
|
||
def next_page_url | ||
next_page_link && next_page_link[:href] | ||
end | ||
|
||
def next_page_link | ||
@next_page_link ||= noko.xpath('//div[@class="paginacion"]//a[contains(., "Página Siguiente")]').first | ||
end | ||
end |
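One way `member_urls` and `next_page_url` might be consumed is a loop that follows the "next page" link until there is none. The sketch below uses an in-memory stand-in rather than real HTTP or Nokogiri; `FakeListPage`, `PAGES`, `all_member_urls` and the URLs are all hypothetical and not part of this PR.

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for a parsed list page; it exposes the same two
# readers as MembersListPage but holds canned data.
FakeListPage = Struct.new(:member_urls, :next_page_url)

# Hypothetical in-memory "site": URL => page.
PAGES = {
  '/list?page=1' => FakeListPage.new(%w[/member/1 /member/2], '/list?page=2'),
  '/list?page=2' => FakeListPage.new(%w[/member/3], nil)
}.freeze

# Follow next_page_url until it is nil, collecting every member URL.
# `fetch` is any callable that maps a URL to a page object.
def all_member_urls(start_url, fetch)
  urls = []
  url = start_url
  while url
    page = fetch.call(url)
    urls.concat(page.member_urls)
    url = page.next_page_url
  end
  urls
end

ALL_URLS = all_member_urls('/list?page=1', ->(u) { PAGES.fetch(u) })
```

Passing the fetcher in as a callable keeps the pagination logic testable without network access.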
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# frozen_string_literal: true | ||
require 'scraped_page' | ||
require 'uri' | ||
|
||
class SpanishCongressPage < ScrapedPage | ||
# Remove session information from url | ||
def url | ||
uri = URI.parse(super.to_s) | ||
return uri.to_s unless uri.query | ||
uri.query = uri.query.gsub(/_piref[\d_]+\./, '') | ||
uri.to_s | ||
end | ||
end |
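The `url` override can be exercised on its own with just the standard library. The sketch below assumes a made-up URL on `example.org` with a `_piref…` prefix on the first query key; `strip_session` and `query_params` are hypothetical helper names, while the substitution pattern and the `URI.decode_www_form` call are the ones used in the diff.

```ruby
# frozen_string_literal: true

require 'uri'

# Hypothetical helper: remove the session-specific '_piref<digits>.' prefix
# from the query string, leaving the rest of the URL untouched.
def strip_session(url)
  uri = URI.parse(url)
  return uri.to_s unless uri.query

  uri.query = uri.query.gsub(/_piref[\d_]+\./, '')
  uri.to_s
end

# Hypothetical helper: decode the query string into a Hash, as
# MemberPage#query does.
def query_params(url)
  URI.decode_www_form(URI.parse(url).query.to_s).to_h
end

# Made-up URL for illustration only.
SAMPLE_URL = 'http://example.org/diputados?_piref73_1333155_73.next_page=/wc/ficha&idDiputado=42&idLegislatura=12'
CLEAN_URL = strip_session(SAMPLE_URL)
PARAMS = query_params(CLEAN_URL)
```

Stripping the prefix first means the same member always produces the same `source` URL, regardless of the session the page was fetched in.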
Perhaps worth running a rubocop tidy against this file too, so we don't have mismatched quotes like this?