.. currentmodule:: hepcrawl
Here is an introduction to spiders: http://doc.scrapy.org/en/latest/topics/spiders.html
See also the official Scrapy spider tutorial.
Spiders are classes which inherit from the Scrapy ``Spider`` classes and contain the main
logic for retrieving content from the source and extracting metadata from the
source records. All spiders are located under the ``spiders/`` folder and
follow the naming convention ``mysource_spider.py``.
Traditionally, we receive metadata in XML format, so our spiders usually inherit
from a special XML parsing spider from Scrapy called ``XMLFeedSpider``.
.. code-block:: python

    from scrapy.spiders import XMLFeedSpider


    class MySpider(XMLFeedSpider):
        name = "myspider"
        itertag = 'article'  # XML tag to iterate over within each XML file

        def start_requests(self):
            # Retrieval of all data
            ...

        def parse_node(self, response, node):
            # extraction from XML per record
            ...
When you create a new spider, you need to implement at least two methods:

- ``start_requests()``: get the content from the source
- ``parse_node()``: extract the metadata from the downloaded content into a record
``start_requests`` handles the retrieval of data from the source and yields one
request per record file (in this case an XML file). You can even chain these
requests to do things like following links.
For example, in the World Scientific use case, we need to:

- Connect to an FTP server and get ZIP files
- For each ZIP file, extract its contents and parse every XML file
So our ``start_requests`` function first checks a remote FTP server for newly
added ZIP files and yields the full FTP path of each ZIP file to Scrapy. Scrapy
knows how to download things from an FTP server and fetches each ZIP file. While
yielding, we also tell Scrapy to call another function (``handle_package``) for
each ZIP file. This is called a callback function and it is needed to perform
step 2 from above.
This function then extracts all XML files from the ZIP files and finally yields
each XML file (without any further callbacks) to Scrapy, which now calls
``parse_node`` so that extracting metadata from a single XML file can finally begin.
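As an illustration, here is a minimal sketch of this two-step retrieval. The
helper ``list_new_zip_files_on_ftp`` is hypothetical (the real spider has its
own FTP listing logic), but the request/callback chaining matches what is
described above:

.. code-block:: python

    import os
    import tempfile
    import zipfile

    from scrapy import Request
    from scrapy.spiders import XMLFeedSpider


    class MySpider(XMLFeedSpider):
        name = "myspider"
        itertag = 'article'

        def start_requests(self):
            # Step 1: ask Scrapy to download every new ZIP file from the FTP
            # server and call handle_package() for each downloaded package.
            # list_new_zip_files_on_ftp() is a hypothetical helper returning
            # URLs such as "ftp://example.com/uploads/batch_1.zip".
            for zip_url in self.list_new_zip_files_on_ftp():
                yield Request(zip_url, callback=self.handle_package)

        def handle_package(self, response):
            # Step 2: unzip the downloaded package and yield one request per
            # XML file. With no explicit callback, XMLFeedSpider ends up
            # calling parse_node() for every `itertag` node in each file.
            zip_path = os.path.join(tempfile.mkdtemp(), "package.zip")
            with open(zip_path, "wb") as zip_file:
                zip_file.write(response.body)
            target_dir = zip_path + "_extracted"
            with zipfile.ZipFile(zip_path) as package:
                package.extractall(target_dir)
            for root, _, files in os.walk(target_dir):
                for filename in files:
                    if filename.endswith(".xml"):
                        yield Request("file://" + os.path.join(root, filename))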
``parse_node`` handles the extraction of records for an ``XMLFeedSpider`` into
so-called items. An item is basically an intermediate record object into which
the data from the XML is put. The function iterates over the XML tag specified
in ``itertag``, which means that it supports multiple records inside a single
XML file.
The goal of the ``parse_node`` function is to generate a ``HEPRecord``, which in
the Scrapy world is called an item. It is defined in ``items.py`` and is an
intermediate format of the record metadata that is extracted from every source.
It tries to resemble the HEP JSON Schema as closely as feasible, with some
exceptions.
.. code-block:: python

    import scrapy


    class HEPRecord(scrapy.Item):
        title = scrapy.Field()
        abstract = scrapy.Field()
        page_nr = scrapy.Field()
        journal_pages = scrapy.Field()
        # etc..
To do the extraction, you are given a ``node`` object, which is a selector on
the XML record. You can now use XPath (and even CSS) expressions to extract
content directly into the item, via some helper functions:
.. code-block:: python

    def parse_node(self, response, node):
        """Parse an XML file into a HEP record."""
        # For simplicity, remove all namespaces (optional)
        node.remove_namespaces()

        # Create a HEPRecord object with a special loader (more on this later)
        record = HEPLoader(item=HEPRecord(), selector=node, response=response)
        record.add_xpath('page_nr', "//counts/page-count/@count")
        record.add_xpath('abstract', '//abstract[1]')
        record.add_xpath('title', '//article-title/text()')
        return record.load_item()
Here you see that you can directly assign a value to the ``HEPRecord`` via the
``add_xpath`` function, but you are not forced to do so:
.. code-block:: python

    fpage = node.xpath('//fpage/text()').extract()
    lpage = node.xpath('//lpage/text()').extract()
    if fpage:
        record.add_value('journal_pages', "-".join(fpage + lpage))
NOTE: The value added when using ``add_xpath`` usually resolves into a Python
list of values, so remember that you need to deal with lists.
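For example (a standalone snippet using an inline XML string for illustration),
``extract()`` always gives you a list, even for a single match or no match at all:

.. code-block:: python

    from scrapy.selector import Selector

    node = Selector(text="<article><fpage>101</fpage><lpage>110</lpage></article>")
    node.xpath('//fpage/text()').extract()    # ['101'] -- a list with one value
    node.xpath('//missing/text()').extract()  # []      -- an empty list, never None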
Using ``add_value`` you can add any value you want to a field when you need to
do some extra logic.
Since INSPIRE has multiple sources of content, we will need multiple spiders
that retrieve and extract data differently. However, the intermediate
``HEPRecord`` is the common output of all sources.
This means that any additional metadata handling, such as converting journal
titles or author names to the correct format, can be done in a single place.
This is managed in the ``HEPLoader`` item loader located in ``loaders.py``.
The loader defines the input and output processors for the ``HEPRecord`` item.
An input processor processes the extracted data as soon as it is received
(through the ``add_xpath()``, ``add_css()`` or ``add_value()`` methods). An
output processor takes the data processed by the input processors and assigns
the result to the field in the item.
For example, a ``HEPRecord`` has a field called ``abstract`` for the abstract.
We want to take the incoming abstract string and convert some HTML tags to
their LaTeX counterparts. First we define our input processor inside
``inputs.py``:
.. code-block:: python

    import re

    def convert_html_subscripts_to_latex(text):
        """Convert some HTML tags to latex equivalents."""
        text = re.sub("<sub>(.*?)</sub>", r"$_{\1}$", text)
        text = re.sub("<sup>(.*?)</sup>", r"$^{\1}$", text)
        return text
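For example, calling it on a string containing ``<sub>`` and ``<sup>`` tags yields:

.. code-block:: python

    convert_html_subscripts_to_latex("H<sub>2</sub>O and E=mc<sup>2</sup>")
    # returns 'H$_{2}$O and E=mc$^{2}$'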
Then we add our input processor to the ``HEPLoader``:
.. code-block:: python

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose

    from .inputs import convert_html_subscripts_to_latex


    class HEPLoader(ItemLoader):
        abstract_in = MapCompose(
            convert_html_subscripts_to_latex,
            unicode.strip,
        )
To automatically link an input processor to the correct item field, we add the
suffix ``_in`` to the field name. Then we use a special processor called
``MapCompose``, which takes functions as parameters; each of them will be
called with each value in the field.
``record.add_xpath('abstract', '//abstract[1]')`` will add a list with one item::

    [".. some abstract from the XML .."]

Input processors like ``convert_html_subscripts_to_latex`` are then called with
``".. some abstract from the XML .."`` (per value, not with the whole list).
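To make this concrete, here is a small standalone illustration of how
``MapCompose`` applies its functions to every value of the list it receives:

.. code-block:: python

    from scrapy.loader.processors import MapCompose

    processor = MapCompose(lambda value: value.strip(), lambda value: value.upper())
    processor(["  first abstract ", " second abstract "])
    # returns ['FIRST ABSTRACT', 'SECOND ABSTRACT']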
We can also define output processors to control how the values are assigned to
the fields of the item. For example, to assign only the first value of the list
instead of the whole list:
.. code-block:: python

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst

    from .inputs import convert_html_subscripts_to_latex


    class HEPLoader(ItemLoader):
        abstract_in = MapCompose(
            convert_html_subscripts_to_latex,
            unicode.strip,
        )
        abstract_out = TakeFirst()
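``TakeFirst`` simply returns the first non-null, non-empty value it receives,
which is handy for fields that should hold a single value:

.. code-block:: python

    from scrapy.loader.processors import TakeFirst

    take_first = TakeFirst()
    take_first([None, "", "first abstract", "second abstract"])
    # returns 'first abstract'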
Take a look at the Scrapy item loaders documentation
(http://doc.scrapy.org/en/latest/topics/loaders.html) for some useful concepts
when dealing with item loaders.
Finally, the data in the items is exported to INSPIRE via special item
pipelines. These classes are located in ``pipelines.py``; they export harvested
records to JSON files and push them to INSPIRE-HEP.
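As a rough illustration of the mechanism (not the actual implementation in
``pipelines.py``), a pipeline that writes every harvested record to a JSON
Lines file could look like this:

.. code-block:: python

    import json


    class JsonWriterPipeline(object):
        """Sketch of a pipeline that dumps each harvested record to JSON Lines."""

        def open_spider(self, spider):
            # One output file per spider run; the filename is arbitrary here.
            self.file = open('%s_records.jsonl' % spider.name, 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # Serialize the item and keep passing it along the pipeline chain.
            self.file.write(json.dumps(dict(item)) + "\n")
            return item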
This documentation is still a work in progress.