- 1. Creating a Parser Stub
- 2. Creating a Publisher Specification
- 4. Validating the Current Implementation Progress
- 5. Implementing the Parser
- 6. Generate unit tests
- 7. Opening a Pull Request
Before contributing a publisher, make sure you have set up Fundus correctly by following these steps. Then check the supported publishers table to see if your desired publisher is already supported. In the following, we will walk you through an example implementation of the Los Angeles Times, covering best practices for adding a new publisher.
Take a look at the file structure in `fundus/publishers`.
Fundus uses the ALPHA-2 codes specified in ISO 3166 to sort publishers into directories by country of origin.
For example:

- `fundus/publishers/de/` for German publishers
- `fundus/publishers/us/` for US publishers
- ...
In case you don't see a directory labelled with the corresponding country code, feel free to create one.
Within this directory, add a file called `__init__.py` and create a class inheriting from `PublisherEnum`.
As an example, if you were to add the US, it should look something like this:

```python
from fundus.publishers.base_objects import PublisherEnum


class US(PublisherEnum):
    pass
```
Next, you should open the file `fundus/publishers/__init__.py` and make sure that the class `PublisherCollection` has an attribute corresponding to your newly added country:

```python
from fundus.publishers.base_objects import PublisherCollectionMeta
from fundus.publishers.us import US


class PublisherCollection(metaclass=PublisherCollectionMeta):
    us = US
```
Now create an empty file in the corresponding country section, using the publisher's name or an abbreviation of it as the file name.
For the Los Angeles Times, the correct country section is `fundus/publishers/us/`, since it is a newspaper based in the United States, with a filename like `la_times.py` or `los_angeles_times.py`.
We will continue here with `la_times.py`.
In the newly created file, add an empty parser class inheriting from `ParserProxy` and a parser version `V1` subclassing `BaseParser`:

```python
from fundus.parser import ParserProxy, BaseParser


class LATimesParser(ParserProxy):
    class V1(BaseParser):
        pass
```
Internally, the `ParserProxy` maps crawl dates to specific versions (`V1`, `V2`, etc.) subclassing `BaseParser`.
Since Fundus' parsers are handcrafted and usually tied to specific layouts, this proxying step helps address changes to the layout.
Next, add a new publisher specification for the publisher you want to cover.
The publisher specification links information about the publisher, sources from which to get the HTML to parse, and the corresponding parser used by Fundus' `Crawler`.
You can add a new entry to the country-specific `PublisherEnum` in the `__init__.py` of the country section you want to contribute to, i.e. `fundus/publishers/<country_code>/__init__.py`.
For now, we only specify the publisher's name, domain, and parser.
We will cover sources in the next step.
For the Los Angeles Times (LA Times), we add the following entry to `fundus/publishers/us/__init__.py`:
```python
class US(PublisherEnum):
    LATimes = PublisherSpec(
        name="Los Angeles Times",
        domain="https://www.latimes.com/",
        parser=LATimesParser,
    )
```
If the country section for your publisher did not exist before step 1, please add the `PublisherEnum` to the `PublisherCollection` in `fundus/publishers/__init__.py`.
For your newly added publisher to work, you first need to specify where to find articles (in the form of HTML) to parse.
Fundus adopts a unique approach by utilizing access points provided by the publishers, rather than resorting to generic web spiders.
Publishers offer various methods to access their articles, with the most common being RSS feeds, APIs, or sitemaps.
Presently, Fundus supports RSS feeds and sitemaps by adding them as corresponding `URLSource`s using the `sources` parameter of `PublisherSpec`.
Fundus provides the following types of `URLSource`, which you can import from `fundus.scraping.html`:

- `RSSFeed` - specifying RSS feeds
- `Sitemap` - specifying sitemaps
- `NewsMap` - specifying a special kind of sitemap displaying only recent articles

Fundus distinguishes between these source types to facilitate crawling only recent articles (`RSSFeed`, `NewsMap`) or an entire website (`Sitemap`).
This differentiation is mainly for efficiency reasons.
Refer to this documentation on how to filter for different source types.
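For example, once your publisher is finished, recent-only crawling can look roughly like the sketch below. The `restrict_sources_to` parameter name is taken from the linked documentation and is an assumption here, so double-check it against your installed Fundus version.

```python
from fundus import Crawler, NewsMap, PublisherCollection

# Sketch: only crawl sources of type NewsMap, i.e. recent articles
crawler = Crawler(PublisherCollection.us, restrict_sources_to=[NewsMap])

for article in crawler.crawl(max_articles=5):
    print(article)
```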
NOTE: When adding a new publisher, it is recommended to specify at least one `Sitemap` and one `RSSFeed` or `NewsMap` (preferred).
If your publisher provides a `NewsMap`, there is no need to specify an `RSSFeed`.
To instantiate an object inheriting from `URLSource`, like `RSSFeed` or `Sitemap`, you first need to find a link to the corresponding feed or sitemap and then set it as the entry point using the `url` parameter of `URLSource`.
Getting links for RSS feeds can vary from publisher to publisher.
Most of the time, you can find them through a quick browser search.
Building an `RSSFeed` looks like this:

```python
from fundus import RSSFeed

RSSFeed("https://theintercept.com/feed/?rss")
```
Sitemaps consist of a collection of `<url>` tags, indicating links to articles with properties attached, following a standardized schema.
A typical sitemap looks like this:

```xml
<urlset ... >
   <url>
      <loc>https://www.latimes.com/recipe/peach-frozen-yogurt</loc>
      <lastmod>2020-01-29</lastmod>
   </url>
   ...
```
NOTE: There is a known issue with Firefox not displaying XML properly. You can find a plugin to resolve this issue here.
Links to sitemaps are typically found within the `robots.txt` file provided by the publisher, often located at the end of it.
To access this file, append `robots.txt` to the end of the publisher's domain.
For example, to access the LA Times' `robots.txt`, use https://www.latimes.com/robots.txt in your preferred browser.
This will give you the following two sitemap links:

```
Sitemap: https://www.latimes.com/sitemap.xml
Sitemap: https://www.latimes.com/news-sitemap.xml
```
The former refers to a regular sitemap, and the latter points to a `NewsMap`, which is a special kind of sitemap. To have a look at how to differentiate between those two, refer to this section.
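If you prefer to collect these links programmatically rather than in the browser, a small sketch using only the Python standard library does the job (the LA Times URL is simply the example from above):

```python
from urllib.request import Request, urlopen

# Print every sitemap link advertised in the publisher's robots.txt
request = Request("https://www.latimes.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
with urlopen(request) as response:
    for line in response.read().decode("utf-8").splitlines():
        if line.lower().startswith("sitemap:"):
            print(line.split(":", 1)[1].strip())
```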
Most `Sitemap`s, and sometimes `NewsMap`s as well, will be index maps.
E.g. accessing https://www.latimes.com/news-sitemap.xml will give you something like this:

```xml
<sitemapindex ... >
   <sitemap>
      <loc>https://www.latimes.com/news-sitemap-content.xml</loc>
      <lastmod>2023-08-02T07:10-04:00</lastmod>
   </sitemap>
   ...
</sitemapindex>
```
The `<sitemap>`, and especially the `<sitemapindex>` tag, indicates that this is, in fact, an index map pointing to other sitemaps rather than articles.
To address this, `Sitemap` and `NewsMap` will step through the given sitemap recursively by default.
You can alter this behavior, or reverse the order in which sitemaps are processed, with the `recursive` and `reverse` parameters, respectively.
NOTE: If you wonder why you should reverse your sources from time to time: a `URLSource` should, if possible, yield URLs in descending order by publishing date.
Now, building a new `URLSource` for a `NewsMap` covering the LA Times looks like this:

```python
from fundus import NewsMap

NewsMap("https://www.latimes.com/news-sitemap.xml", reverse=True)
```
Fundus differentiates between two types of sitemaps:
those that almost or actually span the entire site (`Sitemap`) and those that only reference recent articles (`NewsMap`), often called Google News Maps.
You can check if a sitemap is a news map by:

- Checking the file name:
  Often there is a string like `news` included. While this is a very simple method, it can be unreliable.
- Checking the namespace:
  Typically, there is a namespace `news` defined within a news map using the `xmlns:news` attribute of the `<urlset>` tag, e.g. `<urlset ... xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" ... >`.
  NOTE: This can only be found within the actual sitemap and not the index map.

A few additional parameters can be useful when finishing the specification (a sketch combining them follows this list):

- Sometimes sitemaps can include a lot of noise, like maps pointing to a collection of tags or authors, etc.
  You can use the `sitemap_filter` parameter of `Sitemap` or `NewsMap` to pre-filter these based on a regular expression. E.g.

  ```python
  Sitemap("https://apnews.com/sitemap.xml", sitemap_filter=regex_filter("apnews.com/hub/|apnews.com/video/"))
  ```

  will filter out all URLs encountered within the processing of the `Sitemap` object that include either the string `apnews.com/hub/` or `apnews.com/video/`.
- If your publisher requires custom request headers to work properly, you can alter them using the `request_header` parameter of `PublisherSpec`. The default is: `{"user_agent": "Fundus"}`.
- If you want to block URLs for the entire publisher, use the `url_filter` parameter of `PublisherSpec`.
- In some cases it can be necessary to append query parameters to the end of the URL, e.g. to load the article as one page. This can be achieved by setting the `query_parameter` attribute of `PublisherSpec` to a dictionary containing the key-value pairs, e.g. `{"page": "all"}`. These key-value pairs will be appended to all crawled URLs.
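To illustrate how these optional parameters fit into a specification, here is a loose sketch for a made-up publisher. The publisher, URLs, and filter patterns are hypothetical, and the import paths for `PublisherSpec` and `regex_filter` are assumptions, so check them against an existing country section before copying anything.

```python
from fundus import NewsMap, Sitemap
from fundus.parser import BaseParser, ParserProxy
from fundus.publishers.base_objects import PublisherEnum, PublisherSpec
from fundus.scraping.filter import regex_filter


class ExamplePublisherParser(ParserProxy):
    class V1(BaseParser):
        pass


class Example(PublisherEnum):
    ExamplePublisher = PublisherSpec(
        name="Example Publisher",
        domain="https://www.example.com/",
        sources=[
            # Drop tag/author collections while stepping through the sitemap
            Sitemap("https://www.example.com/sitemap.xml", sitemap_filter=regex_filter("/tags/|/authors/")),
            NewsMap("https://www.example.com/news-sitemap.xml"),
        ],
        request_header={"user_agent": "Fundus"},  # custom headers, if required
        url_filter=regex_filter("/video/"),  # block matching URLs for the entire publisher
        query_parameter={"page": "all"},  # appended to every crawled article URL
        parser=ExamplePublisherParser,
    )
```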
Now, let's put it all together to specify the LA Times as a new publisher in Fundus:
```python
class US(PublisherEnum):
    LATimes = PublisherSpec(
        name="Los Angeles Times",
        domain="https://www.latimes.com/",
        sources=[
            Sitemap("https://www.latimes.com/sitemap.xml"),
            NewsMap("https://www.latimes.com/news-sitemap.xml"),
        ],
        parser=LATimesParser,
    )
```
Now validate your implementation progress by crawling some example articles from your publisher. The following script works for the Los Angeles Times and can be adapted by changing the `publisher` variable accordingly.
```python
from fundus import PublisherCollection, Crawler

# Change to:
# PublisherCollection.<country_section>.<publisher_specification>
publisher = PublisherCollection.us.LATimes

crawler = Crawler(publisher)

for article in crawler.crawl(max_articles=2, only_complete=False):
    print(article)
```
If everything has been implemented correctly, the script should output articles like the following:

```
Fundus-Article:
- Title: "--missing title--"
- Text: "--missing plaintext--"
- URL: https://www.latimes.com/sports/story/2023-06-26/100-years-los-angeles-coliseum-historical-events
- From: Los Angeles Times
Fundus-Article:
- Title: "--missing title--"
- Text: "--missing plaintext--"
- URL: https://www.latimes.com/sports/sparks/story/2023-06-25/los-angeles-sparks-dallas-wings-wnba-game-analysis
- From: Los Angeles Times
```
Since we didn't add any specific implementation to the parser yet, most entries are empty.
Now bring your parser to life and define the attributes you want to extract.
One important caveat to consider is the type of content on a particular page. Some news outlets feature live tickers, pages displaying podcasts, or hub sites that link to other pages but are not articles themselves. At this stage, there's no need to concern yourself with handling non-article pages. Your parser should concentrate on extracting the desired attributes from most pages that can be classified as articles. Pages lacking the desired attributes will be filtered out by the library during a later phase of the processing pipeline.
You can add attributes by decorating the methods of your parser with the `@attribute` decorator.
The expected return value for each attribute must precisely match the specifications outlined in the attribute guidelines.
For instance, if you want to extract article titles, first refer to the attribute guidelines and identify an attribute that aligns with your objective.
There you can locate an attribute named `title`, which precisely corresponds to what you aim to extract, along with its expected return type.
It is essential to adhere to the specified return types, as they are enforced through our unit tests.
While you're welcome to experiment locally, contributions to the repository won't be accepted if your pull request deviates from the guidelines.
NOTE: Should you wish to add an attribute not covered in the guidelines, set the `validate` parameter of the attribute decorator to `False`, like this:

```python
@attribute(validate=False)
def unsupported_attribute(self):
    ...
```

Attributes marked with `validate=False` will not be validated through unit tests.
Now, once we have identified the attribute we want to extract, we add it to the parser by defining a method using the associated name, in our case `title`, and marking it as an attribute using the `@attribute` decorator.

```python
from typing import Optional

from fundus.parser import BaseParser, ParserProxy, attribute


class LATimesParser(ParserProxy):
    class V1(BaseParser):
        @attribute
        def title(self) -> Optional[str]:
            return "This is a title"
```
To see the results of our newly added titles, we can use the following code:
```python
for article in crawler.crawl(max_articles=2):
    print(article.title)
```

This should print the following output:

```
This is a title
This is a title
```
Fundus will automatically add your decorated attributes as instance attributes to the `Article` object during parsing.
Additionally, attributes defined in the attribute guidelines are explicitly defined as `dataclasses.fields`.
One way to extract useful information from articles rather than placeholders is to utilize the `ld` and `meta` attributes of the `Article`.
These attributes are automatically extracted when they are present in the currently parsed HTML.
Often, valuable information about an article, such as the `title`, `author`, or `topics`, can be found in these two objects.
To access them during parsing, you can use the `precomputed` attribute of `BaseParser`, which references a `dataclass` of type `Precomputed`.
This object contains meta-information about the article you're currently parsing.
```python
@dataclass
class Precomputed:
    html: str
    doc: lxml.html.HtmlElement
    meta: Dict[str, str]
    ld: LinkedData
    cache: Dict[str, Any]
```

Here is a brief description of the fields of `Precomputed`.
| Precomputed Attribute | Description |
|---|---|
| `html` | The original fetched HTML. |
| `doc` | The root node of an `lxml.html.Etree` spanning the fetched HTML. |
| `meta` | The article's meta-information extracted from `<meta>` tags. |
| `ld` | The linked data extracted from the HTML's `ld+json` elements. |
| `cache` | A cache specific to the currently parsed site, which can be used to share objects between attributes. Share objects with the `BaseParser.share` method. |
For instance, to extract the title for an article in the Los Angeles Times, we can access the `og:title` through the `meta` attribute of `Precomputed`:

```python
@attribute
def title(self) -> Optional[str]:
    # Use the `get` function to retrieve data from the `meta` precomputed attribute
    return self.precomputed.meta.get("og:title")
```
When parsing the `ArticleBody`, or when the desired information cannot be extracted from the `ld` or `meta` attributes, you need to obtain the information directly from the Document Object Model (DOM) of the HTML/XML.
The DOM serves as an interface representing the underlying HTML or XML file as a tree structure, where each element (tag) of the file functions as a node in the tree.
To select or search for the information you need, you can access these nodes using selectors like CSS-Select or XPath.
Fundus relies on the Python package `lxml` and its selector implementation.
Consider the following HTML example:
```html
<html lang="de">
   <head>
      <meta charset="utf-8">
      <title>...</title>
   </head>
   <body>
      <h2>This is a heading.</h2>
      <p>This is a paragraph inside the body.</p>
      <p class="A">This is a paragraph with a class.</p>
      <div>
         <p>This is a paragraph within a div</p>
      </div>
      <div class="B">
         <p>This is a paragraph within a div of class B</p>
      </div>
      <p additional-attribute="not allowed">This is a paragraph with a weird attribute</p>
   </body>
</html>
```
To work with `lxml` selectors, the initial step involves constructing an `Etree`, which represents the DOM of the HTML.
This is achieved as follows:

```python
import lxml.html

# `html` is the example HTML string from above
root = lxml.html.document_fromstring(html)
```
This will return an object of type `lxml.html.HtmlElement` representing the root node of the DOM tree.
Within the Fundus parser, the DOM tree is already generated for each article, and the root node can be accessed via the `doc` attribute of `Precomputed`.
Next we will show you how to specify search conditions in the form of selectors and use them on the tree.
CSS-Select is generally a simpler, but less comprehensive, selector compared to XPath. In most instances, it's advisable to use CSS-Select and resort to XPath only when necessary. To define your selector, we recommend using this reference.
Here's an example of creating a selector to target all `<p>` tags within the tree and extracting their text content using `text_content()`:

```python
from lxml.cssselect import CSSSelector

selector = CSSSelector("p")
nodes = selector(root)

for node in nodes:
    print(node.text_content())
```
This should print the following lines:
```
This is a paragraph inside the body.
This is a paragraph with a class.
This is a paragraph within a div
This is a paragraph within a div of class B
This is a paragraph with a weird attribute
```
NOTE: The nodes are returned in depth-first pre-order.
Similarly, you can select based on the `class` attribute of a tag.
For instance, selecting all `<p>` tags with class `A` looks like this:

```python
selector = CSSSelector("p.A")
```

Which will print:

```
This is a paragraph with a class.
```
Often you need to select tags depending on their parents.
To illustrate, let's select all `<p>` tags that have a `<div>` tag as their parent.

```python
selector = CSSSelector("div > p")
```

Output:

```
This is a paragraph within a div
This is a paragraph within a div of class B
```
Combining these techniques, you can select all `<p>` tags that have a parent `<div>` with class `B`.

```python
selector = CSSSelector("div.B > p")
```

Output:

```
This is a paragraph within a div of class B
```
Selectors can also target nodes with specific attribute values, even if those attributes are not standard in the HTML specification:
selector = CSSSelector("p[additional-attribute='not allowed']")
Output:
```
This is a paragraph with a weird attribute
```
NOTE: It's also possible to select solely by the existence of an attribute by omitting the equality.
Sticking to the above example, you can simply use `CSSSelector("p[additional-attribute]")` instead.
Given the complexity of XPath compared to CSS-Select, we refrain from providing an extensive tutorial here. Instead, we recommend referring to this documentation for a translation table and a concise overview of XPath functionalities beyond CSS-Select.
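If you do need XPath, `lxml` exposes it in much the same way as CSS-Select. As a small sketch, here is an XPath equivalent of the `div.B > p` selector from above:

```python
from lxml.etree import XPath

# Equivalent to CSSSelector("div.B > p"); note that, unlike CSS classes,
# @class='B' is an exact string match on the attribute value
selector = XPath("//div[@class='B']/p")

for node in selector(root):
    print(node.text_content())
```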
NOTE: Although it's possible to select nodes using the built-in methods of `lxml.html.HtmlElement`, it's recommended to use the dedicated selectors `CSSSelector` and `XPath`, as demonstrated in the above examples.
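Putting selectors and `Precomputed` together, an attribute that runs a selector against the parsed DOM might look like the following sketch; the `subheadlines` attribute and its selector are purely illustrative and not part of the actual LA Times parser:

```python
from typing import List

from lxml.cssselect import CSSSelector

from fundus.parser import BaseParser, ParserProxy, attribute


class LATimesParser(ParserProxy):
    class V1(BaseParser):
        # Illustrative selector; adjust it to the publisher's actual markup
        _subheadline_selector = CSSSelector("article h2")

        @attribute(validate=False)
        def subheadlines(self) -> List[str]:
            # Run the selector against the precomputed DOM root and collect the text content
            return [node.text_content().strip() for node in self._subheadline_selector(self.precomputed.doc)]
```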
NOTE: The `fundus/parser/utility.py` module includes several utility functions that can assist you in implementing parser attributes.
Make sure to examine other parsers and consult the attribute guidelines for specifics on attribute implementation.
We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.
In case your new publisher does not have a subscription model, you can go ahead and skip this step.
If it does, please verify that the HTML's `ld+json` elements (refer to the section on extracting attributes from `Precomputed` for details) contain a tag `isAccessibleForFree` that is set to `false`/`False` in the source code of premium articles and to `true`/`True` in free articles, respectively.
It doesn't matter if the tag is missing in the freely accessible articles.
If this is the case, you can continue with the next step. If not, please overwrite the existing function by adding the following snippet to your parser:

```python
@attribute
def free_access(self) -> bool:
    # Your personalized logic goes here
    ...
```
Usually, you can identify a premium article by an indicator within the URL, or by using an XPath or CSS selector to find the element asking you to purchase a subscription to view the article.
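For illustration only (the paywall selector below is hypothetical and has to be adapted to the publisher at hand), such an override could look like this:

```python
from lxml.cssselect import CSSSelector

from fundus.parser import BaseParser, ParserProxy, attribute


class SomePublisherParser(ParserProxy):
    class V1(BaseParser):
        # Hypothetical paywall indicator; inspect premium articles to find the real one
        _paywall_selector = CSSSelector("div.paywall")

        @attribute
        def free_access(self) -> bool:
            # Consider the article freely accessible if no paywall element is present in the DOM
            return not self._paywall_selector(self.precomputed.doc)
```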
Bringing all of the above together, the parser for the Los Angeles Times now looks like this:
```python
import datetime
from typing import List, Optional

from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
)


class LATimesParser(ParserProxy):
    class V1(BaseParser):
        _paragraph_selector = CSSSelector("div[data-element*=story-body] > p")

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                paragraph_selector=self._paragraph_selector,
            )

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("datePublished"))

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.ld.bf_search("author"))

        @attribute
        def title(self) -> Optional[str]:
            return self.precomputed.meta.get("og:title")
```
Now, execute the example script from step 4 to validate your implementation. If the attributes are implemented correctly, they appear in the printout accordingly.
```
Fundus-Article:
- Title: "One hundred years at the Coliseum: Much more than a sports venue"
- Text: "Construction for the Los Angeles Coliseum was completed on May 1, 1923. Capacity
  at the time: 75,000. The stadium was designed by architects John [...]"
- URL: https://www.latimes.com/sports/story/2023-06-26/100-years-los-angeles-coliseum-historical-events
- From: Los Angeles Times (2023-06-26 12:00)
Fundus-Article:
- Title: "Sparks back at .500: Five things to know about the team after win Sunday"
- Text: "Finally, the home crowd at Crypto.com Arena had something to cheer about. After
  dropping the first three games of their longest homestand of the [...]"
- URL: https://www.latimes.com/sports/sparks/story/2023-06-25/los-angeles-sparks-dallas-wings-wnba-game-analysis
- From: Los Angeles Times (2023-06-25 21:30)
```
To finish your newly added publisher, you should add unit tests for the parser. We recommend you do this with the provided script.
To get started with this script, you may read the provided manual:

```
python -m scripts.generate_parser_test_files -h
```

Then, in most cases, it should be enough to simply run

```
python -m scripts.generate_parser_test_files -p <your_publisher_class>
```

with `<your_publisher_class>` being the class name of the `PublisherEnum` you're working on.
In our case, we would run:

```
python -m scripts.generate_parser_test_files -p LATimes
```

to generate a unit test for our parser.
To fully integrate your new publisher, you have to add it to the supported publishers table. You can do so by simply running:

```
python -m scripts.generate_tables
```

Now, to test your newly added publisher, run pytest with the following command:

```
pytest
```
- Make sure you tested your parser using `pytest`.
- Run `black src`, `isort src`, and `mypy src` with no errors.
- Push and open a new PR.
- Congratulations, and thank you very much.