diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/404.html b/404.html new file mode 100644 index 0000000..927f8ab --- /dev/null +++ b/404.html @@ -0,0 +1,821 @@ + + + +
+ + + + + + + + + + + + + + +The extractor can also be used as a library instead of on the command line.
Typically, you would:

1. Create a WPDownloader instance and call its download method.
2. Create a WPExtractor instance and call its extract method. The dataframes can then be accessed as class attributes or exported with the export method.

Examples of usage are available in the CLI scripts in the extractor.cli module.
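The two steps can be sketched roughly as follows. This is a minimal sketch, not the project's own example: the constructor arguments follow the API reference on this page, but calling `download()` and `extract()` with no arguments is an assumption, and `download_and_extract` is a hypothetical helper name.

```python
from pathlib import Path


def download_and_extract(target: str, json_dir: Path, data_types: list):
    """Download a site's data, then extract it into dataframes.

    Imports are deferred so this helper can be defined without wpextract installed.
    """
    # Import paths follow the API reference in this documentation.
    from extractor import WPDownloader, WPExtractor
    from extractor.dl import RequestSession

    # Wait about one second between requests, with random jitter.
    session = RequestSession(wait=1, random_wait=True)

    downloader = WPDownloader(
        target=target,
        out_path=json_dir,
        data_types=data_types,
        session=session,
    )
    downloader.download()  # assumption: takes no arguments

    extractor = WPExtractor(json_root=json_dir)
    extractor.extract()  # assumption: takes no arguments

    # Extracted data is exposed as DataFrame attributes such as posts or pages.
    return extractor.posts
```

The same directory is passed as `out_path` and `json_root` so that the extractor reads what the downloader wrote.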
Use the extractor.WPDownloader
class.
Possible customisations include:
+RequestSession
and passing to the session
parameter.

Use the extractor.WPExtractor
class.
When using this approach, it's possible to use customised translation pickers by passing subclasses of LangPicker
to the
If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add multilingual data to the output dataset.
+Extracting multilingual data is performed during the extract command. This data isn't available in the WordPress REST API response, so it must instead be obtained from scraped HTML.
+Obtaining the scraped HTML is relatively straightforward, as we already have a list of all posts from the download command.
+One way this could be scraped is using jq
to parse the downloaded posts file and produce a URL list, then wget
to download each page:
$ cat posts.json | jq -r '.[] | .link' > url_list.txt
+$ touch rejected.log
+$ wget --adjust-extension --input-file=url_list.txt \
+ --wait 1 --random-wait --force-directories \
+ --rejected-log=rejected.log
+
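If jq is not available, the URL list can also be produced with a few lines of Python. This is a minimal sketch that assumes, as the jq filter above implies, that posts.json is a JSON array of objects each carrying a `link` field; `write_url_list` is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_url_list(posts_file: Path, url_list_file: Path) -> int:
    """Write the link of every post in posts_file to url_list_file, one per line."""
    posts = json.loads(posts_file.read_text(encoding="utf-8"))
    # Mirror the jq filter '.[] | .link': take the "link" field of each array item.
    links = [post["link"] for post in posts]
    url_list_file.write_text("".join(f"{link}\n" for link in links), encoding="utf-8")
    return len(links)
```

The resulting file can be passed to wget with `--input-file` exactly as in the shell example.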
When running the extract command, pass this directory as the --scrape-root
argument. The scrape will be crawled to match URLs to downloaded HTML files following this process.
wpextract uses an extensible system of parsers to find language picker elements and extract their data.
+Currently the following plugins are supported:
+Supports:
+Adding as a widget (e.g. to a sidebar)
+<div id="polylang-2" class="widget widget_polylang">
+ <ul>
+ <li
+ class="lang-item lang-item-18 lang-item-en current-lang lang-item-first"
+ >
+ <a
+ hreflang="en-US"
+ href="https://example.org/current-lang-page/"
+ lang="en-US"
+ >
+ <img
+ src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAALCAMAAABBPP0LAAAAmVBMVEViZsViZMJiYrf9gnL8eWrlYkjgYkjZYkj8/PujwPybvPz4+PetraBEgfo+fvo3efkydfkqcvj8Y2T8UlL8Q0P8MzP9k4Hz8/Lu7u4DdPj9/VrKysI9fPoDc/EAZ7z7IiLHYkjp6ekCcOTk5OIASbfY/v21takAJrT5Dg6sYkjc3Nn94t2RkYD+y8KeYkjs/v7l5fz0dF22YkjWvcOLAAAAgElEQVR4AR2KNULFQBgGZ5J13KGGKvc/Cw1uPe62eb9+Jr1EUBFHSgxxjP2Eca6AfUSfVlUfBvm1Ui1bqafctqMndNkXpb01h5TLx4b6TIXgwOCHfjv+/Pz+5vPRw7txGWT2h6yO0/GaYltIp5PT1dEpLNPL/SdWjYjAAZtvRPgHJX4Xio+DSrkAAAAASUVORK5CYII="
+ alt="English"
+ style="width: 16px; height: 11px"
+ width="16"
+ height="11"
+ />
+ <span style="margin-left: 0.3em">English</span>
+ </a>
+ </li>
+ <li class="lang-item lang-item-20 lang-item-fr">
+ <a
+ hreflang="fr-FR"
+ href="https://example.org/fr/translation-page/"
+ lang="fr-FR"
+ >
+ <img
+ src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAALCAMAAABBPP0LAAAAbFBMVEVzldTg4ODS0tLxDwDtAwDjAADD0uz39/fy8vL3k4nzgna4yOixwuXu7u7s6+zn5+fyd2rvcGPtZljYAABrjNCpvOHrWkxegsqfs93NAADpUUFRd8THAABBa7wnVbERRKa8vLyxsLCoqKigoKClCvcsAAAAXklEQVR4AS3JxUEAQQAEwZo13Mk/R9w5/7UERJCIGIgj5qfRJZEpPyNfCgJTjMR1eRRnJiExFJz5Mf1PokWr/UztIjRGQ3V486u0HO55m634U6dMcf0RNPfkVCTvKjO16xHA8miowAAAAABJRU5ErkJggg=="
+ alt="Français"
+ style="width: 16px; height: 11px"
+ width="16"
+ height="11"
+ />
+ <span style="margin-left: 0.3em">Français</span>
+ </a>
+ </li>
+ <li class="lang-item lang-item-22 lang-item-de no-translation">
+ <a hreflang="de-DE" href="https://example.org/de/" lang="de-DE">
+ <img
+ src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAALCAIAAAD5gJpuAAABLElEQVR4AY2QgUZEQRSGz9ydmzbYkBWABBJYABHEFhJ6m0WP0DMEQNIr9AKrN8ne2Tt3Zs7MOdOZmRBEv+v34Tvub9R6fdNlAzU+snSME/wdjbjbbJ6EiEg6BA8102QbjKNpoMzw8v6qD/sOALbbT2MC1NgaAWOKOgxf5czY+4dbAX2G/THzcozLrvPV85IQyqVz0rvg2p9Pei4HjzSsiFbV4JgyhhxCjpGdZ0RhdikLB9/b8Qig7MkpSovR7Cp59q6CazaNFiTt4J82o6uvdMVwTsztKTXZod4jgOJJuqNAjFyGrBR8gM6XwKfIC4KanBSTZ0rClKh08D9DFh3egW7ebH7NcRDQWrz9rM2Ne+mDOXB2mZJ8agL19nwxR2iZXGm1gDbQKhDjd4yHb2oW/KR8xHicAAAAAElFTkSuQmCC"
+ alt="Deutsch"
+ style="width: 16px; height: 11px"
+ width="16"
+ height="11"
+ />
+ <span style="margin-left: 0.3em">Deutsch</span>
+ </a>
+ </li>
+ <li class="lang-item lang-item-24 lang-item-es no-translation">
+ <a hreflang="es-ES" href="https://example.org/es/" lang="es-ES">
+ <img
+ src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAALCAMAAABBPP0LAAAAflBMVEX/AAD9AAD3AADxAADrAAD/eXn9bGz8YWH8WVn6UVH5SEj5Pz/3NDT0Kir9/QD+/nL+/lT18lDt4Uf6+j/39zD39yf19R3n5wDxflXsZ1Pt4Y3x8zr0wbLs1NXz8xPj4wD37t3jmkvsUU/Bz6nrykm3vJ72IiL0FBTyDAvhAABEt4UZAAAAX0lEQVR4AQXBQUrFQBBAwXqTDkYE94Jb73+qfwVRcYxVQRBRToiUfoaVpGTrtdS9SO0Z9FR9lVy/g5c99+dKl30N5uxPuviexXEc9/msC7TOkd4kHu/Dlh4itCJ8AP4B0w4Qwmm7CFQAAAAASUVORK5CYII="
+ alt="Español"
+ style="width: 16px; height: 11px"
+ width="16"
+ height="11"
+ />
+ <span style="margin-left: 0.3em">Español</span>
+ </a>
+ </li>
+ <li class="lang-item lang-item-26 lang-item-zh no-translation">
+ <a hreflang="zh-CN" href="https://example.org/zh/" lang="zh-CN">
+ <img
+ src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAALCAMAAABBPP0LAAAAXVBMVEXUAADlQgDLAADBAADtgXn63Xjypnf1wHHpcG/oZmbmXVzlU1PjS0q1AAD981775VvwnVD2zkvhPz/fNzfdMjHcKyvaJyfsi0baISHYGhqqAADWExPTDQ2jAACfAAApGpDBAAAAWklEQVR4ATXIhQHDQBTDUMll2n/RMiU5/vQsAE4EsPbaKVOU+pXNwc/WKQXeDZMKu+psCXw/Z7efarmENd6GIwGpXhUvM4spxoiEbouRNT7Fmtaq+RG4wAqZZvceD8DeIelqAAAAAElFTkSuQmCC"
+ alt="中文 (中国)"
+ style="width: 16px; height: 11px"
+ width="16"
+ height="11"
+ />
+ <span style="margin-left: 0.3em">中文 (中国)</span>
+ </a>
+ </li>
+ <li class="lang-item lang-item-41 lang-item-ar no-translation">
+ <a hreflang="ar" href="https://example.org/ar/" lang="ar">
+ <img
+ src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAALCAMAAABBPP0LAAAANlBMVEUAYjMTYDs3R0AvV0NObzE3dSoTWzhAZjgyfEY0gl1EcDFqpIhKj28TVzaLs41ol1JSaF1JW1NzUHm9AAAAPUlEQVR4AY2MtQEAMAgE447tv2xKvuQqeEtRcikZ/9p6b9X/Mdfeaw4PnPvehQhNvpcnJYiInIqraqYpyAd1AAFxIEreLQAAAABJRU5ErkJggg=="
+ alt="العربية"
+ style="width: 16px; height: 11px"
+ width="16"
+ height="11"
+ />
+ <span style="margin-left: 0.3em">العربية</span>
+ </a>
+ </li>
+ </ul>
+</div>
+
Adding to the navbar as a custom dropdown1
+<div class="header-lang_switcher switcher-ltr">
+ <div class="current-lang-switcher">
+ <img src="https://example.org/flag_en.svg" alt="flag-en" />
+ <span>en</span>
+ </div>
+ <ul>
+ <li class="lang-item lang-item-5 lang-item-fr lang-item-first">
+ <a
+ hreflang="fr-FR"
+ href="https://example.org/fr/translation-page/"
+ lang="fr-FR"
+ >Français</a
+ >
+ </li>
+ <li class="lang-item lang-item-7 lang-item-de no-translation">
+ <a hreflang="de-DE" href="https://example.org/de/" lang="de-DE"
+ >Deutsch</a
+ >
+ </li>
+ <li class="lang-item lang-item-9 lang-item-es no-translation">
+ <a hreflang="es-ES" href="https://example.org/es/" lang="es-ES"
+ >Español</a
+ >
+ </li>
+ <li class="lang-item lang-item-11 lang-item-it no-translation">
+ <a hreflang="it-IT" href="https://example.org/it/" lang="it-IT"
+ >Italiano</a
+ >
+ </li>
+ <li class="lang-item lang-item-13 lang-item-zh no-translation">
+ <a hreflang="zh-CN" href="https://example.org/zh/" lang="zh-CN"
+ >中文 (中国)</a
+ >
+ </li>
+ <li class="lang-item lang-item-15 lang-item-ar no-translation">
+ <a hreflang="ar" href="https://example.org/ar/" lang="ar"
+ >العربية</a
+ >
+ </li>
+ </ul>
+</div>
+
Does not support:
+<select>
element

See also
+Using WPextract as a library for information on how to run wpextract as a library using additional pickers.
+Support can be added by creating a new picker definition inheriting from LangPicker
.
This parent class defines two abstract methods which must be implemented:
- LangPicker.get_root - returns the root element of the picker
- LangPicker.extract - find the languages, call LangPicker.set_current_lang and call LangPicker.add_translation for each

More complicated pickers may need to override additional methods of the class, but should still ultimately populate the LangPicker.translations
and LangPicker.current_language
attributes as the parent class does.
This section will show implementing a new picker with the following simplified markup:
+<ul class="translations">
+ <li><a href="/page/" class="lang current-lang" lang="en">English</a></li>
+ <li><a href="/de/seite/" class="lang" lang="de">Deutsch</a></li>
+ <li><a href="/page/" class="lang no-translation" lang="fr">Français</a></li>
+</ul>
+
get_root()
Using the self.page_doc
attribute, a BeautifulSoup
object representing the page, the root element of the picker should be found and returned.
The select_one
method is used to find the root element, and will return None
if no element is found, which will be interpreted as the picker not being present on the page.
If a value is returned, the self.root_el
attribute will be populated with the result of this method.
get_root implementation

extract()
Using the self.root_el
attribute, the languages should be found and added to the dataset.
Be careful to avoid:
- Adding the current language
- Adding languages which are listed but don't have translations
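extract implementation

from extractor.parse.translations import LangPicker

class MyPicker(LangPicker):
    ...
    def extract(self):
        for lang_el in self.root_el.select('li'):
            lang_a = lang_el.select_one('a')
            # get('class') returns a list of class names, or the default if absent
            classes = lang_a.get('class', [])
            if 'current-lang' in classes:
                # Record the current language instead of adding it as a translation
                self.set_current_lang(lang_a.get('lang'))
            elif 'no-translation' not in classes:
                self.add_translation(lang_a.get('href'), lang_a.get('lang'))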
+
We welcome contributions via a GitHub PR so long as the picker is not overly specific to a single site.
+This implementation may be overly customised to the site it was added to collect. ↩
+
extractor.WPDownloader
+
+
+WPDownloader(
+ target: str,
+ out_path: Path,
+ data_types: List[str],
+ session: Optional[RequestSession] = None,
+)
+
Manages the download of data from a WordPress site.
PARAMETER | DESCRIPTION | TYPE
---|---|---
target | the target WordPress site URL | str
out_path | the output path for the downloaded data | Path
data_types | set of data types to download | List[str]
session | request session. Will be created from default constructor if not provided. DEFAULT: None | Optional[RequestSession]
+
download_media_files
+
+
+download_media_files(session: RequestSession, dest: str)
+
Download site media files.
PARAMETER | DESCRIPTION | TYPE
---|---|---
session | the request session to use | RequestSession
dest | destination directory for media | str
+
extractor.dl.RequestSession
+
+
+RequestSession(
+ proxy: str = None,
+ cookies: str = None,
+ authorization: AuthorizationType = None,
+ timeout: float = 30,
+ wait: float = None,
+ random_wait: bool = False,
+ max_retries: int = 10,
+ backoff_factor: float = 0.1,
+ max_redirects: int = 20,
+)
+
Wrapper to handle the requests library with session support
PARAMETER | DESCRIPTION | TYPE
---|---|---
proxy | a dict containing a proxy server string for HTTP and/or HTTPS connection | str
cookies | a string in the format of the Cookie header | str
authorization | a tuple containing login and password or | AuthorizationType
timeout | maximum time in seconds to wait for a response before giving up | float
wait | wait time in seconds between requests, None to not wait | float
random_wait | If true, the wait time between requests is multiplied by a random factor between 0.5 and 1.5 | bool
max_retries | the maximum number of retries before failing | int
backoff_factor | Factor to wait between successive retries | float
max_redirects | maximum number of redirects to follow | int
+
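The interaction between wait and random_wait can be illustrated with a small stand-alone function. This is only a sketch of the behaviour described in the table, not the library's implementation; `next_wait` and its `rng` argument are hypothetical names introduced for the example.

```python
import random
from typing import Optional


def next_wait(wait: Optional[float], random_wait: bool, rng: random.Random) -> float:
    """Return the number of seconds to pause before the next request."""
    if wait is None:
        return 0.0  # None means no waiting between requests
    if random_wait:
        # Scale the base wait by a random factor between 0.5 and 1.5.
        return wait * rng.uniform(0.5, 1.5)
    return float(wait)
```

With a base wait of 2 seconds and random_wait enabled, each pause falls somewhere between 1 and 3 seconds.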
extractor.dl.requestsession.AuthorizationType
+
+
+
+ module-attribute
+
+
+AuthorizationType = Union[
+ Tuple[str, str], HTTPBasicAuth, HTTPDigestAuth
+]
+
extractor.WPExtractor
+
+
+WPExtractor(
+ json_root: Path,
+ scrape_root: Optional[Path] = None,
+ json_prefix: Optional[str] = None,
+ translation_pickers: Optional[PickerListType] = None,
+)
+
Manages the extraction of data from a WordPress site.
PARAMETER | DESCRIPTION | TYPE
---|---|---
json_root | Path to directory of JSON files | Path
scrape_root | Path to scrape directory | Optional[Path]
json_prefix | Prefix of files in | Optional[str]
translation_pickers | Supply a custom list of translation pickers | Optional[PickerListType]
+
categories
+
+
+
+ instance-attribute
+
+
+DataFrame of extracted categories.
+
link_registry
+
+
+
+ instance-attribute
+
+
+link_registry: LinkRegistry = LinkRegistry()
+
Registry of known URLs and their corresponding data items.
+
media
+
+
+
+ instance-attribute
+
+
+DataFrame of extracted media.
+
pages
+
+
+
+ instance-attribute
+
+
+DataFrame of extracted pages.
+
posts
+
+
+
+ instance-attribute
+
+
+DataFrame of extracted posts.
+
tags
+
+
+
+ instance-attribute
+
+
+DataFrame of extracted tags.
+
users
+
+
+
+ instance-attribute
+
+
+DataFrame of extracted users.
+
Link
+
+
+
+ dataclass
+
+
+A link to a URL.
+ + + + +
LinkRegistry
+
+
+A collection of all known links on the site.
+ + + + +
add_linkable
+
+
+Add a single linkable item to the registry.
+The URL will be compared later against a list of links that need to be resolved, and the data type and IDX will be returned.
+Data types should be unique. IDXes should be unique within one or more data types.
PARAMETER | DESCRIPTION
---|---
url | The URL of the destination
data_type | A unique identifier for this type of item.
idx | A unique identifier within the data type.
_refresh_cache | Whether the link cache should be updated. Should be left as True unless multiple links are being added together.
+
add_linkables
+
+
+Add multiple linkable items at once.
PARAMETER | DESCRIPTION
---|---
data_type | The data type for all items.
links | A list of links. Must be the same length as idxes.
idxes | A list of IDs. Must be the same length as links.

RAISES | DESCRIPTION
---|---
ValueError | if the links and idxes lists are not the same length.
+
query_link
+
+
+Find a linkable item by the URL in the registry.
+Returns None if no URL matches.
PARAMETER | DESCRIPTION
---|---
href | A URL to search

RETURNS | DESCRIPTION
---|---
Optional[Linkable] | A matching linkable
+
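The contract described for add_linkable, add_linkables and query_link can be illustrated with a small stand-in. This is a hedged sketch and not the library's implementation: `SketchLinkable` and `SketchLinkRegistry` are hypothetical classes, matching on the exact URL string and using string IDs are assumptions, and the `_refresh_cache` option is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SketchLinkable:
    """Hypothetical stand-in for Linkable: an item which can be linked to."""
    url: str
    data_type: str
    idx: str


class SketchLinkRegistry:
    """Hypothetical stand-in for LinkRegistry, keyed by exact URL."""

    def __init__(self) -> None:
        self._by_url: Dict[str, SketchLinkable] = {}

    def add_linkable(self, url: str, data_type: str, idx: str) -> None:
        self._by_url[url] = SketchLinkable(url, data_type, idx)

    def add_linkables(self, data_type: str, links: List[str], idxes: List[str]) -> None:
        # The two lists are paired element by element, so their lengths must match.
        if len(links) != len(idxes):
            raise ValueError("links and idxes must be the same length")
        for url, idx in zip(links, idxes):
            self.add_linkable(url, data_type, idx)

    def query_link(self, href: str) -> Optional[SketchLinkable]:
        # Returns None when no registered URL matches.
        return self._by_url.get(href)
```

A caller would register every known post URL once, then resolve links found in content by calling `query_link` and checking for `None`.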
Linkable
+
+
+
+ dataclass
+
+
+An item which can be linked to.
+ + + + +
extractor.parse.translations.LangPicker
+
+
+LangPicker(page_doc: BeautifulSoup)
+
+ Bases: ABC
Abstract class of a language picker style.
+Support for a new language picker can be added by creating a new class inheriting from this one.
PARAMETER | DESCRIPTION | TYPE
---|---|---
page_doc | The document to extract a language picker from. | BeautifulSoup
+
current_language
+
+
+
+ instance-attribute
+
+
+The current language of the page, populated by calling LangPicker.set_current_lang
within LangPicker.extract
.
page_doc
+
+
+
+ instance-attribute
+
+
+page_doc: BeautifulSoup = page_doc
+
The document to extract the language picker from.
+
root_el
+
+
+
+ instance-attribute
+
+
+root_el: Tag
+
The root element of the language picker, populated if LangPicker.matches
is successful.
translations
+
+
+
+ instance-attribute
+
+
+translations: List[TranslationLink] = []
+
A list of translation links, populated by calling LangPicker.add_translation
within LangPicker.extract
.
add_translation
+
+
+
extract
+
+
+
+ abstractmethod
+
+
+Extract the current language and translations from the doc.
+ +
get_root
+
+
+
+ abstractmethod
+
+
+Retrieve the root element of the translation picker.
+Using the LangPicker.page_doc
attribute (a bs4.BeautifulSoup
object representing the whole page), the root element of the picker should be found and returned.
RETURNS | DESCRIPTION
---|---
PageElement | The root element, or None if this picker is not found on the page.
+
matches
+
+
+matches() -> bool
+
Checks if this picker can extract from the document.
RETURNS | DESCRIPTION
---|---
bool | If the page uses this type of matcher.

RAISES | DESCRIPTION
---|---
TypeError | If the root element that has been retrieved is not a tag, or has 0 children. This may happen if it accidentally retrieves a text node.
+
extractor.parse.translations.PickerListType
+
+
+
+ module-attribute
+
+
+PickerListType = List[Type[LangPicker]]
+
extractor.parse.translations.TranslationLink
+
+
+
+ dataclass
+
+
+TranslationLink(
+ text: Optional[str],
+ href: Optional[str],
+ destination: Optional[Linkable],
+ lang: str,
+)
+
+ Bases: ResolvableLink
A link to an alternative version of this article in a different language.
+ + + + +