Commit

rename main package
freddyheppell committed Jul 9, 2024
1 parent 209cf84 commit e79e980
Showing 67 changed files with 124 additions and 124 deletions.
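The change is a mechanical rename of the top-level package from `extractor` to `wpextract`, which is why additions and deletions are equal. A minimal sketch of how import rewrites like these could be automated follows. It is an illustration only, not the method used for this commit, and `rewrite_imports` is a hypothetical helper name.

```python
import re

# Match `from extractor...` / `import extractor...` at the start of a line,
# but only when `extractor` is the whole top-level package name
# (followed by a dot, whitespace, or the end of the line).
_IMPORT_RE = re.compile(
    r"^(\s*(?:from|import)\s+)extractor(?=[.\s]|$)", re.MULTILINE
)


def rewrite_imports(source: str, new_name: str = "wpextract") -> str:
    """Return `source` with top-level `extractor` imports renamed (hypothetical helper)."""
    return _IMPORT_RE.sub(lambda m: m.group(1) + new_name, source)


old = (
    "from extractor.dl.requestsession import RequestSession\n"
    "import extractor.parse.translations._pickers as pickers\n"
    "from wpextract.extractors.io import load_df\n"
)
print(rewrite_imports(old))
```

The lookahead keeps subpackages such as `wpextract.extractors` untouched, since only a leading `extractor` followed by a dot, whitespace, or end of line is rewritten. References inside strings, docs, and config files (for example `mkdocs.yml` and `pyproject.toml`) would still need separate handling.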
14 changes: 7 additions & 7 deletions docs/advanced/library.md
@@ -4,24 +4,24 @@ The extractor can also be used as a library instead of on the command line.

Typically, you would:

- instantiate a [`WPDownloader`][extractor.WPDownloader] instance and call its [`download`][extractor.WPDownloader.download] method.
- instantiate a [`WPExtractor`][extractor.WPExtractor] instance and call its `extract` method. The dataframes can then be accessed as class attributes or exported with the `export` method.
- instantiate a [`WPDownloader`][wpextract.WPDownloader] instance and call its [`download`][wpextract.WPDownloader.download] method.
- instantiate a [`WPExtractor`][wpextract.WPExtractor] instance and call its `extract` method. The dataframes can then be accessed as class attributes or exported with the `export` method.

Examples of usage are available in the CLI scripts in the `extractor.cli` module.
Examples of usage are available in the CLI scripts in the `wpextract.cli` module.



## Downloader

Use the [`extractor.WPDownloader`][extractor.WPDownloader] class.
Use the [`wpextract.WPDownloader`][wpextract.WPDownloader] class.

Possible customisations include:

- Implement highly custom request behaviour by subclassing [`RequestSession`][extractor.dl.RequestSession] and passing to the `session` parameter.
- Implement highly custom request behaviour by subclassing [`RequestSession`][wpextract.dl.RequestSession] and passing to the `session` parameter.


## Extractor

Use the [`extractor.WPExtractor`][extractor.WPExtractor] class.
Use the [`wpextract.WPExtractor`][wpextract.WPExtractor] class.

When using this approach, it's possible to use [customised translation pickers](../advanced/multilingual.md#adding-support) by passing subclasses of [`LanguagePicker`][extractor.parse.translations.LangPicker] to the
When using this approach, it's possible to use [customised translation pickers](../advanced/multilingual.md#adding-support) by passing subclasses of [`LanguagePicker`][wpextract.parse.translations.LangPicker] to the
8 changes: 4 additions & 4 deletions docs/advanced/multilingual.md
@@ -58,14 +58,14 @@ Currently the following plugins are supported:
!!! info "See also"
[Using WPextract as a library](library.md) for information on how to run wpextract as a library using additional pickers.

Support can be added by creating a new picker definition inheriting from [`LangPicker`][extractor.parse.translations.LangPicker].
Support can be added by creating a new picker definition inheriting from [`LangPicker`][wpextract.parse.translations.LangPicker].

This parent class defines two abstract methods which must be implemented:

- [`LangPicker.get_root`][extractor.parse.translations.LangPicker.get_root] - returns the root element of the picker
- [`LangPicker.extract`][extractor.parse.translations.LangPicker.extract] - find the languages, call [`LangPicker.set_current_lang`][extractor.parse.translations.LangPicker.set_current_lang] and call [`LangPicker.add_translation`][extractor.parse.translations.LangPicker.add_translation] for each
- [`LangPicker.get_root`][wpextract.parse.translations.LangPicker.get_root] - returns the root element of the picker
- [`LangPicker.extract`][wpextract.parse.translations.LangPicker.extract] - find the languages, call [`LangPicker.set_current_lang`][wpextract.parse.translations.LangPicker.set_current_lang] and call [`LangPicker.add_translation`][wpextract.parse.translations.LangPicker.add_translation] for each

More complicted pickers may need to override additional methods of the class, but should still ultimately populate the [`LangPicker.translations`][extractor.parse.translations.LangPicker.translations] and [`LangPicker.current_language`][extractor.parse.translations.LangPicker.current_language] attributes as the parent class does.
More complicted pickers may need to override additional methods of the class, but should still ultimately populate the [`LangPicker.translations`][wpextract.parse.translations.LangPicker.translations] and [`LangPicker.current_language`][wpextract.parse.translations.LangPicker.current_language] attributes as the parent class does.

This section will show implementing a new picker with the following simplified markup:

6 changes: 3 additions & 3 deletions docs/api/downloader.md
@@ -2,12 +2,12 @@

## Downloading

::: extractor.WPDownloader
::: wpextract.WPDownloader

## Configuring Request Behaviour

::: extractor.dl.RequestSession
::: wpextract.dl.RequestSession
options:
members: false

::: extractor.dl.requestsession.AuthorizationType
::: wpextract.dl.requestsession.AuthorizationType
10 changes: 5 additions & 5 deletions docs/api/extractor.md
@@ -1,23 +1,23 @@
# Extractor API

## Extraction
::: extractor.WPExtractor
::: wpextract.WPExtractor

## Extraction Data


::: extractor.extractors.data.links
::: wpextract.extractors.data.links
options:
show_root_heading: false
show_root_toc_entry: false

## Multilingual Extraction

::: extractor.parse.translations.LangPicker
::: wpextract.parse.translations.LangPicker

::: extractor.parse.translations.PickerListType
::: wpextract.parse.translations.PickerListType

::: extractor.parse.translations.TranslationLink
::: wpextract.parse.translations.TranslationLink
options:
inherited_members:
- destination
2 changes: 1 addition & 1 deletion docs/usage/extract.md
@@ -59,7 +59,7 @@ The extraction process is applied to all posts simultaneously in the following o
2. Parse the HTML content from the API response input.
3. Parse the HTML content from the scrape file, if it was found for the link during the crawl
4. Extract the post's language and translations from the scrape file
* Translations are detected using the translation pickers (implementing [`LangPicker`][extractor.parse.translations.LangPicker])
* Translations are detected using the translation pickers (implementing [`LangPicker`][wpextract.parse.translations.LangPicker])
* Custom pickers can be added if using this tool as a library
* Any extracted translations are stored as unresolved links
5. Add the post's link to the link registry
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -66,7 +66,7 @@ plugins:
ignore_init_summary: true
docstring_section_style: spacy
# filters: ["!^_"]
preload_modules: ["extractor"]
preload_modules: ["wpextract"]
heading_level: 3
inherited_members: true
merge_init_into_class: true
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -4,15 +4,15 @@ version="1.0.0a0"
description="Create a dataset from the WordPress API"
authors=["Freddy Heppell <[email protected]>"]
packages=[
{ include = "extractor", from = "src"}
{ include = "wpextract", from = "src"}
]
homepage="https://gatenlp.github.io/wordpress-site-extractor/"
repository="https://github.com/GateNLP/wordpress-site-extractor"
license="Apache-2.0"
readme = "README.md"

[tool.poetry.scripts]
wpextract = "extractor.cli.cli:main"
wpextract = "wpextract.cli.cli:main"

# Workaround for https://github.com/python-poetry/poetry/issues/9293
[[tool.poetry.source]]
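
The `[tool.poetry.scripts]` entry maps the `wpextract` command to an object reference of the form `module:attribute`, so the module path had to change together with the package name. As a rough sketch of how such a reference is resolved at run time (`resolve_reference` is a hypothetical helper, and `json:dumps` stands in for `wpextract.cli.cli:main` so the example runs without the package installed):

```python
import importlib


def resolve_reference(reference: str):
    """Resolve a 'module:attribute' reference to the object it names (hypothetical helper)."""
    module_name, _, attr_path = reference.partition(":")
    # Import the module part; a stale module path fails here.
    obj = importlib.import_module(module_name)
    # Walk the (possibly dotted) attribute part, if any.
    for attr in filter(None, attr_path.split(".")):
        obj = getattr(obj, attr)
    return obj


# Stand-in for "wpextract.cli.cli:main".
dumps = resolve_reference("json:dumps")
print(dumps({"ok": True}))
```

With this kind of lookup, a reference left pointing at a module path that no longer exists raises `ModuleNotFoundError` when the command starts, not at install time.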
File renamed without changes.
File renamed without changes.
6 changes: 3 additions & 3 deletions src/extractor/cli/_dl.py → src/wpextract/cli/_dl.py
@@ -1,8 +1,8 @@
from argparse import Namespace

from extractor.cli._shared import _register_shared
from extractor.dl.downloader import WPDownloader
from extractor.dl.requestsession import RequestSession
from wpextract.cli._shared import _register_shared
from wpextract.dl.downloader import WPDownloader
from wpextract.dl.requestsession import RequestSession

dl_types = ["categories", "media", "pages", "posts", "tags", "users"]

@@ -1,6 +1,6 @@
from extractor.cli._shared import _register_shared
from extractor.extract import WPExtractor
from extractor.util.args import directory, empty_directory
from wpextract.cli._shared import _register_shared
from wpextract.extract import WPExtractor
from wpextract.util.args import directory, empty_directory


def register_extract_parser(subparsers):
File renamed without changes.
4 changes: 2 additions & 2 deletions src/extractor/cli/cli.py → src/wpextract/cli/cli.py
@@ -5,8 +5,8 @@
from tqdm.auto import tqdm
from tqdm.contrib.logging import logging_redirect_tqdm

from extractor.cli._dl import do_dl, register_dl_parser
from extractor.cli._extract import do_extract, register_extract_parser
from wpextract.cli._dl import do_dl, register_dl_parser
from wpextract.cli._extract import do_extract, register_extract_parser


def _exec_command(parser, args):
File renamed without changes.
File renamed without changes.
@@ -2,10 +2,10 @@
from pathlib import Path
from typing import List, Optional

from extractor.dl.exceptions import WordPressApiNotV2
from extractor.dl.exporter import Exporter
from extractor.dl.requestsession import RequestSession
from extractor.dl.wpapi import WPApi
from wpextract.dl.exceptions import WordPressApiNotV2
from wpextract.dl.exporter import Exporter
from wpextract.dl.requestsession import RequestSession
from wpextract.dl.wpapi import WPApi


class WPDownloader:
File renamed without changes.
@@ -28,7 +28,7 @@

from tqdm.auto import tqdm

from extractor.dl.requestsession import RequestSession
from wpextract.dl.requestsession import RequestSession


class Exporter:
File renamed without changes.
File renamed without changes.
6 changes: 3 additions & 3 deletions src/extractor/dl/wpapi.py → src/wpextract/dl/wpapi.py
@@ -26,17 +26,17 @@

from tqdm.auto import tqdm

from extractor.dl.exceptions import (
from wpextract.dl.exceptions import (
NoWordpressApi,
NSNotFoundException,
WordPressApiNotV2,
)
from extractor.dl.requestsession import (
from wpextract.dl.requestsession import (
HTTPError404,
HTTPErrorInvalidPage,
RequestSession,
)
from extractor.dl.utils import (
from wpextract.dl.utils import (
get_by_id,
get_content_as_json,
url_path_join,
22 changes: 11 additions & 11 deletions src/extractor/extract.py → src/wpextract/extract.py
@@ -4,22 +4,22 @@

from pandas import DataFrame

from extractor.extractors.categories import load_categories
from extractor.extractors.data.links import LinkRegistry
from extractor.extractors.io import export_df
from extractor.extractors.media import load_media
from extractor.extractors.pages import load_pages
from extractor.extractors.posts import (
from wpextract.extractors.categories import load_categories
from wpextract.extractors.data.links import LinkRegistry
from wpextract.extractors.io import export_df
from wpextract.extractors.media import load_media
from wpextract.extractors.pages import load_pages
from wpextract.extractors.posts import (
ensure_translations_undirected,
load_posts,
resolve_post_links,
resolve_post_translations,
)
from extractor.extractors.tags import load_tags
from extractor.extractors.users import load_users
from extractor.parse.translations import PickerListType
from extractor.scrape.crawler import ScrapeCrawl
from extractor.util.file import prefix_filename
from wpextract.extractors.tags import load_tags
from wpextract.extractors.users import load_users
from wpextract.parse.translations import PickerListType
from wpextract.scrape.crawler import ScrapeCrawl
from wpextract.util.file import prefix_filename


class WPExtractor:
File renamed without changes.
@@ -4,9 +4,9 @@
import numpy as np
import pandas as pd

from extractor.extractors.data.links import LinkRegistry
from extractor.extractors.io import load_df
from extractor.util.locale import extract_locale
from wpextract.extractors.data.links import LinkRegistry
from wpextract.extractors.io import load_df
from wpextract.util.locale import extract_locale

EXPORT_COLUMNS = [
"name",
File renamed without changes.
@@ -3,7 +3,7 @@
from dataclasses import dataclass
from typing import List, Optional

from extractor.extractors.data.links import Linkable, LinkRegistry
from wpextract.extractors.data.links import Linkable, LinkRegistry


@dataclass
@@ -2,8 +2,8 @@
from typing import List, Optional
from urllib.parse import urlparse, urlunparse

from extractor.extractors.data.links import LinkRegistry, ResolvableLink
from extractor.util.str import remove_ends
from wpextract.extractors.data.links import LinkRegistry, ResolvableLink
from wpextract.util.str import remove_ends


def resolve_link(
File renamed without changes.
File renamed without changes.
@@ -4,9 +4,9 @@
import pandas as pd
from bs4 import Tag

from extractor.extractors.data.links import LinkRegistry
from extractor.extractors.io import load_df
from extractor.parse.html import extract_html_text
from wpextract.extractors.data.links import LinkRegistry
from wpextract.extractors.io import load_df
from wpextract.parse.html import extract_html_text

EXPORT_COLUMNS = [
"alt_text",
@@ -4,11 +4,11 @@
import pandas as pd
from tqdm.auto import tqdm

from extractor.extractors.data.links import LinkRegistry
from extractor.extractors.io import load_df
from extractor.parse.content import extract_content_data
from extractor.parse.html import extract_html_text, parse_html
from extractor.util.locale import extract_locale
from wpextract.extractors.data.links import LinkRegistry
from wpextract.extractors.io import load_df
from wpextract.parse.content import extract_content_data
from wpextract.parse.html import extract_html_text, parse_html
from wpextract.util.locale import extract_locale

EXPORT_COLUMNS = [
"author",
@@ -6,16 +6,16 @@
from pandas import DataFrame
from tqdm.auto import tqdm

from extractor.extractors.data.images import resolve_images
from extractor.extractors.data.link_resolver import resolve_links
from extractor.extractors.data.links import LinkRegistry
from extractor.extractors.io import load_df
from extractor.parse.content import extract_content_data
from extractor.parse.html import extract_html_text, parse_html
from extractor.parse.translations import PickerListType, extract_translations
from extractor.parse.translations._resolver import TranslationLink
from extractor.scrape.scrape import load_scrape
from extractor.util.locale import extract_locale
from wpextract.extractors.data.images import resolve_images
from wpextract.extractors.data.link_resolver import resolve_links
from wpextract.extractors.data.links import LinkRegistry
from wpextract.extractors.io import load_df
from wpextract.parse.content import extract_content_data
from wpextract.parse.html import extract_html_text, parse_html
from wpextract.parse.translations import PickerListType, extract_translations
from wpextract.parse.translations._resolver import TranslationLink
from wpextract.scrape.scrape import load_scrape
from wpextract.util.locale import extract_locale

EXPORT_COLUMNS = [
"author",
@@ -3,9 +3,9 @@

import pandas as pd

from extractor.extractors.data.links import LinkRegistry
from extractor.extractors.io import load_df
from extractor.util.locale import extract_locale
from wpextract.extractors.data.links import LinkRegistry
from wpextract.extractors.io import load_df
from wpextract.util.locale import extract_locale

EXPORT_COLUMNS = [
"count",
@@ -3,7 +3,7 @@

import pandas as pd

from extractor.extractors.io import load_df
from wpextract.extractors.io import load_df

EXPORT_COLUMNS = ["avatar", "description", "link", "name", "slug", "url"]

File renamed without changes.
@@ -6,10 +6,10 @@
import pandas as pd
from bs4 import BeautifulSoup, Comment, NavigableString

from extractor.extractors.data.images import MediaUse, ResolvableMediaUse
from extractor.extractors.data.links import Link, ResolvableLink
from extractor.extractors.media import get_caption
from extractor.util.str import squash_whitespace
from wpextract.extractors.data.images import MediaUse, ResolvableMediaUse
from wpextract.extractors.data.links import Link, ResolvableLink
from wpextract.extractors.media import get_caption
from wpextract.util.str import squash_whitespace

EXCLUDED_CONTENT_TAGS = {"figcaption"}
NEWLINE_TAGS = {"br", "p"}
@@ -2,7 +2,7 @@

from bs4 import BeautifulSoup

from extractor.util.str import squash_whitespace
from wpextract.util.str import squash_whitespace

PROBABLY_HTML = re.compile(r"<|&\S+;")

@@ -4,7 +4,7 @@
import pandas as pd
from bs4 import BeautifulSoup

import extractor.parse.translations._pickers as pickers
import wpextract.parse.translations._pickers as pickers

PICKERS = [pickers.Polylang, pickers.GenericLangSwitcher]
PickerListType = List[Type[pickers.LangPicker]]
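
Code written against the old package name will fail to import `extractor` once this rename is installed. A minimal sketch of one way downstream code could tolerate both names during a transition is shown below. `import_first_available` is a hypothetical helper, not part of the package, and the demonstration uses stand-in module names so it runs without wpextract installed.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def import_first_available(names: Sequence[str]) -> ModuleType:
    """Import and return the first top-level package in `names` that exists."""
    last_error: Optional[ModuleNotFoundError] = None
    for name in names:
        try:
            return importlib.import_module(name)
        except ModuleNotFoundError as exc:
            # Only skip when the package itself is missing; re-raise if the
            # package exists but one of its own dependencies is absent.
            if exc.name != name:
                raise
            last_error = exc
    raise ModuleNotFoundError(
        f"none of {list(names)} could be imported"
    ) from last_error


# Stand-in demonstration; real usage would be along the lines of
#   pkg = import_first_available(["wpextract", "extractor"])
module = import_first_available(["package_that_does_not_exist_12345", "json"])
print(module.__name__)
```

Listing the new name first means the old one is only tried as a fallback.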