minor cli and docs fixes
freddyheppell committed Jul 10, 2024
1 parent e5a9838 commit 6408309
Showing 3 changed files with 8 additions and 3 deletions.
3 changes: 3 additions & 0 deletions docs/usage/download.md
@@ -18,6 +18,9 @@ $ wpextract dl target out_json
`--json-prefix JSON_PREFIX`
: Output files with a prefix, e.g. supplying _20240101-example_ will output posts to `out_dir/20240101-example-posts.json`

+`--media-dest`
+: Path to download media files to, skipped if not supplied. Must be an empty directory
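
The "must be an empty directory" requirement on `--media-dest` could be checked along the following lines. This is only a sketch of one possible check, not wpextract's actual validation; the helper name `is_empty_dir` is hypothetical.

```python
from pathlib import Path


def is_empty_dir(path: Path) -> bool:
    """Return True if `path` is an existing directory with no entries.

    Hypothetical helper for illustration; not part of wpextract.
    """
    # any() stops at the first entry, so a large directory is not fully listed
    return path.is_dir() and not any(path.iterdir())
```

A path that does not exist yet returns `False` here; whether the real CLI creates a missing directory or rejects it is not stated in this diff.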

**skip data**

`--no-categories` `--no-media` `--no-pages` `--no-posts` `--no-tags` `--no-users`
6 changes: 4 additions & 2 deletions docs/usage/extract.md
@@ -36,7 +36,7 @@ $ wpextract extract json_root out_dir

### 1. Scrape Crawling (optional)

-If a scrape is provided with the `--scrape-root` argument, it is first crawled to map the correspondance between the HTML files on disk and the post URLs.
+If a scrape is provided with the `--scrape-root` argument, it is first crawled to map the correspondence between the HTML files on disk and the post URLs.

Website scraping tools may store a webpage at a path that is not easy to derive from the URL (e.g. because of path length limits). For this reason, we crawl the scrape directory and build a mapping of URL to path.
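
As an illustration of the URL-to-path mapping described above, here is a minimal sketch. It assumes each scraped page declares its URL in a `<link rel="canonical">` tag, which is an assumption made for this example and not something this diff confirms about wpextract. The names `CanonicalLinkFinder` and `build_url_map` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Dict, Optional


class CanonicalLinkFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def build_url_map(scrape_root: Path) -> Dict[str, Path]:
    """Walk scrape_root and map each page's canonical URL to its file path."""
    url_to_path: Dict[str, Path] = {}
    for html_file in sorted(scrape_root.rglob("*.html")):
        finder = CanonicalLinkFinder()
        finder.feed(html_file.read_text(encoding="utf-8", errors="replace"))
        if finder.canonical:
            # Keep the first file found for a URL; later duplicates are ignored
            url_to_path.setdefault(finder.canonical, html_file)
    return url_to_path
```

Because the URL is read from the file contents, the file name on disk can be anything the scraping tool chose, including a hashed or truncated name.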

@@ -62,7 +62,7 @@ The extraction process is applied to all posts simultaneously in the following o
* Translations are detected using the translation pickers (implementing [`LangPicker`][wpextract.parse.translations.LangPicker])
* Custom pickers can be added if using this tool as a library
* Any extracted translations are stored as unresolved links
-5. Add the post's link to the link registry
+5. Add the post's link to the link registry[^linkregistry]
6. Using the parsed API content response, extract:
* Internal links (stored as unresolved links)
* External links (stored as resolved links)
@@ -73,6 +73,8 @@ The extraction process is applied to all posts simultaneously in the following o
2. Replace `<br>` tags and `<p>` tags with newline characters
3. Combine all page text
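
The two text-extraction steps visible in this hunk (newlines for `<br>` and `<p>`, then combining the text) can be sketched with the standard library alone. This is an illustrative stand-in, not the parser wpextract uses; `TextExtractor` and `html_to_text` are hypothetical names, and any earlier step hidden by the hunk boundary is not reproduced.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, emitting a newline for each <br> or <p> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Step 2: stand in a newline for line-break and paragraph tags
        if tag in ("br", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML fragment as one plain string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Step 3: combine all collected text, trimming outer whitespace
    return "".join(parser.parts).strip()
```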

+[^linkregistry]: The link registry stores a map between URLs of posts, pages, media etc. and their data type and ID. This is later used to resolve hyperlinks and media use.
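
A registry of the kind the footnote describes can be pictured as a dictionary keyed by URL. The sketch below is a simplified illustration under that assumption; `SketchLinkRegistry` and `LinkTarget` are hypothetical names and do not reflect wpextract's own classes or method signatures.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LinkTarget:
    """What a URL points at: a data type and the item's ID."""

    data_type: str  # e.g. "post", "page" or "media"
    item_id: int


class SketchLinkRegistry:
    """Hypothetical stand-in: maps a URL to its data type and ID."""

    def __init__(self) -> None:
        self._targets: Dict[str, LinkTarget] = {}

    def add(self, url: str, data_type: str, item_id: int) -> None:
        # Register a known URL while items are being extracted
        self._targets[url] = LinkTarget(data_type, item_id)

    def resolve(self, url: str) -> Optional[LinkTarget]:
        # Unknown URLs resolve to None instead of raising
        return self._targets.get(url)
```

Filling the registry during extraction and querying it afterwards matches the order in the docs, where internal links are first stored unresolved and only resolved in a later step.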

Other types are extracted in similar ways. Any additional user-supplied fields with HTML formatting (such as media captions) are also extracted as plain text.

### 4. Translation Normalisation and Link Resolution
2 changes: 1 addition & 1 deletion src/wpextract/downloader.py
@@ -147,7 +147,7 @@ def export_decorator(
json_path: Path,
json_prefix: str,
values: Any,
-kwargs=None,
+kwargs: Optional[dict] = None,
) -> None:
"""Call the export function with a constructed filename.
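
For context on the new annotation: `Optional[dict] = None` is the usual way to type a parameter that defaults to `None`, and it avoids using a mutable `{}` as the default. The sketch below shows the pattern in isolation; `call_export` is a hypothetical function and is not the body of `export_decorator`.

```python
from typing import Any, Callable, Optional


def call_export(
    export_func: Callable[..., Any],
    filename: str,
    kwargs: Optional[dict] = None,
) -> Any:
    """Call export_func with filename plus any extra keyword arguments.

    Hypothetical helper for illustration only.
    """
    if kwargs is None:
        # Build a fresh dict per call so no state is shared between calls
        kwargs = {}
    return export_func(filename, **kwargs)
```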
