minor cli and docs fixes

GateNLP · Jul 10, 2024 · 6408309
1 parent e5a9838
commit 6408309
Showing 3 changed files with 8 additions and 3 deletions.
diff --git a/docs/usage/download.md b/docs/usage/download.md
@@ -18,6 +18,9 @@ $ wpextract dl target out_json
 `--json-prefix JSON_PREFIX`
 : Output files with a prefix, e.g. supplying _20240101-example_ will output posts to `out_dir/20240101-example-posts.json`
 
+`--media-dest`
+: Path to download media files to, skipped if not supplied. Must be an empty directory
+
 **skip data**
 
 `--no-categories` `--no-media` `--no-pages` `--no-posts` `--no-tags` `--no-users`

diff --git a/docs/usage/extract.md b/docs/usage/extract.md
@@ -36,7 +36,7 @@ $ wpextract extract json_root out_dir
 
 ### 1. Scrape Crawling (optional)
 
-If a scrape is provided with the `--scrape-root` argument, it is first crawled to map the correspondance between the HTML files on disk and the post URLs.
+If a scrape is provided with the `--scrape-root` argument, it is first crawled to map the correspondence between the HTML files on disk and the post URLs.
 
 Website scraping tools may store a webpage at a path that is not easy to derive from the URL (e.g. because of path length limits). For this reason, we crawl the scrape directory and build a mapping of URL to path.
 
@@ -62,7 +62,7 @@ The extraction process is applied to all posts simultaneously in the following o
    * Translations are detected using the translation pickers (implementing [`LangPicker`][wpextract.parse.translations.LangPicker])
    * Custom pickers can be added if using this tool as a library
    * Any extracted translations are stored as unresolved links
-5. Add the post's link to the link registry
+5. Add the post's link to the link registry[^linkregistry]
 6. Using the parsed API content response, extract:
    * Internal links (stored as unresolved links)
    * External links (stored as resolved links)
@@ -73,6 +73,8 @@ The extraction process is applied to all posts simultaneously in the following o
      2. Replace `<br>` tags and `<p>` tags with newline characters
      3. Combine all page text
 
+[^linkregistry]: The link registry stores a map between URLs of posts, pages, media etc. and their data type and ID. This is later used to resolve hyperlinks and media use.
+
 Other types are extracted in similar ways. Any additional user-supplied fields with HTML formatting (such as media captions) are also extracted as plain text.
 
 ### 4. Translation Normalisation and Link Resolution
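
The link registry described in the new footnote can be pictured as a lookup table from URL to data type and ID. The following is only a minimal sketch of that idea: `RegistryEntry`, `LinkRegistrySketch`, and the method names are hypothetical and are not taken from the wpextract codebase, and the real registry may normalise URLs or store more fields.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical record: what kind of item a URL points to, and its ID."""

    data_type: str  # e.g. "post", "page", "media"
    item_id: int


@dataclass
class LinkRegistrySketch:
    """Hypothetical URL -> (data type, ID) map, as described in the footnote."""

    _entries: Dict[str, RegistryEntry] = field(default_factory=dict)

    def add(self, url: str, data_type: str, item_id: int) -> None:
        # Register a known URL during extraction (step 5 in the docs).
        self._entries[url] = RegistryEntry(data_type, item_id)

    def resolve(self, url: str) -> Optional[RegistryEntry]:
        # Look up a previously unresolved link; None means the URL is unknown.
        return self._entries.get(url)


registry = LinkRegistrySketch()
registry.add("https://example.org/hello-world/", "post", 1)
registry.add("https://example.org/uploads/cat.jpg", "media", 7)

resolved = registry.resolve("https://example.org/hello-world/")
missing = registry.resolve("https://example.org/not-registered/")
```

In this sketch, internal links stay unresolved until every item has been registered, which matches the documented order: links are collected first and resolved in a later pass.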

diff --git a/src/wpextract/downloader.py b/src/wpextract/downloader.py
@@ -147,7 +147,7 @@ def export_decorator(
         json_path: Path,
         json_prefix: str,
         values: Any,
-        kwargs=None,
+        kwargs: Optional[dict] = None,
     ) -> None:
         """Call the export function with a constructed filename.