diff --git a/.gitignore b/.gitignore
index 45b97c6..c908628 100644
--- a/.gitignore
+++ b/.gitignore
@@ -61,4 +61,7 @@ MANIFEST
## == JAVASCRIPT ==
-node_modules/
\ No newline at end of file
+node_modules/
+
+## == DOCUMENTATION ==
+site/
\ No newline at end of file
diff --git a/README.md b/README.md
index 88a5f35..42d9bc5 100644
--- a/README.md
+++ b/README.md
@@ -1,215 +1,44 @@
-# WordPress Site Extractor
+# WPextract - WordPress Site Extractor
-Processes an API dump of a WordPress site into a dataset, including identifying parallel multilingual articles, and resolving internal links and media.
+**WPextract is a tool to create datasets from WordPress sites.**
+
+- Archives posts, pages, tags, categories, media (including files), comments, and users
+- Uses the WordPress REST API to retrieve accurate and complete content
+- Resolves internal links and media to IDs
+- Automatically parses multilingual sites to create parallel datasets
> [!NOTE]
> This software was developed for our EMNLP 2023 paper [_Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study_](https://aclanthology.org/2023.emnlp-main.349/). The code has been updated since the paper was written; for archival purposes, the precise version used for the study is [available on Zenodo](https://zenodo.org/records/10008086).
-## Referencing
-
-We'd love to hear about your use of our tool, you can [email us](mailto:frheppell1@sheffield.ac.uk) to let us know! Feel free to create issues and/or pull requests for new features or bugs.
-
-If you use this tool in published work, please cite [our EMNLP paper](https://aclanthology.org/2023.emnlp-main.349/):
-
-
-BibTeX Citation
-
-```bibtex
-@inproceedings{heppell-etal-2023-analysing,
- title = "Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study",
- author = "Heppell, Freddy and
- Bontcheva, Kalina and
- Scarton, Carolina",
- editor = "Bouamor, Houda and
- Pino, Juan and
- Bali, Kalika",
- booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
- month = dec,
- year = "2023",
- address = "Singapore",
- publisher = "Association for Computational Linguistics",
- url = "https://aclanthology.org/2023.emnlp-main.349",
- pages = "5729--5741",
- doi = "10.18653/v1/2023.emnlp-main.349"
-}
-```
-
-2. Replace `<br>` tags and `<p>` tags with newline characters
-3. Combine all page text
-
-Other types are extracted in similar ways. Any additional user-supplied fields with HTML formatting (such as media captions) are also extracted as plain text.
-
-### 4. Translation Normalisation and Link Resolution
-
-Translations are normalised by checking that for every translation relation (e.g. `en` -> `fr`), the reverse exists. If not, it will be added.
-
-
-After all types have been processed, the link registry is used to process the unresolved links, translations and media.
-
-For every resolution, the following steps are performed:
-1. Remove the `preview_id` query parameter from the URL if present
-2. Attempt to look up the URL in the link registry
-3. If unsuccessful, use a heuristic to detect category slugs in the URL and try without them
-   * We do this in case sites have removed category slugs from the permalink at some point.
-4. If unsuccessful, warn that the URL is unresolvable
-
-For each resolved link, translation, or media, a destination is set containing its normalised URL, data type, and ID.
-
-### 5. Export
-
-The columns of each type are subset and exported as a JSON file each.
-
-## Acknowledgements and License
+We'd love to hear about your use of our tool, you can [email us](mailto:frheppell1@sheffield.ac.uk) to let us know! Feel free to create issues and/or pull requests for new features or bugs.
-This software is made available under the terms of the [Apache License version 2.0](LICENSE).
+If you use this tool in published work, please cite [our EMNLP paper](https://aclanthology.org/2023.emnlp-main.349/):
-Portions of this software are derived from other works, see [the `NOTICE` file](NOTICE) for further information.
\ No newline at end of file
+> Freddy Heppell, Kalina Bontcheva, and Carolina Scarton. 2023. Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5729–5741, Singapore. Association for Computational Linguistics.
diff --git a/docs/advanced/library.md b/docs/advanced/library.md
new file mode 100644
index 0000000..734d4f3
--- /dev/null
+++ b/docs/advanced/library.md
@@ -0,0 +1,27 @@
+# Using as a Library
+
+The extractor can also be used as a library instead of on the command line.
+
+Typically, you would:
+
+- instantiate a [`WPDownloader`][extractor.WPDownloader] instance and call its [`download`][extractor.WPDownloader.download] method.
+- instantiate a [`WPExtractor`][extractor.WPExtractor] instance and call its `extract` method. The dataframes can then be accessed as class attributes or exported with the `export` method.
+
+Examples of usage are available in the CLI scripts in the `extractor.cli` module.
+
+
+## Downloader
+
+Use the [`extractor.WPDownloader`][extractor.WPDownloader] class.
+
+Possible customisations include:
+
+- Implement highly custom request behaviour by subclassing [`RequestSession`][extractor.dl.RequestSession] and passing to the `session` parameter.
+
+
+## Extractor
+
+Use the [`extractor.WPExtractor`][extractor.WPExtractor] class.
+
+When using this approach, it's possible to use [customised translation pickers](../advanced/multilingual.md#adding-support) by passing subclasses of [`LanguagePicker`][extractor.parse.translations.LangPicker] to the
diff --git a/docs/advanced/multilingual.md b/docs/advanced/multilingual.md
new file mode 100644
index 0000000..b29a209
--- /dev/null
+++ b/docs/advanced/multilingual.md
@@ -0,0 +1,122 @@
+# Multilingual Sites
+
+If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add multilingual data in the output dataset.
+
+## Extraction Process
+
+Extracting multilingual data is performed during the [extract command](../usage/extract.md). This data isn't available in the WordPress REST API response, so instead must be obtained from scraped HTML.
+
+Obtaining the scraped HTML is relatively straightforward, as we already have a list of all posts from the [download command](../usage/download.md).
+
+One way this could be scraped is using `jq` to parse the downloaded posts file and produce a URL list, then `wget` to download each page:
+
+```shell-session
+$ cat posts.json | jq -r '.[] | .link' > url_list.txt
+$ touch rejected.log
+$ wget --adjust-extension --input-file=url_list.txt \
+    --wait 1 --random-wait --force-directories \
+    --rejected-log=rejected.log
+```
+
+When running [the extract command](../usage/extract.md), pass this directory as the `--scrape-root` argument. The scrape will be crawled to match URLs to downloaded HTML files following [this process](../usage/extract.md#1-scrape-crawling-optional).
+
+
+## Supported Plugins
+
+wpextract uses an extensible system of parsers to find language picker elements and extract their data.
+
+Currently the following plugins are supported:
+
+### Polylang
+
+[Plugin Page](https://wordpress.org/plugins/polylang/) · [Website](https://polylang.pro/)
+
+**Supports**:
+
+- Adding as a widget (e.g. to a sidebar)
+
+    ??? example
+        ```html
+        --8<-- "tests/parse/translations/test_pickers/polylang.html:struct"
+        ```
+
+- Adding to the navbar as a custom dropdown[^dropdown]
+
+    ??? example
+        ```html
+        --8<-- "tests/parse/translations/test_pickers/generic_polylang.html:struct"
+        ```
+
+**Does not support**:
+
+- Methods which show the picker as a `