docs: update info, links and dependency (#773)
* update docs

* lint text
adbar authored Dec 28, 2024
1 parent 91c567c commit 42ada5a
Showing 15 changed files with 55 additions and 86 deletions.
26 changes: 11 additions & 15 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,42 +1,38 @@
## How to contribute

Your contributions make the software and its documentation better. A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.

If you value this software or depend on it for your product,
consider sponsoring it and contributing to its codebase.
Your support will help ensure the sustainability and growth of the project.

There are many ways to contribute, you could:
There are many ways to contribute:

* Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
* Sponsor the project: Show your appreciation [on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi).
* Find bugs and submit bug reports: Help make Trafilatura an even more robust tool.
* Write code: Fix bugs or add new features by writing [pull requests](https://docs.github.com/en/pull-requests) with a list of what you have done.
* Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
* Submit feature requests: Share your feedback and suggestions.
* Write code: Fix bugs or add new features.


Here are some important resources:

* [List of currently open issues](https://github.com/adbar/trafilatura/issues) (by no means exhaustive!)
* [How to contribute to open source](https://opensource.guide/how-to-contribute/)

A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.


## Testing and evaluating the code

Here is how you can run the tests and code quality checks:
Here is how you can run the tests and code quality checks. Pull requests will only be accepted if the changes are tested and there are no errors.

- Install the necessary packages with `pip install trafilatura[dev]`
- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
- Run `mypy` on the directory: `mypy trafilatura/`
- See also the [tests Readme](tests/README.rst) for information on the evaluation benchmark

Pull requests will only be accepted if there are no errors in pytest and mypy.

If you work on text extraction, it is useful to check whether performance is equal or better on the benchmark.


## Submitting changes

Please send a pull request to Trafilatura with a list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).

**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)

See the [tests Readme](tests/README.rst) for more information.


For further questions you can use [GitHub issues](https://github.com/adbar/trafilatura/issues) and discussion pages, or [E-Mail](https://adrien.barbaresi.eu/).
9 changes: 4 additions & 5 deletions README.md
@@ -141,13 +141,12 @@ This work started as a PhD project at the crossroads of linguistics and
NLP; this expertise has been instrumental in shaping Trafilatura over
the years. Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this package continues to be maintained but its future development
depends on community support.
this package continues to be maintained but its future depends on community support.

**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support will
help maintain and enhance this popular package, ensuring its growth,
robustness, and accessibility for developers and users around the world.
sponsoring it and contributing to its codebase**. Your support
[on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi)
will help maintain and enhance this popular package.

*Trafilatura* is an Italian word for [wire
drawing](https://en.wikipedia.org/wiki/Wire_drawing) symbolizing the
5 changes: 2 additions & 3 deletions docs/conf.py
@@ -20,8 +20,8 @@

# -- Project information -----------------------------------------------------

project = 'trafilatura'
copyright = '2024, Adrien Barbaresi'
project = 'Trafilatura'
copyright = '2025, Adrien Barbaresi'
html_show_sphinx = False
author = 'Adrien Barbaresi'
version = trafilatura.__version__
@@ -88,7 +88,6 @@
## pydata options
html_theme_options = {
"github_url": "https://github.com/adbar/trafilatura",
"twitter_url": "https://twitter.com/adbarbaresi",
"external_links": [
{"name": "Blog", "url": "https://adrien.barbaresi.eu/blog/tag/trafilatura.html"},
],
6 changes: 3 additions & 3 deletions docs/corpus-data.rst
@@ -45,7 +45,7 @@ Formats and software used in corpus linguistics

Input/Output formats: TXT, XML and XML-TEI are quite frequent in corpus linguistics.

- Han., N.-R. (2022). "`Transforming Data <https://doi.org/10.7551/mitpress/12200.003.0010>`_", The Open Handbook of Linguistic Data.
- Han., N.-R. (2022). "Transforming Data", The Open Handbook of Linguistic Data.


The XML and XML-TEI formats
@@ -62,9 +62,9 @@ Corpus analysis tools
- `CorpusExplorer <https://notes.jan-oliver-ruediger.de/software/corpusexplorer-overview/>`_ supports CSV, TXT and various XML formats
- `Corpus Workbench (CWB) <https://cwb.sourceforge.io/>`_ uses verticalized texts whose origin can be in TXT or XML format
- `LancsBox <http://corpora.lancs.ac.uk/lancsbox/>`_ supports various formats, notably TXT & XML
- `TXM <http://textometrie.ens-lyon.fr/?lang=en>`_ (textometry platform) can take TXT, XML & XML-TEI files as input
- `TXM <https://txm.gitpages.huma-num.fr/textometrie/en/>`_ (textometry platform) can take TXT, XML & XML-TEI files as input
- `Voyant <https://voyant-tools.org/>`_ supports various formats, notably TXT, XML & XML-TEI
- `Wmatrix <http://ucrel.lancs.ac.uk/wmatrix/>`_ can work with TXT and XML
- `Wmatrix <https://ucrel.lancs.ac.uk/wmatrix/>`_ can work with TXT and XML
- `WordSmith <https://lexically.net/wordsmith/index.html>`_ supports TXT and XML

Further corpus analysis software can be found on `corpus-analysis.com <https://corpus-analysis.com/>`_.
22 changes: 7 additions & 15 deletions docs/index.rst
@@ -118,29 +118,21 @@ This package is distributed under the `Apache 2.0 license <https://www.apache.or
Versions prior to v1.8.0 are under GPLv3+ license.



Contributing
~~~~~~~~~~~~

Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.

Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!


Context
-------

This work started as a PhD project at the crossroads of linguistics and NLP;
this expertise has been instrumental in shaping Trafilatura over the years.
Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this package continues to be maintained but its future development
depends on community support.
this package continues to be maintained but its future depends on community support.

**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support will
help maintain and enhance this popular package, ensuring its growth,
robustness, and accessibility for developers and users around the world.
sponsoring it and contributing to its codebase**. Your support
`on GitHub <https://github.com/sponsors/adbar>`_ or `ko-fi.com <https://ko-fi.com/adbarbaresi>`_
will help maintain and enhance this popular package.
Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_
for more information.

*Trafilatura* is an Italian word for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_ symbolizing the refinement and conversion process. It is also the way shapes of pasta are formed.

@@ -225,8 +217,8 @@ Further documentation
usage
tutorials
evaluation
corefunctions
used-by
corefunctions
background

:ref:`genindex`
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,6 +1,6 @@
# with version specifier
sphinx>=8.1.3
pydata-sphinx-theme>=0.16.0
pydata-sphinx-theme>=0.16.1
docutils>=0.21.2
sphinx-sitemap>=2.6.0

12 changes: 4 additions & 8 deletions docs/sources.rst
@@ -39,18 +39,18 @@ Corpora
URL lists from corpus linguistic projects can be a useful starting point, either to recreate existing corpora or to re-crawl the websites and find new content. If the websites do not exist anymore, the links can still be useful, as the corresponding web pages can often be retrieved from web archives.

- `Sources for the Internet Corpora <http://corpus.leeds.ac.uk/internet.html>`_ of the Leeds Centre for Translation Studies
- `Link data sets <https://www.webcorpora.org/opendata/links/>`_ of the COW project
- `Link data sets <http://www.webcorpora.org/opendata/links/>`_ of the COW project
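When links from such lists have gone dead, a candidate snapshot can be located programmatically through the Internet Archive. Below is a minimal sketch using only Python's standard library to build a query for the public Wayback Machine availability endpoint; the endpoint URL and its parameters are assumptions based on that public API, not part of the corpus projects above.

```python
import urllib.parse

# Public Wayback Machine availability endpoint (assumed; check the
# Internet Archive documentation before relying on it).
WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build a query URL asking for the archived snapshot closest
    to the given timestamp (YYYY[MM[DD]])."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urllib.parse.urlencode(params)

print(availability_query("http://example.org/old-page", "2015"))
# Fetching this URL (e.g. with urllib.request) returns JSON describing
# the closest archived snapshot, if any.
```

The query construction is pure string handling, so dead links can be batched offline first and only then fetched against the archive.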


URL directories
~~~~~~~~~~~~~~~

- `Overview of the Web archiving community <https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community>`_
- `Overview of the Web archiving community <https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community>`_
- `lazynlp list of sources <https://github.com/chiphuyen/lazynlp>`_

DMOZ (now an archive) and Wikipedia work quite well as primary sources:

- `Qualification of URLs extracted from DMOZ and Wikipedia <https://tel.archives-ouvertes.fr/tel-01167309/document#page=189>`_ (PhD thesis section)
- `Qualification of URLs extracted from DMOZ and Wikipedia <https://theses.hal.science/tel-01167309/document#page=189>`_ (PhD thesis section)

..
https://www.sketchengine.eu/guide/create-a-corpus-from-the-web/
@@ -130,14 +130,10 @@ Social networks

Series of surface scrapers crawl the networks without even logging in, thus circumventing the API restrictions. Development of such software solutions is fast-paced, so no links are listed here at the moment.

Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitter in bulk. See for instance:

- `Twitter datasets for research and archiving <https://tweetsets.library.gwu.edu/>`_
- `Search GitHub for Tweet IDs <https://github.com/search?q=tweet+ids>`_
Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitter in bulk.

Links can be extracted from tweets with a regular expression such as ``re.findall(r'https?://[^ ]+', text)``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
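The extraction step above can be sketched as follows. Resolving the shortened URLs would require network access, so it is only indicated in a comment; the function name is illustrative.

```python
import re

# URL candidates: scheme followed by everything up to the next space.
URL_PATTERN = re.compile(r'https?://[^ ]+')

def extract_links(text):
    """Collect URL candidates from raw tweet text."""
    return URL_PATTERN.findall(text)

tweet = "Read this https://t.co/abc123 and the original https://example.org/post"
print(extract_links(tweet))
# → ['https://t.co/abc123', 'https://example.org/post']
# To resolve shorteners, each candidate would then be fetched, e.g. with
# urllib.request.urlopen(url).geturl(), which follows redirects to the target.
```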


For further ideas from previous projects see references below.


4 changes: 2 additions & 2 deletions docs/troubleshooting.rst
Original file line number Diff line number Diff line change
@@ -34,7 +34,7 @@ Beyond raw HTML

While downloading and processing raw HTML documents is much faster, it can be necessary to fully render the web page before further processing, e.g. because a page makes extensive use of JavaScript or because content is injected from multiple sources.

In such cases the way to go is to use a browser automation library like `Playwright <https://playwright.dev/python/docs/library/>`_. For available alternatives see this `list of headless browsers <https://github.com/dhamaniasad/HeadlessBrowsers>`_.
In such cases the way to go is to use a browser automation library like Playwright. For available alternatives see this `list of headless browsers <https://github.com/dhamaniasad/HeadlessBrowsers>`_.

For more refined masking and automation methods, see the `nodriver <https://github.com/ultrafunkamsterdam/nodriver>`_ and `browserforge <https://github.com/daijro/browserforge>`_ packages.

@@ -43,7 +43,7 @@
Bypassing paywalls
^^^^^^^^^^^^^^^^^^

A browser automation library can also be useful to bypass issues related to cookies and paywalls as it can be combined with a corresponding browser extension, e.g. `iamadamdev's bypass-paywalls-chrome <https://github.com/iamadamdev/bypass-paywalls-chrome>`_ or `this alternative by magnolia1234 <https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean>`_.
A browser automation library can also be useful to bypass issues related to cookies and paywalls as it can be combined with a corresponding browser extension, e.g. iamadamdev's bypass-paywalls-chrome and available alternatives.



2 changes: 1 addition & 1 deletion docs/tutorial-dwds.rst
@@ -82,7 +82,7 @@ In this way, the DWDS platform becomes a kind of meta search engine. The advantage is
Here you will find a `list of the web corpora on the DWDS platform <https://www.dwds.de/d/k-web>`_.


For larger web corpora, filtering for relevance and text quality is mostly quantitative in nature; see `Barbaresi 2015 (PhD thesis), chapter 4 <https://tel.archives-ouvertes.fr/tel-01167309/document>`_ for details. Beyond that, we manually excluded the worst of the web.
For larger web corpora, filtering for relevance and text quality is mostly quantitative in nature; see `Barbaresi 2015 (PhD thesis), chapter 4 <https://theses.hal.science/tel-01167309/document>`_ for details. Beyond that, we manually excluded the worst of the web.


Download und Verarbeitung der Daten
9 changes: 1 addition & 8 deletions docs/tutorial-epsilla.rst
@@ -32,11 +32,7 @@ Alternatives include `Qdrant <https://github.com/qdrant/qdrant>`_, `Redis <https
Setup Epsilla
-------------

In this tutorial, we will need an Epsilla database server. There are two ways to get one: use the free cloud version or start one locally.

Epsilla has a `cloud version <https://cloud.epsilla.com//?ref=trafilatura>`_ with a free tier. You can sign up and get a server running in a few steps.

Alternatively, you can start one locally with a `Docker <https://docs.docker.com/get-started/>`_ image.
In this tutorial, we will need an Epsilla database server. You can start one locally with a `Docker <https://docs.docker.com/get-started/>`_ image.

.. code-block:: bash
@@ -155,7 +151,4 @@ We can now perform a vector search to find the most relevant project based on a
You will see that the returned response is React! That is the correct answer. React is a modern frontend library, but PyTorch and TensorFlow are not.

.. image:: https://static.scarf.sh/a.png?x-pxid=51f549d1-aabf-473c-b971-f8d9c3ac8ac5
:alt:


2 changes: 1 addition & 1 deletion docs/tutorial0.rst
@@ -193,7 +193,7 @@ The output directory can be created on demand, but it must be writable.
# output in XML format, backup of HTML files
$ trafilatura --xml -i list.txt -o xmlfiles/ --backup-dir htmlfiles/
The second and third instructions create a collection of `XML files <https://en.wikipedia.org/wiki/XML>`_ which can be edited with a basic text editor or a full-fledged text-editing software or IDE such as the `Atom editor <https://atom.io/>`_.
The second and third instructions create a collection of `XML files <https://en.wikipedia.org/wiki/XML>`_ which can be edited with a basic text editor or a full-fledged text-editing software or IDE.
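Beyond manual editing, these XML files can also be processed programmatically. A minimal sketch with Python's standard library, using a simplified stand-in document (the element names here are illustrative, not the exact trafilatura output schema):

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for one of the XML files produced above
# (element names are illustrative, not the exact trafilatura schema).
sample = """<doc title="Example page">
  <main>
    <p>First extracted paragraph.</p>
    <p>Second extracted paragraph.</p>
  </main>
</doc>"""

root = ET.fromstring(sample)
# Gather the text of all paragraph elements, wherever they are nested.
paragraphs = [p.text for p in root.iter("p")]
print(root.get("title"))
# → Example page
print(paragraphs)
# → ['First extracted paragraph.', 'Second extracted paragraph.']
```

The same loop works over a whole output directory by parsing each file with ``ET.parse()`` instead of ``ET.fromstring()``.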


.. hint::
24 changes: 11 additions & 13 deletions docs/usage-api.rst
@@ -3,32 +3,32 @@ API

.. meta::
:description lang=en:
See how to use the official Trafilatura API to download and extract data for free or for larger volumes.
See how to use the official Trafilatura API to download and extract data.


Introduction
------------

Simplify the process of turning URLs and HTML into structured, meaningful data!

Use the latest version of the software straight from the application programming interface. The API allows you to access the capabilities of Trafilatura, a web scraping and data extraction library, directly from your applications and projects.

With the Trafilatura API, you can:

- Download URLs or provide your own data, including web scraping capabilities
- Configure the output format to suit your needs, with support for multiple use cases


This is especially useful if you want to try out the software without installing it or if you want to support the project while saving time.


Endpoints
---------
.. warning::
The API is currently unavailable; feel free to get in touch with any inquiries.

The Trafilatura API comes in two versions, available from two different gateways:

- `Free for demonstration purposes <https://trafilatura.mooo.com>`_ (including documentation page)
- `For a larger volume of requests <https://rapidapi.com/trafapi/api/trafilatura>`_ (documentation with code snippets and plans)
..
Endpoints
---------
The Trafilatura API comes in two versions, available from two different gateways:

- `Free for demonstration purposes <https://trafilatura.mooo.com>`_ (including documentation page)
- `For a larger volume of requests <https://rapidapi.com/trafapi/api/trafilatura>`_ (documentation with code snippets and plans)


Making JSON requests
@@ -103,5 +103,3 @@ Further information
-------------------

Please note that the underlying code is not currently open-sourced; feel free to reach out for specific use cases or collaborations.

With the API, you can focus on building your applications and projects, while leaving the heavy lifting to Trafilatura.
2 changes: 1 addition & 1 deletion docs/usage-gui.rst
@@ -39,7 +39,7 @@ Troubleshooting
Mac OS X:

- ``This program needs access to the screen...`` This problem is related to the way you installed Python or the shell you're running:
1. Clone the repository and start with "python trafilatura_gui/interface.py" (`source <https://docs.python.org/3/using/mac.html#running-scripts-with-a-gui>`_)
1. Clone the repository and start with "python trafilatura_gui/interface.py" (`source <https://docs.python.org/3/using/mac.html>`_)
2. `Configure your virtual environment <https://wiki.wxpython.org/wxPythonVirtualenvOnMac>`_ (Python3 and wxpython 4.1.0)


2 changes: 1 addition & 1 deletion docs/usage-r.rst
@@ -11,7 +11,7 @@ Introduction
------------


R is a free software environment for statistical computing and graphics. `Reticulate <https://rstudio.github.io/reticulate>`_ is an R package that enables easy interoperability between R and Python. With Reticulate, you can import Python modules as if they were R packages and call Python functions from R.
R is a free software environment for statistical computing and graphics. `Reticulate <https://rstudio.github.io/reticulate/>`_ is an R package that enables easy interoperability between R and Python. With Reticulate, you can import Python modules as if they were R packages and call Python functions from R.

This allows R users to leverage the vast array of Python packages and tools and basically allows for execution of Python code inside an R session. Python packages can then be used with minimal adaptations rather than having to go back and forth between languages and environments.

