docs: update info, links and dependency (#773)
* update docs

* lint text
adbar authored Dec 28, 2024
1 parent 91c567c commit 42ada5a
Showing 15 changed files with 55 additions and 86 deletions.
26 changes: 11 additions & 15 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,42 +1,38 @@
## How to contribute

Your contributions make the software and its documentation better. A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.

If you value this software or depend on it for your product,
consider sponsoring it and contributing to its codebase.
Your support will help ensure the sustainability and growth of the project.

There are many ways to contribute, you could:
There are many ways to contribute:

* Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
* Sponsor the project: Show your appreciation [on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi).
* Find bugs and submit bug reports: Help make Trafilatura an even more robust tool.
* Write code: Fix bugs or add new features by writing [pull requests](https://docs.github.com/en/pull-requests) with a list of what you have done.
* Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
* Submit feature requests: Share your feedback and suggestions.
* Write code: Fix bugs or add new features.


Here are some important resources:

* [List of currently open issues](https://github.com/adbar/trafilatura/issues) (by no means exhaustive!)
* [How to contribute to open source](https://opensource.guide/how-to-contribute/)

A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.


## Testing and evaluating the code

Here is how you can run the tests and code quality checks:
Here is how you can run the tests and code quality checks. Pull requests will only be accepted if the changes are tested and there are no errors.

- Install the necessary packages with `pip install trafilatura[dev]`
- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
- Run `mypy` on the directory: `mypy trafilatura/`
- See also the [tests Readme](tests/README.rst) for information on the evaluation benchmark

Pull requests will only be accepted if there are no errors in pytest and mypy.

If you work on text extraction, it is useful to check whether performance is equal or better on the benchmark.


## Submitting changes

Please send a pull request to Trafilatura with a list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).

**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)

See the [tests Readme](tests/README.rst) for more information.


For further questions you can use [GitHub issues](https://github.com/adbar/trafilatura/issues) and discussion pages, or [E-Mail](https://adrien.barbaresi.eu/).
9 changes: 4 additions & 5 deletions README.md
@@ -141,13 +141,12 @@ This work started as a PhD project at the crossroads of linguistics and
NLP; this expertise has been instrumental in shaping Trafilatura over
the years. Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this package continues to be maintained but its future development
depends on community support.
this package continues to be maintained but its future depends on community support.

**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support will
help maintain and enhance this popular package, ensuring its growth,
robustness, and accessibility for developers and users around the world.
sponsoring it and contributing to its codebase**. Your support
[on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi)
will help maintain and enhance this popular package.

*Trafilatura* is an Italian word for [wire
drawing](https://en.wikipedia.org/wiki/Wire_drawing) symbolizing the
5 changes: 2 additions & 3 deletions docs/conf.py
@@ -20,8 +20,8 @@

# -- Project information -----------------------------------------------------

project = 'trafilatura'
copyright = '2024, Adrien Barbaresi'
project = 'Trafilatura'
copyright = '2025, Adrien Barbaresi'
html_show_sphinx = False
author = 'Adrien Barbaresi'
version = trafilatura.__version__
@@ -88,7 +88,6 @@
## pydata options
html_theme_options = {
"github_url": "https://github.com/adbar/trafilatura",
"twitter_url": "https://twitter.com/adbarbaresi",
"external_links": [
{"name": "Blog", "url": "https://adrien.barbaresi.eu/blog/tag/trafilatura.html"},
],
6 changes: 3 additions & 3 deletions docs/corpus-data.rst
@@ -45,7 +45,7 @@ Formats and software used in corpus linguistics

Input/Output formats: TXT, XML and XML-TEI are quite frequent in corpus linguistics.

- Han., N.-R. (2022). "`Transforming Data <https://doi.org/10.7551/mitpress/12200.003.0010>`_", The Open Handbook of Linguistic Data.
- Han., N.-R. (2022). "Transforming Data", The Open Handbook of Linguistic Data.


The XML and XML-TEI formats
@@ -62,9 +62,9 @@ Corpus analysis tools
- `CorpusExplorer <https://notes.jan-oliver-ruediger.de/software/corpusexplorer-overview/>`_ supports CSV, TXT and various XML formats
- `Corpus Workbench (CWB) <https://cwb.sourceforge.io/>`_ uses verticalized texts whose origin can be in TXT or XML format
- `LancsBox <http://corpora.lancs.ac.uk/lancsbox/>`_ supports various formats, notably TXT & XML
- `TXM <http://textometrie.ens-lyon.fr/?lang=en>`_ (textometry platform) can take TXT, XML & XML-TEI files as input
- `TXM <https://txm.gitpages.huma-num.fr/textometrie/en/>`_ (textometry platform) can take TXT, XML & XML-TEI files as input
- `Voyant <https://voyant-tools.org/>`_ supports various formats, notably TXT, XML & XML-TEI
- `Wmatrix <http://ucrel.lancs.ac.uk/wmatrix/>`_ can work with TXT and XML
- `Wmatrix <https://ucrel.lancs.ac.uk/wmatrix/>`_ can work with TXT and XML
- `WordSmith <https://lexically.net/wordsmith/index.html>`_ supports TXT and XML

Further corpus analysis software can be found on `corpus-analysis.com <https://corpus-analysis.com/>`_.
22 changes: 7 additions & 15 deletions docs/index.rst
@@ -118,29 +118,21 @@ This package is distributed under the `Apache 2.0 license <https://www.apache.or
Versions prior to v1.8.0 are under GPLv3+ license.



Contributing
~~~~~~~~~~~~

Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.

Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!


Context
-------

This work started as a PhD project at the crossroads of linguistics and NLP;
this expertise has been instrumental in shaping Trafilatura over the years.
Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this package continues to be maintained but its future development
depends on community support.
this package continues to be maintained but its future depends on community support.

**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support will
help maintain and enhance this popular package, ensuring its growth,
robustness, and accessibility for developers and users around the world.
sponsoring it and contributing to its codebase**. Your support
`on GitHub <https://github.com/sponsors/adbar>`_ or `ko-fi.com <https://ko-fi.com/adbarbaresi>`_
will help maintain and enhance this popular package.
Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_
for more information.

*Trafilatura* is an Italian word for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_ symbolizing the refinement and conversion process. It is also the way shapes of pasta are formed.

@@ -225,8 +217,8 @@ Further documentation
usage
tutorials
evaluation
corefunctions
used-by
corefunctions
background

:ref:`genindex`
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,6 +1,6 @@
# with version specifier
sphinx>=8.1.3
pydata-sphinx-theme>=0.16.0
pydata-sphinx-theme>=0.16.1
docutils>=0.21.2
sphinx-sitemap>=2.6.0

12 changes: 4 additions & 8 deletions docs/sources.rst
@@ -39,18 +39,18 @@ Corpora
URL lists from corpus linguistic projects can be a useful starting point, either to recreate existing corpora or to re-crawl the websites and find new content. If the websites do not exist anymore, the links can still be useful, as the corresponding web pages can often be retrieved from web archives.

- `Sources for the Internet Corpora <http://corpus.leeds.ac.uk/internet.html>`_ of the Leeds Centre for Translation Studies
- `Link data sets <https://www.webcorpora.org/opendata/links/>`_ of the COW project
- `Link data sets <http://www.webcorpora.org/opendata/links/>`_ of the COW project
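When links from such lists have gone dead, a candidate snapshot can be located programmatically through the Internet Archive. Below is a minimal sketch using only Python's standard library to build a query for the public Wayback Machine availability endpoint; the endpoint URL and its parameters are assumptions based on that public API, not part of the corpus projects above.

```python
import urllib.parse

# Public Wayback Machine availability endpoint (assumed; check the
# Internet Archive documentation before relying on it).
WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build a query URL asking for the archived snapshot closest
    to the given timestamp (YYYY[MM[DD]])."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urllib.parse.urlencode(params)

print(availability_query("http://example.org/old-page", "2015"))
# Fetching this URL (e.g. with urllib.request) returns JSON describing
# the closest archived snapshot, if any.
```

The query construction is pure string handling, so dead links can be batched offline first and only then fetched against the archive.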


URL directories
~~~~~~~~~~~~~~~

- `Overview of the Web archiving community <https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community>`_
- `Overview of the Web archiving community <https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community>`_
- `lazynlp list of sources <https://github.com/chiphuyen/lazynlp>`_

DMOZ (now an archive) and Wikipedia work quite well as primary sources:

- `Qualification of URLs extracted from DMOZ and Wikipedia <https://tel.archives-ouvertes.fr/tel-01167309/document#page=189>`_ (PhD thesis section)
- `Qualification of URLs extracted from DMOZ and Wikipedia <https://theses.hal.science/tel-01167309/document#page=189>`_ (PhD thesis section)

..
https://www.sketchengine.eu/guide/create-a-corpus-from-the-web/
@@ -130,14 +130,10 @@ Social networks

Series of surface scrapers crawl the networks without even logging in, thus circumventing the API restrictions. Development of such software solutions is fast-paced, so no links are listed here at the moment.

Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitter in bulk. See for instance:

- `Twitter datasets for research and archiving <https://tweetsets.library.gwu.edu/>`_
- `Search GitHub for Tweet IDs <https://github.com/search?q=tweet+ids>`_
Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitter in bulk.

Links can be extracted from tweets with a regular expression such as ``re.findall(r'https?://[^ ]+', text)``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
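The extraction step above can be sketched as follows. Resolving the shortened URLs would require network access, so it is only indicated in a comment; the function name is illustrative.

```python
import re

# URL candidates: scheme followed by everything up to the next space.
URL_PATTERN = re.compile(r'https?://[^ ]+')

def extract_links(text):
    """Collect URL candidates from raw tweet text."""
    return URL_PATTERN.findall(text)

tweet = "Read this https://t.co/abc123 and the original https://example.org/post"
print(extract_links(tweet))
# → ['https://t.co/abc123', 'https://example.org/post']
# To resolve shorteners, each candidate would then be fetched, e.g. with
# urllib.request.urlopen(url).geturl(), which follows redirects to the target.
```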


For further ideas from previous projects see references below.


4 changes: 2 additions & 2 deletions docs/troubleshooting.rst
Original file line number Diff line number Diff line change
@@ -34,7 +34,7 @@ Beyond raw HTML

While downloading and processing raw HTML documents is much faster, it can be necessary to fully render the web page before further processing, e.g. because a page makes extensive use of JavaScript or because content is injected from multiple sources.

In such cases the way to go is to use a browser automation library like `Playwright <https://playwright.dev/python/docs/library/>`_. For available alternatives see this `list of headless browsers <https://github.com/dhamaniasad/HeadlessBrowsers>`_.
In such cases the way to go is to use a browser automation library like Playwright. For available alternatives see this `list of headless browsers <https://github.com/dhamaniasad/HeadlessBrowsers>`_.

For more refined masking and automation methods, see the `nodriver <https://github.com/ultrafunkamsterdam/nodriver>`_ and `browserforge <https://github.com/daijro/browserforge>`_ packages.

@@ -43,7 +43,7 @@
Bypassing paywalls
^^^^^^^^^^^^^^^^^^

A browser automation library can also be useful to bypass issues related to cookies and paywalls as it can be combined with a corresponding browser extension, e.g. `iamadamdev's bypass-paywalls-chrome <https://github.com/iamadamdev/bypass-paywalls-chrome>`_ or `this alternative by magnolia1234 <https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean>`_.
A browser automation library can also be useful to bypass issues related to cookies and paywalls as it can be combined with a corresponding browser extension, e.g. iamadamdev's bypass-paywalls-chrome and available alternatives.



2 changes: 1 addition & 1 deletion docs/tutorial-dwds.rst
@@ -82,7 +82,7 @@ In this way, the DWDS platform becomes a kind of meta search engine. The advantage is
Here you will find a `list of the web corpora on the DWDS platform <https://www.dwds.de/d/k-web>`_.


For larger web corpora, filtering for relevance and text quality is mostly quantitative in nature; see `Barbaresi 2015 (PhD thesis), chapter 4 <https://tel.archives-ouvertes.fr/tel-01167309/document>`_ for details. Beyond that, we manually excluded the worst of the web.
For larger web corpora, filtering for relevance and text quality is mostly quantitative in nature; see `Barbaresi 2015 (PhD thesis), chapter 4 <https://theses.hal.science/tel-01167309/document>`_ for details. Beyond that, we manually excluded the worst of the web.


Download und Verarbeitung der Daten
9 changes: 1 addition & 8 deletions docs/tutorial-epsilla.rst
@@ -32,11 +32,7 @@ Alternatives include `Qdrant <https://github.com/qdrant/qdrant>`_, `Redis <https
Setup Epsilla
-------------

In this tutorial, we will need an Epsilla database server. There are two ways to get one: use the free cloud version or start one locally.

Epsilla has a `cloud version <https://cloud.epsilla.com//?ref=trafilatura>`_ with a free tier. You can sign up and get a server running in a few steps.

Alternatively, you can start one locally with a `Docker <https://docs.docker.com/get-started/>`_ image.
In this tutorial, we will need an Epsilla database server. You can start one locally with a `Docker <https://docs.docker.com/get-started/>`_ image.

.. code-block:: bash
@@ -155,7 +151,4 @@ We can now perform a vector search to find the most relevant project based on a
You will see that the returned response is React! That is the correct answer. React is a modern frontend library, but PyTorch and TensorFlow are not.

.. image:: https://static.scarf.sh/a.png?x-pxid=51f549d1-aabf-473c-b971-f8d9c3ac8ac5
:alt:


2 changes: 1 addition & 1 deletion docs/tutorial0.rst
@@ -193,7 +193,7 @@ The output directory can be created on demand, but it must be writable.
# output in XML format, backup of HTML files
$ trafilatura --xml -i list.txt -o xmlfiles/ --backup-dir htmlfiles/
The second and third instructions create a collection of `XML files <https://en.wikipedia.org/wiki/XML>`_ which can be edited with a basic text editor or a full-fledged text-editing software or IDE such as the `Atom editor <https://atom.io/>`_.
The second and third instructions create a collection of `XML files <https://en.wikipedia.org/wiki/XML>`_ which can be edited with a basic text editor or a full-fledged text-editing software or IDE.
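Beyond manual editing, these XML files can also be processed programmatically. A minimal sketch with Python's standard library, using a simplified stand-in document (the element names here are illustrative, not the exact trafilatura output schema):

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for one of the XML files produced above
# (element names are illustrative, not the exact trafilatura schema).
sample = """<doc title="Example page">
  <main>
    <p>First extracted paragraph.</p>
    <p>Second extracted paragraph.</p>
  </main>
</doc>"""

root = ET.fromstring(sample)
# Gather the text of all paragraph elements, wherever they are nested.
paragraphs = [p.text for p in root.iter("p")]
print(root.get("title"))
# → Example page
print(paragraphs)
# → ['First extracted paragraph.', 'Second extracted paragraph.']
```

The same loop works over a whole output directory by parsing each file with ``ET.parse()`` instead of ``ET.fromstring()``.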


.. hint::
24 changes: 11 additions & 13 deletions docs/usage-api.rst
@@ -3,32 +3,32 @@ API

.. meta::
:description lang=en:
See how to use the official Trafilatura API to download and extract data for free or for larger volumes.
See how to use the official Trafilatura API to download and extract data.


Introduction
------------

Simplify the process of turning URLs and HTML into structured, meaningful data!

Use the latest version of the software straight from the application programming interface. The API allows you to access the capabilities of Trafilatura, a web scraping and data extraction library, directly from your applications and projects.

With the Trafilatura API, you can:

- Download URLs or provide your own data, including web scraping capabilities
- Configure the output format to suit your needs, with support for multiple use cases


This is especially useful if you want to try out the software without installing it or if you want to support the project while saving time.


Endpoints
---------
.. warning::
The API is currently unavailable; feel free to get in touch with any inquiries.

The Trafilatura API comes in two versions, available from two different gateways:

- `Free for demonstration purposes <https://trafilatura.mooo.com>`_ (including documentation page)
- `For a larger volume of requests <https://rapidapi.com/trafapi/api/trafilatura>`_ (documentation with code snippets and plans)
..
Endpoints
---------
The Trafilatura API comes in two versions, available from two different gateways:

- `Free for demonstration purposes <https://trafilatura.mooo.com>`_ (including documentation page)
- `For a larger volume of requests <https://rapidapi.com/trafapi/api/trafilatura>`_ (documentation with code snippets and plans)


Making JSON requests
@@ -103,5 +103,3 @@ Further information
-------------------

Please note that the underlying code is not currently open-sourced; feel free to reach out for specific use cases or collaborations.

With the API, you can focus on building your applications and projects, while leaving the heavy lifting to Trafilatura.
2 changes: 1 addition & 1 deletion docs/usage-gui.rst
@@ -39,7 +39,7 @@ Troubleshooting
Mac OS X:

- ``This program needs access to the screen...`` This problem is related to the way you installed Python or the shell you're running:
1. Clone the repository and start with "python trafilatura_gui/interface.py" (`source <https://docs.python.org/3/using/mac.html#running-scripts-with-a-gui>`_)
1. Clone the repository and start with "python trafilatura_gui/interface.py" (`source <https://docs.python.org/3/using/mac.html>`_)
2. `Configure your virtual environment <https://wiki.wxpython.org/wxPythonVirtualenvOnMac>`_ (Python3 and wxpython 4.1.0)


2 changes: 1 addition & 1 deletion docs/usage-r.rst
@@ -11,7 +11,7 @@ Introduction
------------


R is a free software environment for statistical computing and graphics. `Reticulate <https://rstudio.github.io/reticulate>`_ is an R package that enables easy interoperability between R and Python. With Reticulate, you can import Python modules as if they were R packages and call Python functions from R.
R is a free software environment for statistical computing and graphics. `Reticulate <https://rstudio.github.io/reticulate/>`_ is an R package that enables easy interoperability between R and Python. With Reticulate, you can import Python modules as if they were R packages and call Python functions from R.

This allows R users to leverage the vast array of Python packages and tools and basically allows for execution of Python code inside an R session. Python packages can then be used with minimal adaptations rather than having to go back and forth between languages and environments.

