prepare version 2.0.0 (adbar#759)
* prepare version 2.0.0

* update setup and wording

* docs: readme and structure

* update dependabot and funding

* update contributing and history files
adbar authored Dec 3, 2024
1 parent b7bfcc3 commit c6e8340
Showing 9 changed files with 85 additions and 111 deletions.
2 changes: 1 addition & 1 deletion .github/FUNDING.yml
@@ -1,6 +1,6 @@
# These are supported funding model platforms

github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
github: [adbar]
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: adbarbaresi
18 changes: 0 additions & 18 deletions .github/dependabot.yml

This file was deleted.

26 changes: 14 additions & 12 deletions CONTRIBUTING.md
@@ -1,39 +1,41 @@
## How to contribute

Thank you for considering contributing to Trafilatura! Your contributions make the software and its documentation better.
Your contributions make the software and its documentation better. A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.


There are many ways to contribute, you could:

* Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
* Find bugs and submit bug reports: Help make Trafilatura a robust and versatile tool.
* Find bugs and submit bug reports: Help make Trafilatura an even more robust tool.
* Submit feature requests: Share your feedback and suggestions.
* Write code: Fix bugs or add new features.


Here are some important resources:

* [List of currently open issues](https://github.com/adbar/trafilatura/issues) (not an exhaustive list!)
* [Roadmap and milestones](https://github.com/adbar/trafilatura/milestones)
* [How to Contribute to Open Source](https://opensource.guide/how-to-contribute/)
* [How to contribute to open source](https://opensource.guide/how-to-contribute/)


## Submitting changes
## Testing and evaluating the code

Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/trafilatura/pull/new/master) with a clear list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
Here is how you can run the tests and code quality checks:

**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)
- Install the necessary packages with `pip install trafilatura[dev]`
- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
- Run `mypy` on the directory: `mypy trafilatura/`
- See also the [tests Readme](tests/README.rst) for information on the evaluation benchmark

Pull requests will only be accepted if there are no errors in pytest and mypy.

A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.
If you work on text extraction, it is useful to check that performance on the benchmark is equal or better.


## Testing and evaluating the code
## Submitting changes

Here is how you can run the tests if you wish to correct the errors and further improve the code:
Please send a pull request to Trafilatura with a list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).

- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
- See also the [tests Readme](tests/README.rst) for information on the evaluation
**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)



12 changes: 10 additions & 2 deletions HISTORY.md
@@ -1,7 +1,7 @@
## History / Changelog


## future v2.0.0
## 2.0.0

Breaking changes:
- Python 3.6 and 3.7 deprecated (#709)
@@ -12,6 +12,7 @@ Breaking changes:
- downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
- deprecated graphical user interface now removed (#713)
- extraction: move `max_tree_size` parameter to `settings.cfg` (#742)
- use type hinting (#721, #723, #748)
- see [Python](https://trafilatura.readthedocs.io/en/latest/usage-python.html#deprecations) and [CLI](https://trafilatura.readthedocs.io/en/latest/usage-cli.html#deprecations) deprecations in the docs

Fixes:
@@ -20,11 +21,16 @@ Fixes:
- more robust mapping for conversion to HTML (#721)
- CLI downloads: use all information in settings file (#734)
- downloads: cleaner urllib3 code (#736)
- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
- refine table markdown output by @unsleepy22 (#752)
- extraction fix: images in text nodes by @unsleepy22 (#757)

Metadata:
- more robust URL extraction (#710)

Command-line interface:
- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
- CLI: add 126 exit code for high error ratio (#747)
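
The exit-code entry above can be illustrated with a minimal sketch. Only the value `126` comes from the changelog; the threshold, the function name, and the counting logic below are hypothetical and are not taken from Trafilatura's source:

```python
# Hypothetical sketch: map the share of failed downloads to a process exit code.
HIGH_ERROR_EXIT_CODE = 126   # value named in the changelog entry
MAX_ERROR_RATIO = 0.5        # assumed threshold, for illustration only


def exit_code(num_errors: int, num_total: int,
              max_ratio: float = MAX_ERROR_RATIO) -> int:
    """Return 126 if the error ratio exceeds max_ratio, else 0."""
    if num_total == 0:
        # Nothing was processed, so there is no ratio to report.
        return 0
    ratio = num_errors / num_total
    return HIGH_ERROR_EXIT_CODE if ratio > max_ratio else 0


if __name__ == "__main__":
    # 9 failures out of 10 URLs is above the assumed threshold.
    print(exit_code(9, 10))
```

A distinct exit code lets shell scripts and schedulers tell a run that mostly failed apart from one that succeeded, without parsing log output.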

Maintenance:
- remove already deprecated functions and args (#716)
- add type hints (#723, #728)
@@ -33,10 +39,12 @@ Maintenance:
- better debug messages in `main_extractor` (#714)
- evaluation: review data, update packages, add magic_html (#731)
- setup: explicit exports through `__all__` (#740)
- tests: extend coverage (#753)
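
As a reminder of what explicit exports through `__all__` do in Python, here is a small self-contained sketch. The module name `demo_pkg` and its functions are hypothetical and unrelated to Trafilatura's actual exports:

```python
import sys
import types

# Source of a throwaway in-memory module; all names are illustrative.
SOURCE = '''
__all__ = ["extract"]

def extract(text):
    return text.strip()

def internal_tool(text):
    return text
'''

demo = types.ModuleType("demo_pkg")
exec(SOURCE, demo.__dict__)
sys.modules["demo_pkg"] = demo  # make it importable by name

# A star import only copies the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Names left out of `__all__` remain reachable through the module object (here `demo.internal_tool`), so the list documents the public interface rather than enforcing privacy.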

Documentation:
- fix link in `docs/index.html` by @nzw0301 (#711)
- remove docs from published packages (#743)
- update docs (#745)


## 1.12.2
65 changes: 22 additions & 43 deletions README.md
@@ -32,15 +32,16 @@ required, the output can be converted to commonly used formats.

Going from HTML bulk to essential parts can alleviate many problems
related to text quality, by **focusing on the actual content**,
**avoiding the noise** caused by recurring elements (headers, footers
etc.), and **making sense of the data** with selected information. The
extractor is designed to be **robust and reasonably fast**; it runs in
production on millions of documents.
**avoiding the noise** caused by recurring elements like headers and footers
and by **making sense of the data and metadata** with selected information.
The extractor strikes a balance between limiting noise (precision) and
including all valid parts (recall). It is **robust and reasonably fast**.

The tool's versatility makes it **useful for quantitative and
data-driven approaches**. It is used in the academic domain and beyond
(e.g. in natural language processing, computational social science,
search engine optimization, and information security).
Trafilatura is [widely used](https://trafilatura.readthedocs.io/en/latest/used-by.html)
and integrated into [thousands of projects](https://github.com/adbar/trafilatura/network/dependents)
by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like
the Allen Institute, Stanford, the Tokyo Institute of Technology, and
the University of Munich.


### Features
@@ -85,22 +86,6 @@ For more information see the [benchmark section](https://trafilatura.readthedocs
and the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst)
to run the evaluation with the latest data and packages.

**750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8**

| Python Package | Precision | Recall | Accuracy | F-Score | Diff. |
|----------------|-----------|--------|----------|---------|-------|
| html_text 0.5.2 | 0.529 | **0.958** | 0.554 | 0.682 | 2.2x |
| inscriptis 2.2.0 (html to txt) | 0.534 | **0.959** | 0.563 | 0.686 | 3.5x |
| newspaper3k 0.2.8 | 0.895 | 0.593 | 0.762 | 0.713 | 12x |
| justext 3.0.0 (custom) | 0.865 | 0.650 | 0.775 | 0.742 | 5.2x |
| boilerpy3 1.0.6 (article mode) | 0.814 | 0.744 | 0.787 | 0.777 | 4.1x |
| *baseline (text markup)* | 0.757 | 0.827 | 0.781 | 0.790 | **1x** |
| goose3 3.1.9 | **0.934** | 0.690 | 0.821 | 0.793 | 22x |
| readability-lxml 0.8.1 | 0.891 | 0.729 | 0.820 | 0.801 | 5.8x |
| news-please 1.5.22 | 0.898 | 0.734 | 0.826 | 0.808 | 61x |
| readabilipy 0.2.0 | 0.877 | 0.870 | 0.874 | 0.874 | 248x |
| trafilatura 1.2.2 (standard) | 0.914 | 0.904 | **0.910** | **0.909** | 7.1x |
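
The F-Score column above appears consistent with the usual harmonic mean of precision and recall. The following is a minimal sketch assuming the standard F1 definition; the function names are illustrative and not part of the evaluation scripts. Because the table values are rounded, the last digit can differ for some rows:

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of extracted segments that are valid text (limits noise)."""
    extracted = true_pos + false_pos
    return true_pos / extracted if extracted else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of valid text segments that were extracted (limits omissions)."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Rounded precision and recall from the trafilatura row of the table.
print(round(f_score(0.914, 0.904), 3))  # 0.909
```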


#### Other evaluations:

@@ -138,7 +123,7 @@ This package is distributed under the [Apache 2.0 license](https://www.apache.or
Versions prior to v1.8.0 are under GPLv3+ license.


## Contributing
### Contributing

Contributions of all kinds are welcome. Visit the [Contributing
page](https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md)
@@ -152,13 +137,17 @@ who extended the docs or submitted bug reports, features and bugfixes!

## Context

Developed with practical applications of academic research in mind, this
software is part of a broader effort to derive information from web
documents. Extracting and pre-processing web texts to the exacting
standards of scientific research presents a substantial challenge. This
software package simplifies text data collection and enhances corpus
quality; it is currently used to build [text databases for linguistic
research](https://www.dwds.de/d/k-web).
This work started as a PhD project at the crossroads of linguistics and
NLP; this expertise has been instrumental in shaping Trafilatura over
the years. Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this package continues to be maintained but its future development
depends on community support.

**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support will
help maintain and enhance this popular package, ensuring its growth,
robustness, and accessibility for developers and users around the world.

*Trafilatura* is an Italian word for [wire
drawing](https://en.wikipedia.org/wiki/Wire_drawing) symbolizing the
@@ -171,11 +160,6 @@ Reach out via the software repository or the [contact
page](https://adrien.barbaresi.eu/) for inquiries, collaborations, or
feedback. See also social networks for the latest updates.

This work started as a PhD project at the crossroads of linguistics and
NLP; this expertise has been instrumental in shaping Trafilatura over
the years. It was first released in its current form in 2019;
its development is referenced in the following publications:

- Barbaresi, A. [Trafilatura: A Web Scraping Library and Command-Line
Tool for Text Discovery and
Extraction](https://aclanthology.org/2021.acl-demo.15/), Proceedings
@@ -212,18 +196,13 @@ acquisition. Here is how to cite it:

### Software ecosystem

Case studies and publications are listed on the [Used By documentation
page](https://trafilatura.readthedocs.io/en/latest/used-by.html).

Jointly developed plugins and additional packages also contribute to the
field of web data extraction and analysis:

<img alt="Software ecosystem" src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/software-ecosystem.png" align="center" width="65%"/>

Corresponding posts can be found on [Bits of
Language](https://adrien.barbaresi.eu/blog/tag/trafilatura.html). The
blog covers a range of topics from technical how-tos, updates on new
features, to discussions on text mining challenges and solutions.
Language](https://adrien.barbaresi.eu/blog/tag/trafilatura.html).

Impressive, you have reached the end of the page: Thank you for your
interest!
43 changes: 25 additions & 18 deletions docs/index.rst
@@ -40,9 +40,9 @@ Description

Trafilatura is a **Python package and command-line tool** designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to commonly used formats.

Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the **noise caused by recurring elements** (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to **make sense of the data**. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be **robust and reasonably fast**; it runs in production on millions of documents.
Going from raw HTML to essential parts can alleviate many problems related to text quality, by avoiding the **noise caused by recurring elements** like headers and footers and by **making sense of the data and metadata** with selected information. The extractor strikes a balance between limiting noise (precision) and including all valid parts (recall). It is **robust and reasonably fast**.

This tool can be **useful for quantitative research** in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
Trafilatura is `widely used <used-by.html>`_ and integrated into `thousands of projects <https://github.com/adbar/trafilatura/network/dependents>`_ by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like the Allen Institute, Stanford, the Tokyo Institute of Technology, and the University of Munich.


Features
@@ -120,25 +120,27 @@ Versions prior to v1.8.0 are under GPLv3+ license.


Contributing
------------
~~~~~~~~~~~~

Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.

Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!


Changes
-------

For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.


Context
-------

Originally released to collect data for linguistic research and lexicography at the `Berlin-Brandenburg Academy of Sciences <https://www.dwds.de/d/k-web>`_, Trafilatura is now `widely used <used-by.html>`_.
This work started as a PhD project at the crossroads of linguistics and NLP;
this expertise has been instrumental in shaping Trafilatura over the years.
Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this package continues to be maintained but its future development
depends on community support.

Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. These documentation pages also provide information on `concepts behind data collection <background.html>`_ as well as `tutorials <tutorials.html>`_ on how to gather web texts.
**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support will
help maintain and enhance this popular package, ensuring its growth,
robustness, and accessibility for developers and users around the world.

*Trafilatura* is an Italian word for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_ symbolizing the refinement and conversion process. It is also the way shapes of pasta are formed.

@@ -148,9 +150,6 @@ Author

Reach out via the software repository or the `contact page <https://adrien.barbaresi.eu/>`_ for inquiries, collaborations, or feedback. See also social networks for the latest updates.

This work started as a PhD project at the crossroads of linguistics and NLP; this expertise has been instrumental in shaping Trafilatura over the years. It was first released in its current form in 2019; its development is referenced in the following publications:


- Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
- Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.
@@ -186,16 +185,17 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
Software ecosystem
~~~~~~~~~~~~~~~~~~

Case studies and publications are listed on the `Used By documentation page <used-by.html>`_.

Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:

.. image:: software-ecosystem.png
:alt: Software ecosystem
:align: center
:width: 65%

Corresponding posts on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_ (blog).
Corresponding posts can be found on
`Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_.
The blog covers a range of topics from technical how-tos, updates on new
features, to discussions on text mining challenges and solutions.


Building the docs
@@ -208,6 +208,13 @@ Starting from the ``docs/`` folder of the repository:



Changes
-------

For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.



Further documentation
=====================

@@ -222,4 +229,4 @@ Further documentation
used-by
background

* :ref:`genindex`
:ref:`genindex`