History / Changelog

1.2.2

  • more efficient rules for extraction
  • metadata: further attributes used (with @felipehertzer)
  • better baseline extraction
  • issues fixed: #202, #204, #205
  • evaluation updated

1.2.1

  • --precision and --recall arguments added to the CLI
  • better text cleaning: paywalls and comments
  • improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
  • further bugs fixed: #189, #192 (with @felipehertzer), #200
  • efficiency: faster module loading and improved RAM footprint

1.2.0

  • efficiency: replaced module readability-lxml by trimmed fork
  • bugs fixed: #179, #180, #183, #184
  • improved baseline extraction
  • cleaner metadata (with @felipehertzer)

1.1.0

  • encodings: better detection, output NFC-normalized Unicode
  • maintenance and performance: more efficient code
  • bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
  • prepare compatibility with upcoming Python 3.11
  • changed default settings
  • extended documentation
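
To illustrate what NFC-normalized output means in the 1.1.0 notes, here is a minimal sketch using only the standard library. It is not the package's own code, and `to_nfc` is a hypothetical helper name chosen for this example.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC form of text (hypothetical helper, not part of the package)."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by a combining acute accent: two code points for one visible character
decomposed = "cafe\u0301"
# After NFC normalization the pair is merged into the single code point U+00E9
composed = to_nfc(decomposed)
```

Both strings render identically, but only the composed form compares equal to text typed with a precomposed "é", which is why normalized output is easier to deduplicate and search.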

1.0.0

  • compress HTML backup files & seamlessly open .gz files
  • support JSON web feeds
  • graphical user interface integrated into main package
  • faster downloads: reviewed backoff, compressed data
  • optional modules: downloads with pycurl, language identification with py3langid
  • bugs fixed (#111, #125, #132, #136, #140)
  • minor optimizations and fixes by @vbarbaresi in #124 & #130
  • fixed arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
  • code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+
  • drop support for Python 3.5
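
The first item in the 1.0.0 list (compressed HTML backups, transparent reading of .gz files) can be sketched with the standard library. This is an assumption about one way to do it, not the package's implementation: `write_backup` and `read_html` are hypothetical names, and detecting gzip by its two magic bytes is a choice made for this example.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def write_backup(path: Path, html: str) -> None:
    """Store an HTML document as a gzip-compressed UTF-8 file (hypothetical helper)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(html)


def read_html(path: Path) -> str:
    """Read an HTML file, decompressing it first if it starts with the gzip magic bytes."""
    raw = path.read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


# Usage: the same reader handles compressed and plain files
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "page.html.gz"
    plain_path = Path(tmp) / "page.html"
    write_backup(gz_path, "<html><body>hi</body></html>")
    plain_path.write_text("<html></html>", encoding="utf-8")
    from_gz = read_html(gz_path)
    from_plain = read_html(plain_path)
```

Checking the content rather than the file extension means a misnamed file is still read correctly.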

0.9.3

  • better, faster encoding detection: replaced chardet with charset_normalizer
  • faster execution: updated justext to 3.0
  • better extraction of sub-elements in tables (#78, #90)
  • more robust web feed parsing
  • further defined precision- and recall-oriented settings
  • license extraction in footers (#118)

0.9.2

  • first precision- and recall-oriented presets defined
  • improvements in authorship extraction (thanks @felipehertzer)
  • requesting TXT output with formatting now results in Markdown format
  • bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
  • setting for cookies in request headers (thanks @muellermartin)
  • better date extraction thanks to htmldate update

0.9.1

  • improved author extraction (thanks @felipehertzer!)
  • bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
  • docs updated and extended
  • CLI: option names normalized (heed deprecation warnings), new option explore

0.9.0

  • focused crawling functions including politeness rules
  • more efficient multi-threaded downloads + use as Python functions
  • documentation extended
  • bugs fixed: extraction and URL handling
  • removed support for Python 3.4

0.8.2

  • better handling of formatting, links and images, title type as attribute in XML formats
  • more robust sitemaps and feeds processing
  • more accurate extraction
  • further consolidation: code simplified and bugs fixed

0.8.1

  • extraction trade-off: slightly better recall
  • code robustness: requests, configuration and navigation
  • bugfixes: image data extraction

0.8.0

  • improved link discovery and handling
  • fixes in metadata extraction, feeds and sitemaps processing
  • breaking change: the extract function now reads target format from output_format argument only
  • new extraction option: preserve links, CLI options re-ordered
  • more opportunistic backup extraction

0.7.0

  • customizable configuration file to parametrize extraction and downloads
  • better handling of feeds and sitemaps
  • additional CLI options: cryptographic hash for file name, use Internet Archive as backup
  • more precise extraction
  • faster downloads: requests replaced with bare urllib3 and custom decoding
  • consolidation: bug fixes and improvements, many thanks to the issues reporters!
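
As a rough illustration of the hash-based file names mentioned in the 0.7.0 notes, the sketch below derives a name from a SHA-256 digest. The changelog does not say which input or algorithm the CLI uses, so hashing the text content with SHA-256 and truncating to 16 characters are assumptions, and `hashed_filename` is a hypothetical helper.

```python
import hashlib


def hashed_filename(content: str, extension: str = ".txt", length: int = 16) -> str:
    """Build a file name from the SHA-256 digest of content (hypothetical helper)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Truncating keeps names short; this trades some collision resistance for brevity
    return digest[:length] + extension


name_a = hashed_filename("First document")
name_b = hashed_filename("Second document")
```

Identical content always maps to the same name, so repeated runs overwrite rather than duplicate files.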

0.6.1

  • added bare_extraction function returning Python variables
  • improved link discovery in feeds and sitemaps
  • option to preserve image info
  • fixes (many thanks to bug reporters!)

0.6.0

  • link discovery in sitemaps
  • compatibility with Python 3.9
  • extraction coverage improved
  • deduplication now optional
  • bug fixes

0.5.2

  • optional language detector changed: langid → pycld3
  • helper function bare_extraction()
  • optional deduplication off by default
  • better URL handling (courlan), more complete metadata
  • code consolidation (cleaner and shorter)

0.5.1

  • extended and more convenient command-line options
  • output in JSON format
  • bug fixes

0.5.0

  • faster and more robust text and metadata extraction
  • more efficient batch processing (parallel processing, URL queues)
  • extraction and processing of ATOM/RSS feeds
  • complete command-line tool with corresponding options

0.4.1

  • better metadata extraction and integration (XML & XML-TEI)
  • more efficient processing
  • output directory as CLI option

0.4

  • improved "fast" mode (accuracy and speed)
  • better fallbacks with readability-lxml and justext
  • metadata extraction added
  • more robust processing (tests, encoding handling)

0.3.1

  • support for Python 3.4 reactivated
  • bugs in XML output and discarding sections solved
  • new tests and documentation

0.3.0

  • code base re-structured for clarity and readability
  • streamlined HTML processing and conversion
  • internal least-recently-used (LRU) cache for deduplication
  • export as CSV
  • better test coverage, extraction recall and precision
  • further documentation (trafilatura.readthedocs.org)
  • optional processing of text formatting
  • more complete settings file
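
The LRU cache for deduplication listed under 0.3.0 can be pictured with a small sketch. It is an illustration under assumptions, not the package's code: the class name `LRUDeduplicator`, its `is_duplicate` method, and the tiny cache size are all hypothetical.

```python
from collections import OrderedDict


class LRUDeduplicator:
    """Hypothetical helper: remembers the most recently seen text segments."""

    def __init__(self, maxsize: int = 1000) -> None:
        self.maxsize = maxsize
        self._seen = OrderedDict()  # segment -> number of times seen

    def is_duplicate(self, segment: str) -> bool:
        """Record the segment and report whether it was already in the cache."""
        if segment in self._seen:
            self._seen[segment] += 1
            self._seen.move_to_end(segment)  # mark as most recently used
            return True
        self._seen[segment] = 1
        if len(self._seen) > self.maxsize:
            self._seen.popitem(last=False)  # evict the least recently used entry
        return False


# Usage with a cache of two entries: "b" is evicted before it reappears
dedup = LRUDeduplicator(maxsize=2)
results = [dedup.is_duplicate(s) for s in ["a", "b", "a", "c", "b"]]
```

A bounded cache keeps memory use constant on long crawls, at the cost of missing duplicates that recur after their entry has been evicted.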

0.2.1

  • added metadata to the XML output
  • production of valid XML TEI for simple documents

0.2.0

  • better handling of nested elements, quotes and tables
  • validation of XML TEI documents
  • bulk download and processing

0.1.1

  • handling of line breaks
  • element trimming simplified

0.1.0

  • first release used in production and meant to be archived for reproducibility and citability
  • better extraction precision

0.0.5: last version compatible with Python 3.4

  • optional dependencies
  • bugs in parsing removed

0.0.4

  • code profiling and speed-up

0.0.3

  • tables included in extraction
  • bypass justext in arguments
  • better handling of non-p elements

0.0.2

  • better handling of text nodes
  • improvements in extraction recall

0.0.1

  • first release, minimum viable package