feat(cli): --filename-template and --max-length
Introduces two new CLI arguments to allow fine-grained control over how output file paths are generated:

--filename-template: Specify a template string using variables like {domain}, {hash}, {ext} to define a custom directory structure and file naming scheme

--max-length: Set a maximum character limit for generated file paths, intelligently truncating if needed while preserving essential components

Includes documentation updates covering the new options, examples, and troubleshooting.
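
As a rough illustration of how such a template could be expanded, here is a minimal Python sketch. It is not the code from this commit: `expand_template` is a hypothetical helper, the SHA-256 prefix is a stand-in for the project's own content hash, and the `key-value` formatting of `{params}` is an assumption inferred from the expected paths in the tests.

```python
import hashlib
from urllib.parse import parse_qsl, urlsplit


def expand_template(template: str, url: str, content: str, ext: str) -> str:
    """Fill a filename template from URL parts (hypothetical helper)."""
    parts = urlsplit(url)
    # Non-empty path segments, e.g. "/blog/post1" -> ["blog", "post1"]
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: query parameters are sorted and rendered as key-value pairs
    params = "_".join(f"{k}-{v}" for k, v in sorted(parse_qsl(parts.query)))
    values = {
        "domain": parts.netloc,
        "path": "_".join(segments),       # flattened path
        "path_dirs": "/".join(segments),  # path kept as directories
        "params": params,
        # Stand-in hash: the real implementation uses its own function
        "hash": hashlib.sha256(content.encode("utf-8")).hexdigest()[:16],
        "ext": ext,
    }
    # An unknown variable in the template raises KeyError here
    return template.format(**values)


print(expand_template("{domain}/{path_dirs}.{ext}",
                      "https://example.com/blog/post1", "Test content", "md"))
```

The dictionary lookup keeps the set of supported variables explicit, so adding a variable means adding one entry rather than touching the formatting logic.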

Closes adbar#754
AdamQuadmon committed Dec 6, 2024
1 parent 76200b7 commit c44c7b5
Showing 11 changed files with 911 additions and 28 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -66,6 +66,11 @@ the University of Munich.
  - JSON
  - HTML, XML and [XML-TEI](https://tei-c.org/)

- Flexible output file naming:
  - Template-based filename generation with variables like {domain}, {path}, {hash}
  - Path length control and automatic truncation
  - Safe character handling and URL component parsing

- Optional add-ons:
  - Language detection on extracted content
  - Speed optimizations
@@ -74,7 +79,6 @@ the University of Munich.
  - Regular updates, feature additions, and optimizations
  - Comprehensive documentation


### Evaluation and alternatives

Trafilatura consistently outperforms other open-source libraries in text
6 changes: 6 additions & 0 deletions docs/quickstart.rst
@@ -119,7 +119,13 @@ Extraction options are also available on the command-line and they can be combin
    $ < myfile.html trafilatura --json --no-tables

Use ``--filename-template`` to control how output filenames are generated based on the URL and content.

.. code-block:: bash

    $ trafilatura -u "https://example.com/path/dirs" --filename-template "{domain}/{path_dirs}/{hash}.{ext}" --markdown -o output/

This will produce a file named ``example.com/path/dirs/uOHdo6wKo4IK0pkL.md`` in the ``output`` directory.

Further steps
-------------
11 changes: 11 additions & 0 deletions docs/settings.rst
@@ -52,6 +52,17 @@ Using a custom file on the command-line
With the ``--config-file`` option, followed by the file name or path. All the required variables have to be present in the custom file.


Filename Generation
^^^^^^^^^^^^^^^^^^^^^
Two options allow customizing how output filenames are generated:

- ``--filename-template``: template string for generating filenames, using variables such as ``{domain}``, ``{path}``, ``{hash}`` and ``{ext}``. Example: ``--filename-template "{domain}/{hash}.{ext}"``
- ``--max-length``: maximum total path length, including directory components. The default is 250 characters. Example: ``--max-length 200``

The filename template can include directory separators to preserve parts of the original URL's path structure. Unsafe characters in URL-derived values are sanitized automatically. If the total path would exceed ``--max-length``, it is truncated while preserving key components.
Invalid variables, or unsafe characters in the static parts of the template, will raise an error.


Adapting settings in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^

10 changes: 10 additions & 0 deletions docs/troubleshooting.rst
@@ -101,3 +101,13 @@ Download first and extract later
Since they have distinct characteristics it can be useful to separate the infrastructure needed for download from the extraction. Using a custom IP or network infrastructure can also prevent your usual IP from getting banned.

For an approach using files from the Common Crawl and Trafilatura, see the external tool `datatrove/process_common_crawl_dump.py <https://github.com/huggingface/datatrove/blob/main/examples/process_common_crawl_dump.py>`_.


Invalid template variables and filenames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you see an error about invalid template variables, check that your ``--filename-template`` string only uses supported variables like ``{domain}``, ``{hash}``, etc.
Refer to the ``trafilatura/filename.py`` source for a complete list.

An error about unsafe characters in the filename template means that characters like ``<>``, ``:``, ``"`` were used outside of ``{variable}`` sections.
Make sure to only use alphanumeric characters, underscores, dashes and forward slashes in static parts of the template.
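
To check a template before running a large job, a small pre-flight script along these lines may help. This is only a sketch and not part of Trafilatura: `check_template` is a hypothetical helper, the variable list is copied from the CLI documentation, and allowing a dot in static parts is an assumption based on examples such as ``{hash}.{ext}``.

```python
import re
from string import Formatter

# Variables listed in the CLI documentation (assumed to be complete)
SUPPORTED = {"domain", "path", "path_dirs", "params", "hash", "ext", "lang"}
# Assumption: letters, digits, underscore, dash, slash and dot are safe
SAFE_STATIC = re.compile(r"[A-Za-z0-9_\-/.]*")


def check_template(template: str) -> list[str]:
    """Return a list of problems found in a filename template."""
    problems = []
    # Formatter.parse yields (literal_text, field_name, spec, conversion)
    for literal, field, _spec, _conv in Formatter().parse(template):
        if not SAFE_STATIC.fullmatch(literal):
            problems.append(f"unsafe static text: {literal!r}")
        if field is not None and field not in SUPPORTED:
            problems.append(f"unknown variable: {{{field}}}")
    return problems


print(check_template("{domain}/{hash}.{ext}"))
print(check_template("{domain}:{title}"))
```

An empty list means the template passed both checks; the actual rules enforced by the CLI may differ.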
38 changes: 35 additions & 3 deletions docs/usage-cli.rst
@@ -92,6 +92,33 @@ Output as TXT without metadata is the default, another format can be selected in
*HTML output is available from version 1.11, Markdown from version 1.9 onwards.*


Filename Customization
~~~~~~~~~~~~~~~~~~~~~~

Use ``--filename-template`` to control how output filenames are generated based on the URL and content. Supported variables:

- {domain}: Website domain
- {path}: URL path segments, joined by underscores
- {path_dirs}: URL path segments, joined by directory separators
- {params}: URL query parameters
- {hash}: Hash of extracted content
- {ext}: File extension
- {lang}: Identified language

Example: ``--filename-template "{domain}/{hash}.{ext}"``

Use ``--max-length`` to set the maximum total path length, including any directories. It defaults to 250 characters.

If the generated path would exceed this limit, it is truncated according to the following rules:

1. Individual directory and file components are preserved as long as possible.
2. The file component is reduced to a minimum of ``{hash}.{ext}``.
3. The ``--output-dir`` is omitted from length calculations.

Example: ``--max-length 200``

Invalid template variables or unsafe path characters will raise an error.


Optimizing for precision and recall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -166,7 +193,7 @@ Two major command line arguments are necessary here:
.. hint::
Backup of HTML sources can be useful for archival and further processing:

``$ trafilatura --input-file links.txt --output-dir converted/ --backup-dir html-sources/ --xml``


@@ -288,14 +315,15 @@ For all usage instructions see ``trafilatura -h``:
trafilatura [-h] [-i INPUTFILE | --input-dir INPUTDIR | -u URL]
[--parallel PARALLEL] [-b BLACKLIST] [--list]
[-o OUTPUTDIR] [--backup-dir BACKUP_DIR] [--keep-dirs]
[--filename-template FILENAME_TEMPLATE] [--max-length MAX_LENGTH]
[--feed [FEED] | --sitemap [SITEMAP] | --crawl [CRAWL] |
--explore [EXPLORE] | --probe [PROBE]] [--archived]
[--url-filter URL_FILTER [URL_FILTER ...]] [-f]
[--formatting] [--links] [--images] [--no-comments]
[--no-tables] [--only-with-metadata] [--with-metadata]
[--target-language TARGET_LANGUAGE] [--deduplicate]
[--config-file CONFIG_FILE] [--precision] [--recall]
[--output-format {csv,json,html,markdown,txt,xml,xmltei} |
[--output-format {csv,json,html,markdown,txt,xml,xmltei} |
--csv | --html | --json | --markdown | --xml | --xmltei]
[--validate-tei] [-v] [--version]
@@ -331,6 +359,11 @@ Output:
preserve a copy of downloaded files in a backup
directory
--keep-dirs keep input directory structure and file names
--filename-template FILENAME_TEMPLATE
template for generating filenames (e.g.
{domain}/{path}-{hash}.{ext})
--max-length MAX_LENGTH
maximum length for generated file paths
Navigation:
Link discovery and web crawling
@@ -381,4 +414,3 @@ Format:
--xml shorthand for XML output
--xmltei shorthand for XML TEI output
--validate-tei validate XML TEI output
64 changes: 64 additions & 0 deletions tests/cli_tests.py
@@ -21,6 +21,7 @@

from trafilatura import cli, cli_utils, spider, settings
from trafilatura.downloads import add_to_compressed_dict, fetch_url
from trafilatura.filename import generate_hash_filename
from trafilatura.utils import LANGID_FLAG

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
@@ -586,6 +587,67 @@ def test_probing():
else:
assert f.getvalue().strip() == url

def test_filename_template_cli_integration():
    """Test CLI integration with FilenameTemplate."""
    # Test hierarchical structure with no extension
    testargs = ["", "--filename-template", "{domain}/{path_dirs}", "--output-dir", "/tmp/test", "-u", "https://example.com/blog/post1"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path, destination_dir = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 1")
    assert destination_dir == "/tmp/test/example.com/blog"
    assert output_path == "/tmp/test/example.com/blog/post1"

    # Test with markdown extension
    testargs = ["", "--filename-template", "{domain}/{path_dirs}.{ext}", "--output-dir", "/tmp/test", "--markdown", "-u", "https://example.com/blog/post1"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path2, destination_dir2 = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 1")
    assert destination_dir2 == "/tmp/test/example.com/blog"
    assert output_path2 == "/tmp/test/example.com/blog/post1.md"

    # Test flattened structure
    testargs = ["", "--filename-template", "{domain}/{path}", "--output-dir", "/tmp/test", "-u", "https://example.com/articles/tech/news"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path3, destination_dir3 = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 2")
    assert destination_dir3 == "/tmp/test/example.com"
    assert output_path3 == "/tmp/test/example.com/articles_tech_news"

    # Test with parameters
    testargs = ["", "--filename-template", "{domain}/{path_dirs}/{hash}-{params}", "--output-dir", "/tmp/test", "-u", "https://example.com/articles/tech?id=123&cat=news"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path4, destination_dir4 = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 3")
    assert destination_dir4 == "/tmp/test/example.com/articles/tech"
    assert output_path4 == f"/tmp/test/example.com/articles/tech/{generate_hash_filename('Test content 3')}-cat-news_id-123"


@pytest.mark.usefixtures("caplog")
def test_filename_template_cli_errors(caplog):
    """Test error handling in CLI filename template integration."""
    # Test URL too long
    testargs = ["", "--filename-template", "{domain}/{path_dirs}", "--output-dir", "/tmp/test", "-u", "https://example.com/" + "a" * 100, "--max-length", "100"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path, destination_dir = cli_utils.determine_output_path(args=args, orig_filename="", content="test content")
    assert "_ttt_" in output_path
    assert destination_dir == "/tmp/test/example.com"
    assert generate_hash_filename("test content") in output_path

    # Test no URL
    testargs = ["", "--filename-template", "{domain}/{path}", "--output-dir", "/tmp/test"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    caplog.set_level(logging.WARNING)
    output_path2, destination_dir2 = cli_utils.determine_output_path(args=args, orig_filename="", content="test content")
    assert "Template generation failed: URL is required for template variables" in caplog.text
    assert output_path2 == "/tmp/test"
    assert generate_hash_filename("test content") in destination_dir2

if __name__ == "__main__":
    test_parser()
@@ -599,3 +661,5 @@ def test_probing():
    test_crawling()
    test_download()
    test_probing()
    test_filename_template_cli_integration()
    test_filename_template_cli_errors()
2 changes: 1 addition & 1 deletion tests/deduplication_tests.py
@@ -8,10 +8,10 @@
import trafilatura.deduplication

from trafilatura import extract
from trafilatura.cli_utils import generate_hash_filename
from trafilatura.core import Extractor
from trafilatura.deduplication import (LRUCache, Simhash, content_fingerprint,
duplicate_test)
from trafilatura.filename import generate_hash_filename


DEFAULT_OPTIONS = Extractor()