From refs/heads/main 2ee0958
hynky1999 committed Nov 19, 2023
0 parents commit 1712697
Showing 162 changed files with 18,504 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 2921a4c5ce816b9792230abd536acabd
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Empty file.
3 changes: 3 additions & 0 deletions README.md
@@ -0,0 +1,3 @@
# GitHub Pages

Last update of sphinx html documentation from [2ee0958](https://github.com/hynky1999/CmonCrawl/tree/2ee095824d2cebe0b205af0b77afdb243a0d1aab)
Binary file added _images/when_to_use.drawio.png
14 changes: 14 additions & 0 deletions _sources/api.rst.txt
@@ -0,0 +1,14 @@
API
===

.. autosummary::
:recursive:
:toctree: generated


cmoncrawl





48 changes: 48 additions & 0 deletions _sources/cli/cli.rst.txt
@@ -0,0 +1,48 @@
.. _cli:

Command Line Interface
======================

The command line interface is a simple wrapper around the library.

It provides two main functionalities:

* ``download`` - Downloads samples of either :ref:`domain_record` or HTML from Common Crawl indexes.
* ``extract`` - Downloads the HTML referenced by a domain record and extracts its content. It can also take HTML files directly and extract data from them.

Both functionalities are invoked using ``cmon`` followed by the functionality and the required arguments.
The ``cmon`` command also takes a few optional arguments:

--verbosity
Verbosity level. Choices are [0, 1, 2], with 0 being the least verbose and 2 being the most verbose. Default is 1.

--aws_profile
AWS profile to use for AWS calls (Athena, S3). If not provided, the default AWS profile will be used.

Examples
--------

.. code-block:: bash

    # Download the first 1000 domain records for example.com
    cmon download --match_type=domain --limit=1000 dr_output record example.com
    # Download the first 100 HTML files for example.com
    cmon download --match_type=domain --limit=100 html_output html example.com
    # Take the domain records downloaded by the first command and extract them using your extractors
    cmon extract config.json extracted_output record dr_output/*.jsonl
    # Take the HTML files downloaded by the second command and extract them using your extractors
    cmon extract config.json extracted_output html html_output/*.html
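
When ``cmon`` is driven from a Python script, it can help to assemble the command as an argument list rather than a single string. The following is only a sketch: ``build_download_command`` is a hypothetical helper, not part of cmoncrawl, and it uses only the flags and argument order shown in the examples above.

.. code-block:: python

    import shlex


    def build_download_command(output, mode, urls, match_type="domain", limit=None):
        """Assemble a ``cmon download`` argument list (hypothetical helper)."""
        if mode not in ("record", "html"):
            raise ValueError("mode must be 'record' or 'html'")
        cmd = ["cmon", "download", f"--match_type={match_type}"]
        if limit is not None:
            cmd.append(f"--limit={limit}")
        # Positional arguments follow in the documented order: output, mode, urls.
        cmd += [output, mode, *urls]
        return cmd


    cmd = build_download_command("dr_output", "record", ["example.com"], limit=1000)
    # Print the shell-quoted form for inspection.
    print(shlex.join(cmd))

The resulting list can be passed to ``subprocess.run(cmd, check=True)`` on a machine where ``cmon`` is installed.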
102 changes: 102 additions & 0 deletions _sources/cli/download.rst.txt
@@ -0,0 +1,102 @@
Command Line Download
=====================

The download mode of the ``cmon`` command line tool serves to query and download from CommonCrawl indexes.
The following arguments are needed in this order:

Positional arguments
--------------------

1. output - Path to output directory.

2. {record,html} - Download mode:

- record: Download record files from Common Crawl.
- html: Download HTML files from Common Crawl.

3. urls - URLs to download, e.g. www.bcc.cz.


In html mode, the output directory will contain ``.html`` files, one
for each URL found. In record mode, the output directory will contain
``.jsonl`` files, each containing multiple domain records in JSON format.


Options
-------

--limit LIMIT
Max number of URLs to download.

--since SINCE
Start date in ISO format (e.g., 2020-01-01).

--to TO
End date in ISO format (e.g., 2020-01-01).

--cc_server CC_SERVER
Common Crawl indexes to query. Must provide the whole URL (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index).

--max_retry MAX_RETRY
Max number of retries for a request. Increase this number when requests are failing.

--sleep_base SLEEP_BASE
Base sleep time for exponential backoff in case of request failure.

--match_type MATCH_TYPE
Match type for the URL. One of exact, prefix, host, domain.
Refer to the CDX API documentation and :py:class:`cmoncrawl.common.types.MatchType` for more information.

--max_directory_size MAX_DIRECTORY_SIZE
Max number of files per directory.

--filter_non_200
Filter out results with a non-200 status code.

--aggregator AGGREGATOR
Aggregator to use for the query.

- athena: Athena aggregator. Fastest, but requires AWS credentials with correct permissions. See :ref:`misc/athena:Athena` for more information.
- gateway: Gateway aggregator (default). Very slow, but no need for AWS config.

--s3_bucket S3_BUCKET
S3 bucket to use for Athena aggregator. Only needed if using Athena aggregator.

- If set, the bucket will not be deleted after the query is done, so it can be reused for future queries.
- If not set, a temporary bucket will be created and deleted after the query is done.

.. note::
If you specify an S3 bucket, remember to delete it manually after you're done to avoid incurring unnecessary costs.
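
The four ``--match_type`` values listed above can be pictured with a simplified matcher. This is only an illustration of the usual CDX API semantics, assuming scheme-insensitive matching and ignoring query strings; ``url_matches`` is a hypothetical function and is not how cmoncrawl or the index server implements matching.

.. code-block:: python

    from urllib.parse import urlsplit


    def _split(url):
        """Return (host, path) for a URL, tolerating a missing scheme."""
        parts = urlsplit(url if "://" in url else "http://" + url)
        return (parts.hostname or "").lower(), parts.path.rstrip("/")


    def url_matches(match_type, pattern, url):
        """Simplified illustration of the exact/prefix/host/domain match types."""
        p_host, p_path = _split(pattern)
        u_host, u_path = _split(url)
        if match_type == "exact":
            return (u_host, u_path) == (p_host, p_path)
        if match_type == "prefix":
            return u_host == p_host and u_path.startswith(p_path)
        if match_type == "host":
            return u_host == p_host
        if match_type == "domain":
            # The host itself or any of its subdomains.
            return u_host == p_host or u_host.endswith("." + p_host)
        raise ValueError(f"unknown match type: {match_type}")


    print(url_matches("domain", "example.com", "https://blog.example.com/post"))
    print(url_matches("host", "example.com", "https://blog.example.com/post"))

Under these assumptions, ``domain`` is the broadest match type and ``exact`` the narrowest.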


Record mode options
-------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of domain records per output file.
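
To get a feel for how ``--max_crawls_per_file`` and ``--max_directory_size`` interact, the arithmetic below estimates how many files and directories a download would produce. It is a rough sketch that assumes every file and directory is filled to its limit before a new one is started; the tool's actual layout may differ, and ``estimate_layout`` is a hypothetical helper.

.. code-block:: python

    import math


    def estimate_layout(n_records, max_crawls_per_file, max_directory_size):
        """Estimate (files, directories) needed for n_records domain records."""
        n_files = math.ceil(n_records / max_crawls_per_file)
        n_dirs = math.ceil(n_files / max_directory_size)
        return n_files, n_dirs


    # 2500 records, 1000 records per file, at most 2 files per directory.
    print(estimate_layout(2500, 1000, 2))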

HTML mode options
-----------------

--encoding ENCODING
Force usage of specified encoding if possible.

--download_method DOWNLOAD_METHOD
Method for downloading WARC files from Common Crawl. It applies only to HTML download.

- api: Download from Common Crawl API Gateway. This is the default option.
- s3: Download from Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with correct permissions.


Examples
--------


.. code-block:: bash

    # Download the first 1000 domain records for example.com
    cmon download dr_output record --match_type=domain --limit=1000 example.com
    # Download the first 100 HTML files for example.com
    cmon download html_output html --match_type=domain --limit=100 example.com
87 changes: 87 additions & 0 deletions _sources/cli/extract.rst.txt
@@ -0,0 +1,87 @@
Command Line Extract
====================

The extract mode of the ``cmon`` command line tool serves to extract data from your downloaded files.
The following arguments are needed in this order:

Positional arguments
--------------------

1. config_path - Path to the config file containing extraction rules.

2. output_path - Path to the output directory.

3. {record,html} - Extraction mode:

- record: Extract data from jsonl (domain record) files.
- html: Extract data from HTML files.

4. files - Files to extract data from (Either HTML files or .jsonl files).

To create a config file, see :ref:`extractor_config`.

Both modes yield the same output format: a ``.jsonl`` file containing the extracted data,
one extraction per line. For each input file, a new directory named after the file is created
in the output directory.
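
Because the output is plain ``.jsonl``, the extracted data can be read back with the standard library alone. The sketch below walks an output directory recursively and parses every line. The directory layout and the ``title`` field in the demo are made up for illustration; the real fields depend on your extractors.

.. code-block:: python

    import json
    import tempfile
    from pathlib import Path


    def iter_extractions(output_dir):
        """Yield one parsed JSON object per non-empty line of every .jsonl file."""
        for path in sorted(Path(output_dir).rglob("*.jsonl")):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    if line.strip():
                        yield json.loads(line)


    # Demo on a throwaway directory with hypothetical content.
    with tempfile.TemporaryDirectory() as tmp:
        sub = Path(tmp) / "page_one"
        sub.mkdir()
        (sub / "0_file.jsonl").write_text(
            '{"title": "A"}\n{"title": "B"}\n', encoding="utf-8"
        )
        records = list(iter_extractions(tmp))

    print(len(records))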

The files created by the download mode can be directly used with the appropriate mode
in the extraction.

- If you have an HTML file, you can use the html mode to extract it.
- If you have domain records, you can use the record mode to extract them.
- If you have domain records that you acquired without using cmoncrawl,
  please refer to :ref:`domain_record_jsonl`, which describes how to create ``.jsonl`` files from your domain records,
  which you can then use with the record mode.

Optional arguments
------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of extractions per file output.

--max_directory_size MAX_DIRECTORY_SIZE
Max number of extraction files per directory.

--n_proc N_PROC
Number of processes to use for extraction. Parallelization is at the file level,
so using more than one process for a single file brings no benefit.

Record arguments
----------------

--max_retry MAX_RETRY
Max number of WARC download attempts.

--download_method DOWNLOAD_METHOD
Method for downloading WARC files from Common Crawl. It applies only to HTML download.

- api: Download from Common Crawl API Gateway. This is the default option.
- s3: Download from Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with correct permissions.

--sleep_base SLEEP_BASE
Base value for exponential backoff between failed requests.
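
As a rough picture of how ``--max_retry`` and ``--sleep_base`` relate, the sketch below lists the waits produced by one common form of exponential backoff, ``sleep_base ** attempt``. This formula is an assumption made for illustration; the schedule cmoncrawl actually uses (for example with jitter or a cap) may differ, and ``backoff_delays`` is a hypothetical helper.

.. code-block:: python

    def backoff_delays(sleep_base, max_retry):
        """Return the wait in seconds before each retry, growing exponentially."""
        return [sleep_base ** attempt for attempt in range(max_retry)]


    # Four retries with a base of 2.0 seconds.
    print(backoff_delays(2.0, 4))

With a base greater than 1 the waits grow quickly, so a large retry limit mostly matters when the base is small.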

Html arguments
--------------

--date DATE
Date of extraction of HTML files in ISO format (e.g., 2021-01-01). The default is today.

--url URL
URL from which the HTML files were downloaded. By default, it will try to infer from the file content.

Examples
--------

.. code-block:: bash

    # Take the domain records downloaded by the first command and extract them using your extractors
    cmon extract config.json extracted_output record --max_retry 100 --download_method=api --sleep_base 1.3 dr_output/*.jsonl
    # Take the HTML files downloaded by the second command and extract them using your extractors
    cmon extract config.json extracted_output html --date 2021-01-01 --url https://www.example.com html_output/*.html

When building extractors, you will appreciate being able to specify
the URL of the HTML file and the date of the extraction, because
this information is used during extractor routing.
12 changes: 12 additions & 0 deletions _sources/cli/index.rst.txt
@@ -0,0 +1,12 @@
Command Line Interface
======================

.. toctree::
:maxdepth: 2
:caption: Contents:

cli
download
extract

