Commit

From refs/pull/103/merge e54a92f
hynky1999 committed Feb 14, 2024
0 parents commit 2d5cbc0
Showing 162 changed files with 18,516 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 2921a4c5ce816b9792230abd536acabd
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Empty file.
3 changes: 3 additions & 0 deletions README.md
@@ -0,0 +1,3 @@
# GitHub Pages

Last update of sphinx html documentation from [e54a92f](https://github.com/hynky1999/CmonCrawl/tree/e54a92f01e3b4b491b1ff54b4560467f57a318b0)
Binary file added _images/when_to_use.drawio.png
14 changes: 14 additions & 0 deletions _sources/api.rst.txt
@@ -0,0 +1,14 @@
API
===

.. autosummary::
:recursive:
:toctree: generated


cmoncrawl





48 changes: 48 additions & 0 deletions _sources/cli/cli.rst.txt
@@ -0,0 +1,48 @@
.. _cli:

Command Line Interface
======================

The command line interface is a simple wrapper around the library.

It provides two main functionalities:

* `download` - Downloads samples of either :ref:`domain_record` or HTML from Common Crawl indexes.
* `extract` - Downloads the HTML for each Domain Record and extracts its content. It can also take HTML files directly and extract data from them.

Both functionalities are invoked with ``cmon`` followed by the subcommand and its required arguments.
The ``cmon`` command also takes a few optional arguments:

--verbosity
Verbosity level. Choices are [0, 1, 2], with 0 being the least verbose and 2 being the most verbose. Default is 1.

--aws_profile
AWS profile to use for AWS calls (Athena, S3). If not provided, the default AWS profile will be used.

Examples
--------

.. code-block:: bash
# Download the first 1000 domain records for example.com
cmon download --match_type=domain --limit=1000 dr_output record example.com
# Download the first 100 HTML files for example.com
cmon download --match_type=domain --limit=100 html_output html example.com
# Take the domain records downloaded with the first command and extract them using your extractors
cmon extract config.json extracted_output dr_output/*.jsonl record
# Take the HTML files downloaded with the second command and extract them using your extractors
cmon extract config.json extracted_output html_output/*.html html
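
The global options described above belong to the top-level ``cmon`` command, so they are passed
before the subcommand. A short sketch (the ``my-profile`` AWS profile name is only a placeholder):

.. code-block:: bash
# Run a download with maximum verbosity, using a named AWS profile for AWS calls
cmon --verbosity 2 --aws_profile my-profile download --match_type=domain --limit=1000 dr_output record example.com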
105 changes: 105 additions & 0 deletions _sources/cli/download.rst.txt
@@ -0,0 +1,105 @@
Command Line Download
=====================

The download mode of the ``cmon`` command line tool serves to query Common Crawl indexes and download the results.
The following arguments are needed in this order:

Positional arguments
--------------------

1. output - Path to output directory.

2. {record,html} - Download mode:

- record: Download record files from Common Crawl.
- html: Download HTML files from Common Crawl.

3. urls - URLs to download, e.g. www.bcc.cz.


In html mode, the output directory will contain ``.html`` files, one
for each found URL. In record mode, the output directory will contain
``.jsonl`` files, each containing multiple domain records in JSON format.
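
For illustration, assuming the ``dr_output`` directory from the examples below and that ``jq`` is
installed, the downloaded records can be inspected with standard shell tools:

.. code-block:: bash
# Count the total number of downloaded domain records
cat dr_output/*.jsonl | wc -l
# Pretty-print the first record of each output file
head -q -n 1 dr_output/*.jsonl | jq .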


Options
-------

--limit LIMIT
Max number of URLs to download.

--since SINCE
Start date in ISO format (e.g., 2020-01-01).

--to TO
End date in ISO format (e.g., 2020-01-01).

--cc_server CC_SERVER
Common Crawl indexes to query. Must provide the whole URL (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index).

--max_retry MAX_RETRY
Max number of retries for a request. Increase this number when requests are failing.

--sleep_base SLEEP_BASE
Base sleep time for exponential backoff in case of request failure.

--max_requests_per_second MAX_REQUESTS_PER_SECOND
Max number of requests per second.

--match_type MATCH_TYPE
One of ``exact``, ``prefix``, ``host``, ``domain``.
Match type for the URL; refer to the CDX API documentation and :py:class:`cmoncrawl.common.types.MatchType` for details.

--max_directory_size MAX_DIRECTORY_SIZE
Max number of files per directory.

--filter_non_200
Filter out responses with a non-200 status code.

--aggregator AGGREGATOR
Aggregator to use for the query.

- athena: Athena aggregator. Fastest, but requires AWS credentials with correct permissions. See :ref:`misc/athena:Athena` for more information.
- gateway: Gateway aggregator (default). Very slow, but no need for AWS config.

--s3_bucket S3_BUCKET
S3 bucket to use with the Athena aggregator. Only relevant when using the Athena aggregator.

- If set, the bucket will not be deleted after the query finishes, allowing you to reuse it for future queries.
- If not set, a temporary bucket will be created and deleted after the query finishes.

.. note::
If you specify an S3 bucket, remember to delete it manually after you're done to avoid incurring unnecessary costs.
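
For example, to query through Athena and keep the results bucket for later reuse (a sketch only;
``my-cmoncrawl-bucket`` is a placeholder and AWS credentials with the required permissions are assumed):

.. code-block:: bash
# Query the index via Athena and reuse an existing S3 bucket for results
cmon download dr_output record --aggregator=athena --s3_bucket=my-cmoncrawl-bucket --match_type=domain --limit=1000 example.com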


Record mode options
-------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of domain records per output file.

HTML mode options
-----------------

--encoding ENCODING
Force usage of specified encoding if possible.

--download_method DOWNLOAD_METHOD
Method for downloading WARC files from Common Crawl; it only applies to HTML download.

- api: Download from Common Crawl API Gateway. This is the default option.
- s3: Download from Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with correct permissions.


Examples
--------


.. code-block:: bash
# Download the first 1000 domain records for example.com
cmon download dr_output record --match_type=domain --limit=1000 example.com
# Download the first 100 HTML files for example.com
cmon download html_output html --match_type=domain --limit=100 example.com
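
The date and status filters documented above can be combined with any of these examples; for
instance (a sketch using only options described on this page):

.. code-block:: bash
# Download up to 500 domain records for example.com crawled during 2021, keeping only 200 responses
cmon download dr_output record --match_type=domain --limit=500 --since=2021-01-01 --to=2021-12-31 --filter_non_200 example.com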
90 changes: 90 additions & 0 deletions _sources/cli/extract.rst.txt
@@ -0,0 +1,90 @@
Command line Extract
====================

The extract mode of the ``cmon`` command line tool serves to extract data from your downloaded files.
The following arguments are needed in this order:

Positional arguments
--------------------

1. config_path - Path to the config file containing extraction rules.

2. output_path - Path to the output directory.

3. {record,html} - Extraction mode:

- record: Extract data from jsonl (domain record) files.
- html: Extract data from HTML files.

4. files - Files to extract data from (either HTML files or ``.jsonl`` files).

To create a config file, see :ref:`extractor_config`.

Both modes yield the same output format: ``.jsonl`` files containing the extracted data,
one record per line. For each input file, a new directory named after that file is created
in the output directory.

The files created by the download mode can be used directly with the appropriate extraction mode.

- If you have HTML files, use the html mode to extract them.
- If you have domain records, use the record mode to extract them.
- If you have domain records acquired without using cmoncrawl, refer to :ref:`domain_record_jsonl`,
  which describes how to create ``.jsonl`` files from them that you can then use with the record mode.

Optional arguments
------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of extractions per output file.

--max_directory_size MAX_DIRECTORY_SIZE
Max number of extraction files per directory.

--n_proc N_PROC
Number of processes to use for extraction. Parallelization happens at the file level,
so using more than one process for a single file brings no benefit; see the example below.
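
For illustration, assuming a directory of ``.jsonl`` files produced by the download mode,
extraction can be spread over several worker processes (the process count is arbitrary):

.. code-block:: bash
# Extract many record files in parallel using 4 processes
cmon extract config.json extracted_output dr_output/*.jsonl record --n_proc 4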

Record arguments
----------------

--max_retry MAX_RETRY
Max number of WARC download attempts.

--download_method DOWNLOAD_METHOD
Method for downloading WARC files from Common Crawl; it only applies to HTML download.

- api: Download from Common Crawl API Gateway. This is the default option.
- s3: Download from Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with correct permissions.

--sleep_base SLEEP_BASE
Base value for exponential backoff between failed requests.

--max_requests_per_second MAX_REQUESTS_PER_SECOND
Max number of requests per second.

Html arguments
--------------

--date DATE
Date of extraction of HTML files in ISO format (e.g., 2021-01-01). The default is today.

--url URL
URL from which the HTML files were downloaded. By default, it will try to infer from the file content.

Examples
--------

.. code-block:: bash
# Take the domain records downloaded with the first command and extract them using your extractors
cmon extract config.json extracted_output dr_output/*.jsonl record --max_retry 100 --download_method=api --sleep_base 1.3
# Take the HTML files downloaded with the second command and extract them using your extractors
cmon extract config.json extracted_output html_output/*.html html --date 2021-01-01 --url https://www.example.com

When you build your extractors, you will appreciate being able to specify the URL and the
extraction date of an HTML file, because this information is used during extractor routing.
12 changes: 12 additions & 0 deletions _sources/cli/index.rst.txt
@@ -0,0 +1,12 @@
Command Line Interface
======================

.. toctree::
:maxdepth: 2
:caption: Contents:

cli
download
extract

