Commit 2d5cbc0
Showing 162 changed files with 18,516 additions and 0 deletions.
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 2921a4c5ce816b9792230abd536acabd
tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -0,0 +1,3 @@ | ||
#GitHub Pages | ||
|
||
Last update of sphinx html documentation from [e54a92f](https://github.com/hynky1999/CmonCrawl/tree/e54a92f01e3b4b491b1ff54b4560467f57a318b0) |
API
===

.. autosummary::
   :recursive:
   :toctree: generated

   cmoncrawl
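The ``autosummary`` directive above only produces the ``generated`` stub pages when the extension is enabled in the Sphinx configuration. A sketch of a typical ``conf.py`` fragment follows; this is an assumption for illustration, as the project's actual configuration is not shown in this commit:

```python
# conf.py -- hypothetical fragment; the project's real settings may differ
extensions = [
    "sphinx.ext.autosummary",
    "sphinx.ext.autodoc",  # autosummary builds on top of autodoc
]
autosummary_generate = True  # generate stub pages for the :toctree: target
```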
.. _cli:

Command Line Interface
======================

The command line interface is a simple wrapper around the library.

It provides two main functionalities:

* ``download`` - Downloads samples of either :ref:`domain_record` or HTML from Common Crawl indexes.
* ``extract`` - Downloads the HTML for a domain record and extracts its content. It can also take HTML directly and extract the data from it.

Both functionalities are invoked using ``cmon`` followed by the functionality name and its required arguments.
The ``cmon`` command also takes a few optional arguments:

--verbosity
    Verbosity level. Choices are [0, 1, 2], with 0 being the least verbose and 2 being the most verbose. Default is 1.

--aws_profile
    AWS profile to use for AWS calls (Athena, S3). If not provided, the default AWS profile will be used.

Examples
--------

.. code-block:: bash

    # Download the first 1000 domain records for example.com
    cmon download --match_type=domain --limit=1000 dr_output record example.com
    # Download the first 100 HTML files for example.com
    cmon download --match_type=domain --limit=100 html_output html example.com
    # Take the domain records downloaded by the first command and extract them using your extractors
    cmon extract config.json extracted_output dr_output/*.jsonl record
    # Take the HTML files downloaded by the second command and extract them using your extractors
    cmon extract config.json extracted_output html_output/*.html html
Command Line Download
=====================

The download mode of the ``cmon`` command line tool serves to query and download from Common Crawl indexes.
The following arguments are needed in this order:

Positional arguments
--------------------

1. output - Path to the output directory.

2. {record,html} - Download mode:

   - record: Download record files from Common Crawl.
   - html: Download HTML files from Common Crawl.

3. urls - URLs to download, e.g. www.bcc.cz.

In html mode, the output directory will contain ``.html`` files, one
for each URL found. In record mode, the output directory will contain
``.jsonl`` files, each containing multiple domain records in JSON format.
Options
-------

--limit LIMIT
    Max number of URLs to download.

--since SINCE
    Start date in ISO format (e.g., 2020-01-01).

--to TO
    End date in ISO format (e.g., 2020-01-01).

--cc_server CC_SERVER
    Common Crawl indexes to query. Must be a full URL (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index).

--max_retry MAX_RETRY
    Max number of retries per request. Increase this number when requests are failing.

--sleep_base SLEEP_BASE
    Base sleep time for exponential backoff in case of request failure.

--max_requests_per_second MAX_REQUESTS_PER_SECOND
    Max number of requests per second.

--match_type MATCH_TYPE
    Match type for the URL; one of exact, prefix, host, domain.
    Refer to cdx-api for more information.
    See :py:class:`cmoncrawl.common.types.MatchType` for more information.

--max_directory_size MAX_DIRECTORY_SIZE
    Max number of files per directory.

--filter_non_200
    Filter out responses with non-200 status codes.

--aggregator AGGREGATOR
    Aggregator to use for the query.

    - athena: Athena aggregator. Fastest, but requires AWS credentials with the correct permissions. See :ref:`misc/athena:Athena` for more information.
    - gateway: Gateway aggregator (default). Very slow, but requires no AWS configuration.

--s3_bucket S3_BUCKET
    S3 bucket to use for the Athena aggregator. Only needed when using the Athena aggregator.

    - If set, the bucket will not be deleted after the query is done, allowing you to reuse it for future queries.
    - If not set, a temporary bucket will be created and deleted after the query is done.

.. note::
    If you specify an S3 bucket, remember to delete it manually once you are done with it to avoid incurring unnecessary costs.

Record mode options
-------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
    Max number of domain records per output file.

HTML mode options
-----------------

--encoding ENCODING
    Force usage of the specified encoding, if possible.

--download_method DOWNLOAD_METHOD
    Method for downloading WARC files from Common Crawl; only applies to HTML download.

    - api: Download from the Common Crawl API Gateway. This is the default option.
    - s3: Download from the Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with the correct permissions.
Examples
--------

.. code-block:: bash

    # Download the first 1000 domain records for example.com
    cmon download dr_output record --match_type=domain --limit=1000 example.com
    # Download the first 100 HTML files for example.com
    cmon download html_output html --match_type=domain --limit=100 example.com
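The four ``--match_type`` values follow the usual cdx-api semantics: ``exact`` matches the URL verbatim, ``prefix`` matches any URL starting with the query, ``host`` matches any URL on the same host, and ``domain`` additionally matches subdomains. A rough Python sketch of those semantics (the function names are illustrative only, not part of cmoncrawl):

```python
from urllib.parse import urlparse

def host_of(url: str) -> str:
    """Return the lowercased hostname, tolerating bare hosts like 'example.com'."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    return (parsed.hostname or "").lower()

def matches(query: str, candidate: str, match_type: str) -> bool:
    """Illustrative semantics of the exact/prefix/host/domain match types."""
    if match_type == "exact":
        return candidate == query
    if match_type == "prefix":
        return candidate.startswith(query)
    q_host, c_host = host_of(query), host_of(candidate)
    if match_type == "host":
        return c_host == q_host
    if match_type == "domain":
        # the host itself, or any subdomain of it
        return c_host == q_host or c_host.endswith("." + q_host)
    raise ValueError(f"unknown match type: {match_type}")
```

This is only a mental model for choosing a match type; the authoritative behavior is the cdx-api and :py:class:`cmoncrawl.common.types.MatchType`.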
Command Line Extract
====================

The extract mode of the ``cmon`` command line tool serves to extract data from your downloaded files.
The following arguments are needed in this order:

Positional arguments
--------------------

1. config_path - Path to the config file containing the extraction rules.

2. output_path - Path to the output directory.

3. {record,html} - Extraction mode:

   - record: Extract data from jsonl (domain record) files.
   - html: Extract data from HTML files.

4. files - Files to extract data from (either HTML files or ``.jsonl`` files).

To create a config file, see :ref:`extractor_config`.

Both modes yield the same output format: a ``.jsonl`` file containing the extracted data,
one JSON object per line. For each input file, a new directory is created in the output
directory, named after the file.

The files created by the download mode can be used directly with the appropriate mode
during extraction:

- If you have HTML files, use the html mode to extract them.
- If you have domain records, use the record mode to extract them.
- If you have domain records that you acquired without using cmoncrawl,
  please refer to :ref:`domain_record_jsonl`, which describes how to create ``.jsonl`` files
  from your domain records, which you can then use with the record mode.
Optional arguments
------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
    Max number of extractions per output file.

--max_directory_size MAX_DIRECTORY_SIZE
    Max number of extraction files per directory.

--n_proc N_PROC
    Number of processes to use for extraction. The parallelization is at the file level,
    so for a single file there is no benefit to using more than one process.

Record arguments
----------------

--max_retry MAX_RETRY
    Max number of WARC download attempts.

--download_method DOWNLOAD_METHOD
    Method for downloading WARC files from Common Crawl; only applies to HTML download.

    - api: Download from the Common Crawl API Gateway. This is the default option.
    - s3: Download from the Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with the correct permissions.

--sleep_base SLEEP_BASE
    Base value for exponential backoff between failed requests.

--max_requests_per_second MAX_REQUESTS_PER_SECOND
    Max number of requests per second.

Html arguments
--------------

--date DATE
    Date of extraction of the HTML files in ISO format (e.g., 2021-01-01). The default is today.

--url URL
    URL from which the HTML files were downloaded. By default, it is inferred from the file content.
Examples
--------

.. code-block:: bash

    # Take the domain records downloaded by the first command and extract them using your extractors
    cmon extract config.json extracted_output dr_output/*.jsonl record --max_retry 100 --download_method=gateway --sleep_base 1.3
    # Take the HTML files downloaded by the second command and extract them using your extractors
    cmon extract config.json extracted_output html_output/*.html html --date 2021-01-01 --url https://www.example.com

When you build your extractors, you will appreciate being able to specify the URL of
the HTML file and the date of the extraction, because this information is used during
extractor routing.
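The ``--max_retry`` and ``--sleep_base`` options interact through exponential backoff: the wait between failed attempts grows from the base value until the retry budget is exhausted. The sketch below shows the typical shape of such a schedule; it is a generic illustration with a hypothetical jitter term, not cmoncrawl's exact formula:

```python
import random

def backoff_delays(max_retry: int, sleep_base: float) -> list[float]:
    """Generic exponential-backoff schedule: sleep_base * 2**attempt per failure,
    plus small random jitter so concurrent clients do not retry in lockstep."""
    delays = []
    for attempt in range(max_retry):
        delays.append(sleep_base * (2 ** attempt) + random.uniform(0, 0.1))
    return delays
```

With ``--sleep_base 1.3`` and ``--max_retry 100`` as in the example above, early retries are cheap while persistent failures back off aggressively, which is why raising these values helps when the Common Crawl endpoints are under load.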
Command Line Interface
======================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   cli
   download
   extract