Merge pull request #67 from hynky1999/docs
docs
hynky1999 authored May 12, 2023
2 parents 63d4cce + 5a066a3 commit 369eed6
Showing 588 changed files with 118,832 additions and 0 deletions.
Binary file added docs/build/doctrees/api.doctree
Binary file added docs/build/doctrees/cli/cli.doctree
Binary file added docs/build/doctrees/cli/download.doctree
Binary file added docs/build/doctrees/cli/extract.doctree
Binary file added docs/build/doctrees/cli/index.doctree
Binary file added docs/build/doctrees/environment.pickle
Binary file added docs/build/doctrees/extraction/index.doctree
Binary file added docs/build/doctrees/extraction/utils.doctree
Binary file added docs/build/doctrees/generated/cmoncrawl.doctree
Binary file added docs/build/doctrees/index.doctree
Binary file added docs/build/doctrees/misc/domain_record.doctree
Binary file added docs/build/doctrees/misc/index.doctree
Binary file added docs/build/doctrees/prog_guide/index.doctree
Binary file added docs/build/doctrees/prog_guide/overview.doctree
Binary file added docs/build/doctrees/prog_guide/pip.doctree
Binary file added docs/build/doctrees/usage.doctree
4 changes: 4 additions & 0 deletions docs/build/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: aa56b11fc3400b25742e9c52f456c98e
tags: 645f666f9bcd5a90fca523b33c5a78b7
14 changes: 14 additions & 0 deletions docs/build/html/_sources/api.rst.txt
@@ -0,0 +1,14 @@
API
===

.. autosummary::
   :recursive:
   :toctree: generated

   cmoncrawl





41 changes: 41 additions & 0 deletions docs/build/html/_sources/cli/cli.rst.txt
@@ -0,0 +1,41 @@
.. _cli:

Command Line Interface
======================

The command line interface is a simple wrapper around the library.

It provides two main functionalities:

* ``download`` - Downloads samples of either :ref:`domain_record` or HTML from Common Crawl indexes.
* ``extract`` - Downloads the HTML for a Domain Record and extracts its content. It can also take HTML directly and extract the data.

Both functionalities are invoked using ``cmon`` followed by the subcommand and its required arguments.

Examples
--------

.. code-block:: bash

   # Download the first 1000 domain records for example.com
   cmon download --match_type=domain --limit=1000 example.com dr_output record
   # Download the first 100 HTML files for example.com
   cmon download --match_type=domain --limit=100 example.com html_output html
   # Take the domain records downloaded with the first command and extract them using your extractors
   cmon extract config.json extracted_output dr_output/*/*.jsonl record
   # Take the HTML files downloaded with the second command and extract them using your extractors
   cmon extract config.json extracted_output html_output/*/*.html html
74 changes: 74 additions & 0 deletions docs/build/html/_sources/cli/download.rst.txt
@@ -0,0 +1,74 @@
Command Line Download
=====================

The download mode of the ``cmon`` command line tool serves to query and download from Common Crawl indexes.
The following arguments are needed, in this order:

Positional arguments
--------------------

1. url - URL to query.

2. output - Path to output directory.

3. {record,html} - Download mode:

- record: Download record files from Common Crawl.
- html: Download HTML files from Common Crawl.


In html mode, the output directory will contain ``.html`` files, one
for each URL found. In record mode, the output directory will contain
``.jsonl`` files, each containing multiple domain records in JSON format.


Options
-------

--limit LIMIT
Max number of URLs to download.

--since SINCE
Start date in ISO format (e.g., 2020-01-01).

--to TO
End date in ISO format (e.g., 2020-01-01).

--cc_server CC_SERVER
Common Crawl indexes to query. Must provide the whole URL (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index).

--max_retry MAX_RETRY
Max number of retries for a request. Increase this number when requests are failing.

--sleep_step SLEEP_STEP
Number of additional seconds to add to the sleep time between each failed download attempt. Increase this number if the server tells you to slow down.

--match_type MATCH_TYPE
One of exact, prefix, host, domain
Match type for the URL. Refer to cdx-api for more information.

--max_directory_size MAX_DIRECTORY_SIZE
Max number of files per directory.

--filter_non_200
Filter out non-200 status code.

Record mode options
-------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of domain records per output file.



Examples
--------


.. code-block:: bash

   # Download the first 1000 domain records for example.com
   cmon download --match_type=domain --limit=1000 example.com dr_output record
   # Download the first 100 HTML files for example.com
   cmon download --match_type=domain --limit=100 example.com html_output html
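
The query options compose. As a sketch (the values are illustrative, not recommendations),
the first example can be restricted to pages crawled during 2021, keeping only successful
responses and retrying failed requests:

.. code-block:: bash

   # Illustrative combination of the options documented above
   cmon download --match_type=domain --limit=500 \
       --since=2021-01-01 --to=2021-12-31 \
       --filter_non_200 --max_retry=5 --sleep_step=10 \
       example.com dr_output record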
84 changes: 84 additions & 0 deletions docs/build/html/_sources/cli/extract.rst.txt
@@ -0,0 +1,84 @@
Command line Extract
====================

The extract mode of the ``cmon`` command line tool serves to extract data from your downloaded files.
The following arguments are needed, in this order:

Positional arguments
--------------------


1. config_path - Path to config file containing extraction rules.

2. output_path - Path to output directory.

3. files - Files to extract data from.

4. {record,html} - Extraction mode:

- record: Extract data from jsonl (domain record) files.
- html: Extract data from HTML files.

To create a config file, see :ref:`extractor_config`.

Both modes yield the same output format: ``.jsonl`` files containing the extracted data,
one record per line. For each input file a new directory is created in the output directory,
named after the file.

The files created by the download mode can be used directly with the matching extraction mode.
If you have an HTML file, use the html mode to extract it. If you have domain records obtained
some other way (e.g., AWS Athena), please refer to :ref:`domain_record_jsonl`, which describes
how to create ``.jsonl`` files from your domain records for use with the record mode.
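
For illustration, a record-mode run over two input files might leave a layout like the
following (all names here are hypothetical; only the one-directory-per-input-file
structure described above is guaranteed):

.. code-block:: text

   extracted_output/
   ├── dr_file1/
   │   └── 0_extracted.jsonl
   └── dr_file2/
       └── 0_extracted.jsonl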





Optional arguments
------------------

--max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of extractions per output file.

--max_directory_size MAX_DIRECTORY_SIZE
Max number of extraction files per directory.

--n_proc N_PROC
Number of processes to use for extraction. Parallelization is at the file level,
so using more than one process for a single file has no effect.

Record arguments
----------------

--max_retry MAX_RETRY
Max number of WARC download attempts.

--sleep_step SLEEP_STEP
Number of additional seconds to add to the sleep time between each failed download attempt.

Html arguments
--------------

--date DATE
Date of extraction of HTML files in ISO format (e.g., 2021-01-01). The default is today.

--url URL
URL from which the HTML files were downloaded. By default, it will try to infer from the file content.


Examples
--------

.. code-block:: bash

   # Take the domain records downloaded with the first command and extract them using your extractors
   cmon extract config.json extracted_output dr_output/*/*.jsonl record --max_retry 100 --sleep_step 10
   # Take the HTML files downloaded with the second command and extract them using your extractors
   cmon extract config.json extracted_output html_output/*/*.html html --date 2021-01-01 --url https://www.example.com

When you build your extractors, you will appreciate being able to specify the URL and
extraction date of an HTML file, because this information is used during extractor routing.
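
Because parallelization happens at the file level, extraction over many input files can be
sped up with ``--n_proc``; a sketch (the process count is illustrative):

.. code-block:: bash

   # Illustrative: extract many .jsonl files in parallel, roughly one file per worker process
   cmon extract config.json extracted_output dr_output/*/*.jsonl record --n_proc 8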
12 changes: 12 additions & 0 deletions docs/build/html/_sources/cli/index.rst.txt
@@ -0,0 +1,12 @@
Command Line Interface
======================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   cli
   download
   extract


132 changes: 132 additions & 0 deletions docs/build/html/_sources/extraction/config_file.rst.txt
@@ -0,0 +1,132 @@
.. _extractor_config:

Extractor config file
==========================

Structure
---------

In order to specify which extractor to use, you need to create a config file.
Its structure is as follows:

.. code-block:: json

   {
       "extractors_path": "Path to the extractors folder",
       "routes": [
           {
               "regexes": [".*"],
               "extractors": [
                   {
                       "name": "my_extractor",
                       "since": "iso date string",
                       "to": "iso date string"
                   },
                   {
                       "name": "my_extractor2"
                   }
               ]
           },
           {
               "regexes": ["another_regex"],
               "...": "..."
           }
       ]
   }

The ``extractors_path`` is the path to the folder where the extractors are located.

.. note::
   The ``extractors_path`` is relative to the current working directory.


The ``routes`` key is a list of routes. Each route is a dictionary with the following keys:

* ``regexes``: a list of regexes. At least one regex must match the URL for this route to be used.
* ``extractors``: a list of extractors that will be used to extract the data from the URL.


Each extractor has the following keys:

* ``name``: the name of the extractor. This is the name of the Python file without the ``.py`` extension; you can also set a ``NAME`` variable in the extractor file to override this.
* ``since`` [optional]: the starting crawl date for which the extractor is valid. It must be a full ISO date string (e.g. 2009-01-01T00:00:00+00:00).
* ``to`` [optional]: the ending crawl date for which the extractor is valid. The format is the same as for ``since``.

.. note::
   If ``since`` and ``to`` are not specified, the extractor will be used for all crawls.


Example
-------

Given the following folder structure:

.. code-block:: text

   extractors/
   ├── a_extractor.py
   ├── a_extractor2.py
   └── b_extractor.py

and the following config:

.. code-block:: json

   {
       "extractors_path": "./extractors",
       "routes": [
           {
               "regexes": [".*cmon.cz.*"],
               "extractors": [
                   {
                       "name": "a_extractor",
                       "to": "2010-01-01T00:00:00+00:00"
                   },
                   {
                       "name": "a_extractor2",
                       "since": "2010-01-01T00:00:00+00:00"
                   }
               ]
           },
           {
               "regexes": [".*cmon2.cz.*"],
               "extractors": [
                   {
                       "name": "b_extractor"
                   }
               ]
           }
       ]
   }

The following will happen:

* A domain record with the URL http://www.cmon.cz, crawled in 2012, will be extracted using the a_extractor2.py extractor.
* A domain record with the URL http://www.cmon.cz, crawled in 2009, will be extracted using the a_extractor.py extractor.
* A domain record with the URL http://www.cmon2.cz, crawled in 2012, will be extracted using the b_extractor.py extractor.
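
Given this config, a run over previously downloaded domain records could look like the
following sketch (the paths are illustrative):

.. code-block:: bash

   # Assumes config.json is the file above and dr_output/ came from `cmon download ... record`
   cmon extract config.json extracted_output dr_output/*/*.jsonl record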


``__init__.py``
---------------
You might want to put code shared by your extractors into
a common Python file. The problem is that during execution,
the extractors directory is not on the Python path. To make it available,
the ``__init__.py`` file is also loaded (but don't load extractors from it).

Thus you can create an ``__init__.py`` file in the extractors directory with the following content:

.. code-block:: python

   import sys
   from pathlib import Path

   # sys.path entries should be strings, so convert the Path object
   sys.path.append(str(Path(__file__).parent))

which will add the extractors directory to the Python path.


Arbitrary Code Execution
------------------------
.. warning::
   Since the router loads and executes all files in the extractors
   directory, every ``.py`` file in this directory is executed. Thus
   you should not put any untrusted files in this directory.