This repository contains a bootstrap ETL that generates a parquet file of records from a single WARC file of Common Crawl web data.
All code in this repository uses Python 3.6+ and PySpark 2.3+. First, install dependencies:
pip3 install -r requirements.txt
bootstrap: Contains ETL code and the condensed parquet files with URL information
data: Contains results of graph analysis
public: Contains D3 visualizations of web graph clusters
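For orientation, the link-extraction step of the bootstrap ETL roughly follows the shape sketched below. This is only an illustrative sketch, assuming the warcio and BeautifulSoup packages and a hypothetical helper name; the actual implementation lives in bootstrap/ and may differ:

# Sketch only: stream one Common Crawl WARC file and emit (parent, child) URL pairs.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_link_pairs(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":       # keep only fetched pages
                continue
            parent = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            for anchor in soup.find_all("a", href=True):
                child = anchor["href"]
                if child.startswith("http"):        # skip relative and fragment links
                    yield parent, child

Pairs like these, once enriched with the domain and TLD columns shown below, are what end up in the parquet files.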
Use the commands in RUNNING.md to perform the analysis and generate the results that are visualized in D3.
Example dataframe created by reading parquet files:
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.read.parquet("./bootstrap/spark-warehouse/<your-directory>*")
>>> df.show(5)
+--------------------+--------------------+-----------+--------------------+-----------+------------+
| parent| parentTLD| childTLD| child|childDomain|parentDomain|
+--------------------+--------------------+-----------+--------------------+-----------+------------+
|http://1separable...|1separable-43v3r....|twitter.com|http://twitter.co...| twitter| skyrock|
| http://3msk.ru| 3msk.ru| k--k.ru|http://k--k.ru/85...| k--k| 3msk|
| http://3msk.ru| 3msk.ru| com9.ru|http://com9.ru/85...| com9| 3msk|
| http://3msk.ru| 3msk.ru| com9.ru|http://com9.ru/85...| com9| 3msk|
| http://3msk.ru| 3msk.ru| top.vy3.ru|http://top.vy3.ru...| vy3| 3msk|
+--------------------+--------------------+-----------+--------------------+-----------+------------+
only showing top 5 rows
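From this frame, domain-to-domain edges for the graph analysis can be derived with a simple aggregation. The query below is only an illustrative sketch; the edge construction used by the actual analysis in RUNNING.md may differ:

>>> from pyspark.sql import functions as F
>>> edges = df.groupBy("parentDomain", "childDomain").count()  # edge weight = number of links
>>> edges.orderBy(F.desc("count")).show(5)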
To develop a workflow and build intuitive visualizations, use the Jupyter notebook graph_mining.ipynb
to interactively query the data in PySpark. This requires the PySpark
environment to be configured on the system:
export SPARK_HOME=/home/<user>/spark-2.3.1-bin-hadoop2.7/
export PYSPARK_PYTHON=python3
Open and run the graph_mining.ipynb
notebook from a shell in which the above commands have been run, so that Jupyter knows where to find PySpark on your system.
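If the notebook kernel does not inherit those variables, one common pattern is to bootstrap PySpark inside the notebook itself. This is a hedged sketch assuming the findspark package is available (it is not necessarily listed in requirements.txt):

import findspark
findspark.init()                      # locates PySpark via SPARK_HOME

from pyspark.sql import SparkSession
# Local session for interactive queries; adjust master/appName as needed.
spark = SparkSession.builder.master("local[*]").appName("graph_mining").getOrCreate()
sc = spark.sparkContext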