
Community Clusters in Web Graphs using PySpark

This repository contains a bootstrap ETL that generates a Parquet file of link records from a single WARC file of Common Crawl web data.
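
For orientation, here is a minimal sketch of what such an ETL could look like, assuming the warcio, beautifulsoup4, and tldextract packages. The function extract_links and the file paths are illustrative and hypothetical, not the repository's actual code:

from bs4 import BeautifulSoup
from pyspark.sql import Row, SparkSession
from warcio.archiveiterator import ArchiveIterator
import tldextract

def extract_links(warc_path):
    # Yield one Row per outgoing hyperlink found in an HTML response record.
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            parent = record.rec_headers.get_header("WARC-Target-URI")
            parent_ext = tldextract.extract(parent)
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            for anchor in soup.find_all("a", href=True):
                child = anchor["href"]
                if not child.startswith("http"):
                    continue  # skip relative and non-HTTP links
                child_ext = tldextract.extract(child)
                yield Row(parent=parent,
                          parentTLD=parent_ext.registered_domain,
                          childTLD=child_ext.registered_domain,
                          child=child,
                          childDomain=child_ext.domain,
                          parentDomain=parent_ext.domain)

spark = SparkSession.builder.appName("bootstrap-etl").getOrCreate()
df = spark.createDataFrame(list(extract_links("example.warc.gz")))
df.write.parquet("./bootstrap/spark-warehouse/links")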

Install modules

All code in this repository uses Python 3.6+ and PySpark 2.3+. First, install dependencies:

pip3 install -r requirements.txt

Directory information

  • bootstrap: Contains the ETL code and the condensed Parquet files with URL information
  • data: Contains the results of the graph analysis
  • public: Contains D3 visualizations of the web graph clusters

ETL, Analysis and Visualization:

Use commands in RUNNING.md to perform analysis and generate results that are visualized in D3.

Example DataFrame created by reading the Parquet files:

>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.read.parquet("./bootstrap/spark-warehouse/<your-directory>*")
>>> df.show(5)

+--------------------+--------------------+-----------+--------------------+-----------+------------+
|              parent|           parentTLD|   childTLD|               child|childDomain|parentDomain|
+--------------------+--------------------+-----------+--------------------+-----------+------------+
|http://1separable...|1separable-43v3r....|twitter.com|http://twitter.co...|    twitter|     skyrock|
|      http://3msk.ru|             3msk.ru|    k--k.ru|http://k--k.ru/85...|       k--k|        3msk|
|      http://3msk.ru|             3msk.ru|    com9.ru|http://com9.ru/85...|       com9|        3msk|
|      http://3msk.ru|             3msk.ru|    com9.ru|http://com9.ru/85...|       com9|        3msk|
|      http://3msk.ru|             3msk.ru| top.vy3.ru|http://top.vy3.ru...|        vy3|        3msk|
+--------------------+--------------------+-----------+--------------------+-----------+------------+
only showing top 5 rows
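
Each row above describes a link from a parent page to a child page, keyed by domain. As one illustration of the analysis this enables, the sketch below builds a domain-level graph with the GraphFrames package and runs label propagation to detect communities. This is an assumed workflow, not necessarily the exact pipeline behind RUNNING.md; it requires launching Spark with GraphFrames on the classpath (e.g. --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11):

from graphframes import GraphFrame

# df is the DataFrame of link records read above.
edges = df.selectExpr("parentTLD as src", "childTLD as dst").distinct()
vertices = (edges.selectExpr("src as id")
            .union(edges.selectExpr("dst as id"))
            .distinct())

g = GraphFrame(vertices, edges)

# Label propagation assigns a community label to every domain.
communities = g.labelPropagation(maxIter=5)
communities.groupBy("label").count().orderBy("count", ascending=False).show(5)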

Interactive Analysis:

To develop a workflow and make intuitive visualizations, use the Jupyter notebook graph_mining.ipynb to interactively query the data in PySpark. This requires a PySpark environment to be configured on the system:

export SPARK_HOME=/home/<user>/spark-2.3.1-bin-hadoop2.7/
export PYSPARK_PYTHON=python3

Open and run the graph_mining.ipynb notebook from a shell in which the above variables are exported, so that the shell knows where to find PySpark on your system.
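
Alternatively, the notebook's first cell can locate Spark programmatically. A minimal sketch, assuming the optional findspark package (pip3 install findspark); this is a convenience, not necessarily what graph_mining.ipynb itself does:

import findspark
findspark.init()  # reads SPARK_HOME from the environment

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="graph-mining")
sqlContext = SQLContext(sc)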

About

We apply big data techniques to perform large-scale graph mining on web crawl data, using the Common Crawl dataset: an open dataset hosted on Amazon Web Services that comprises petabytes of web crawl data.
