Link Reverse!

Mining of the CommonCrawl Corpus

The CommonCrawl corpus was mined with Spark. My experimental source code is available. I ran various experiments on mining links in documents but, in the end, settled on something relatively simple: showing which pages link to a given URL.
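To make the "which pages link to a given URL" idea concrete, here is a minimal sketch in plain Python. It is not the project's Spark code: the function names `extract_links` and `reverse_links` are hypothetical, and the regex-based link extraction is a simplifying assumption (a real crawl would want a proper HTML parser).

```python
import re
from collections import defaultdict
from urllib.parse import urljoin

# Assumption: a simple regex is good enough to find href attributes in <a> tags.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(page_url, html):
    """Return the link targets in html, resolved to absolute URLs."""
    return [urljoin(page_url, href) for href in HREF_RE.findall(html)]


def reverse_links(pages):
    """Invert (page_url, html) pairs into {target_url: [pages linking to it]}."""
    index = defaultdict(set)
    for page_url, html in pages:
        for target in extract_links(page_url, html):
            index[target].add(page_url)
    # Sort the sources so the result is deterministic.
    return {target: sorted(sources) for target, sources in index.items()}
```

In a Spark job the same inversion would roughly correspond to emitting `(target, source)` pairs per page and then grouping by target; the loop above does the same thing on a single machine.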

Results

This webapp shows the results. There are two limitations. First, to keep the prototype tractable, I only include links to the domain mit.edu. Second, I've only mined the first two valid segments in CommonCrawl.
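The mit.edu restriction can be pictured as a filter on each link target. The sketch below is an assumption about how such a filter might look, not the project's actual code; `links_to_domain` is a hypothetical helper, and treating subdomains such as web.mit.edu as part of the domain is also an assumption.

```python
from urllib.parse import urlsplit


def links_to_domain(url, domain="mit.edu"):
    """Return True if url's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    # Require a dot before the domain so that e.g. summit.edu does not match.
    return host == domain or host.endswith("." + domain)
```

For example, `links_to_domain("http://web.mit.edu/")` is true, while `links_to_domain("http://summit.edu/")` is false.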