Link Reverse!

Mining of the CommonCrawl Corpus

I mined the CommonCrawl corpus with Spark; my experimental source code is available. I tried several approaches to mining links in documents, but in the end settled on something relatively simple: showing which pages link to a given URL.
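The core idea, listing the pages that link to a given URL, amounts to inverting the link graph. The following is a minimal in-memory sketch in plain Python, not the Spark job used for this project; the names `LinkExtractor` and `reverse_links` and the sample pages are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page they appear on.
                    self.links.append(urljoin(self.base_url, value))


def reverse_links(pages):
    """Map each target URL to the sorted list of page URLs that link to it.

    `pages` is an iterable of (page_url, html) pairs.
    """
    index = defaultdict(set)
    for page_url, html in pages:
        parser = LinkExtractor(page_url)
        parser.feed(html)
        for target in parser.links:
            index[target].add(page_url)
    return {target: sorted(sources) for target, sources in index.items()}


# Hypothetical sample input: two pages that both link to the same target.
pages = [
    ("http://a.example/", '<a href="http://web.mit.edu/">MIT</a>'),
    ("http://b.example/x", '<a href="http://web.mit.edu/">MIT</a> <a href="/y">y</a>'),
]
index = reverse_links(pages)
```

In a distributed setting the same inversion would typically be expressed as emitting (target, source) pairs and grouping by target, which is an assumption about the general shape of such a job rather than a description of the actual code.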

Results

This webapp shows the results. It has two limitations. First, to keep the prototype tractable, I only include links to the domain mit.edu. Second, I have only mined the first two valid segments of CommonCrawl.
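Restricting the results to one domain can be done with a host check on each link target. The sketch below is one way to write such a check, assuming that subdomains such as web.mit.edu should count as part of mit.edu; the function name `links_to_domain` is hypothetical and this is not necessarily how the prototype filters links.

```python
from urllib.parse import urlsplit


def links_to_domain(url, domain="mit.edu"):
    """Return True if the URL's host is `domain` or one of its subdomains."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are skipped.
        return False
    if host is None:
        # Relative links and strings without a host never match.
        return False
    host = host.rstrip(".")
    # Require an exact match or a dot boundary, so "summit.edu" is rejected.
    return host == domain or host.endswith("." + domain)
```

Matching on a dot boundary rather than a plain suffix avoids false positives from unrelated hosts whose names merely end in the same letters.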
