Processed 172,803,584 links and ranked 8,166,507 pages from Wikipedia using the PageRank algorithm.
The top-ranked pages, excluding pages that are internal or administrative for Wikipedia, are in rank.txt.
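For orientation, here is a minimal in-memory sketch of PageRank by power iteration. It is not the code used to produce rank.txt: the function name `pagerank`, the damping factor of 0.85, the tolerance, and the choice to spread the rank of pages without out-links evenly over all pages are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=100):
    """Rank pages by power iteration.

    links[i] is the list of page indexes that page i links to.
    All parameter defaults are illustrative assumptions.
    """
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[i] for i in range(n) if not links[i])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = [base] * n
        for src, targets in enumerate(links):
            if targets:
                # Each page passes an equal share of its rank to every target.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank
```

For example, `pagerank([[1, 2], [2], [0]])` ranks page 2 highest, because it receives links from both other pages.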
Process the raw wiki links data obtained using wikicrush with help from Kaiyuan.
- Scan through indexbi.bin. Create an offset -> index HashMap. Store the offsets in offset.bin, in the order of their indexes, so that article titles can be recovered later.
- Read indexbi.bin. Write only the following to the new file links.bin: [ pageNum, { pageIdx, linkNum, [ linkedPageIdx, ... ] }, ... ]. Every number is converted from little-endian to big-endian.
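The two passes above can be sketched roughly as follows. This is an illustration under loud assumptions, not the project's code: the input is a simplified, hypothetical layout of little-endian 32-bit integers in which each record is `[link_count, target_offset, ...]`, whereas the real wikicrush file carries additional header fields that are not modeled here. The function names `build_offset_index` and `convert` are made up for this sketch, and the results are returned as bytes instead of being written to offset.bin and links.bin.

```python
import struct


def build_offset_index(raw):
    """First pass: map each record's byte offset to a dense 0-based index.

    Assumed toy layout: repeated little-endian int32 records of
    [link_count, target_offset, ...]. This is NOT the full wikicrush format.
    """
    offset_to_index = {}
    pos = 0
    while pos < len(raw):
        offset_to_index[pos] = len(offset_to_index)
        (link_count,) = struct.unpack_from("<i", raw, pos)
        pos += 4 * (1 + link_count)
    return offset_to_index


def convert(raw):
    """Second pass: return (offset_bytes, links_bytes), both big-endian."""
    offset_to_index = build_offset_index(raw)
    # Offsets are stored in index order so index -> offset -> title works later.
    offsets = sorted(offset_to_index, key=offset_to_index.get)
    offset_bytes = struct.pack(f">{len(offsets)}i", *offsets)

    out = [struct.pack(">i", len(offsets))]  # pageNum
    for offset in offsets:
        (link_count,) = struct.unpack_from("<i", raw, offset)
        targets = struct.unpack_from(f"<{link_count}i", raw, offset + 4)
        # Replace each target byte offset with the target's page index.
        linked = [offset_to_index[t] for t in targets]
        out.append(
            struct.pack(f">{2 + link_count}i",
                        offset_to_index[offset], link_count, *linked)
        )
    return offset_bytes, b"".join(out)
```

Using `"<"` when reading and `">"` when writing is what performs the little-endian to big-endian conversion; no separate byte-swapping step is needed.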
In the cache-efficient version, the links file is divided into several files, based on Haveliwala's algorithm, to exploit locality.
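A rough sketch of how such a split might look, following the block-based idea in Haveliwala's algorithm: the links are partitioned by destination index so that each pass only updates one block of the rank vector. The function name `partition_links` is hypothetical, in-memory lists stand in for the separate files, and the equal-sized block layout is an assumption made for illustration.

```python
def partition_links(links, num_blocks):
    """Split adjacency lists into num_blocks pieces by destination index.

    links[src] lists the pages that src links to. Piece b keeps, per source,
    only the targets that fall in destination block b, together with the
    source's full out-degree so its rank share can still be computed.
    """
    n = len(links)
    block_size = -(-n // num_blocks)  # ceiling division
    pieces = [[] for _ in range(num_blocks)]
    for src, targets in enumerate(links):
        buckets = {}
        for dst in targets:
            buckets.setdefault(dst // block_size, []).append(dst)
        for block, dsts in buckets.items():
            pieces[block].append((src, len(targets), dsts))
    return pieces
```

When piece b is processed, every write lands in the same contiguous slice of the rank vector, which is what lets that slice stay in cache or memory for the whole pass.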