
Generate links from the page dump instead of using the links dump #21

Closed
wants to merge 4 commits

Conversation

rmmh
Contributor

@rmmh rmmh commented Feb 28, 2018

This lets us ignore noisy links from footer sections (See Also, References, etc.) and from templates, and it simplifies the pipeline.
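A minimal sketch of what "ignoring footer links" could look like — this is an assumption about the approach, not code from the PR: truncate each page's wikitext at the first footer-style heading, so links below it never reach the extractor. The heading list here is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// footer matches headings that typically open the "noisy" tail of an
// article. Which headings count is an assumption for illustration.
var footer = regexp.MustCompile(`(?mi)^==\s*(See also|References|Further reading|External links)\s*==`)

// stripFooter returns the wikitext up to (but not including) the first
// footer heading, or the whole text if no such heading exists.
func stripFooter(wikitext string) string {
	if loc := footer.FindStringIndex(wikitext); loc != nil {
		return wikitext[:loc[0]]
	}
	return wikitext
}

func main() {
	page := "[[Go (language)]] body text.\n== See also ==\n* [[Noise]]\n"
	fmt.Print(stripFooter(page))
}
```

Links that only appear in "See also" or "References" are dropped before extraction ever runs, rather than being filtered out of a separate links dump afterwards.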

Extracting links in Go takes ~160 CPU minutes and uses <8 GB of RAM.

The 14GB multistream page dump consists of many independent bzip2 streams, allowing it to be processed in parallel in <30 real minutes on a 4C/8T workstation, and potentially even less on a large VM instance.

This is based on #20, and includes all of its commits.

Fixes #19.

@jwngr jwngr mentioned this pull request Mar 3, 2018
@jwngr jwngr self-requested a review March 3, 2018 06:21
@jwngr jwngr self-assigned this Mar 3, 2018
rmmh added 3 commits March 2, 2018 22:29
Previously, it would leave partially-processed files (jamming up further
stages), or even continue executing (for Python).
latest-md5sums.txt is not kept up to date, so every download phase would
fail because the hashes don't match.
Aria2c is a command-line torrent downloader that is *much* faster than
individual Wikipedia HTTP mirrors (>10 MB/s for many files). It will be
used automatically if present in $PATH.
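The "use aria2c if present in $PATH" behavior could look roughly like the sketch below. The command names, flags, and fallback (wget over a single HTTP mirror) are assumptions for illustration, not taken from the PR.

```go
package main

import (
	"fmt"
	"os/exec"
)

// downloaderFor prefers aria2c (torrent, fast) when it is on $PATH and
// falls back to plain HTTP via wget otherwise. Flags are illustrative.
func downloaderFor(torrent, url string) *exec.Cmd {
	if path, err := exec.LookPath("aria2c"); err == nil {
		// --seed-time=0 exits as soon as the download completes.
		return exec.Command(path, "--seed-time=0", torrent)
	}
	// Fallback: resumable download from a single HTTP mirror.
	return exec.Command("wget", "-c", url)
}

func main() {
	cmd := downloaderFor(
		"enwiki-latest-pages-articles-multistream.torrent",
		"https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2",
	)
	fmt.Println("will run:", cmd.Args[0])
}
```

`exec.LookPath` is what makes the tool optional: the pipeline keeps working on machines without aria2c, just more slowly.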
@rmmh rmmh force-pushed the generate-from-pages branch from 7ae2158 to bcaf067 on March 3, 2018 06:31
@rmmh rmmh force-pushed the generate-from-pages branch from bcaf067 to 3dcf67a on March 3, 2018 06:33
@jwngr
Owner

jwngr commented Mar 10, 2018

Thanks for this PR! I really appreciate your excitement and effort. Unfortunately, I've never written Go, and I don't feel comfortable accepting so much critical Go code into my database build pipeline. I know how many edge cases I came across simply parsing the Wikipedia SQL files; I can only imagine how many there will be when parsing the page text itself. If a bug does come up in the Go code, I'll likely be unable to solve it without a lot of wasted effort. So I think I'm going to close this and maybe revisit #19 another time.

Thanks again!

@jwngr jwngr closed this Mar 10, 2018