
Generate links from the page dump instead of using the links dump #21

Closed
wants to merge 4 commits

Conversation

rmmh
Contributor

@rmmh rmmh commented Feb 28, 2018

This lets us ignore noisy links from footer sections (See Also, References, etc.) and from templates, and it simplifies the pipeline.
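A minimal sketch of what "ignoring footer links" could look like — this is an assumption about the approach, not code from the PR: truncate each page's wikitext at the first footer-style heading, so links below it never reach the extractor. The heading list here is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// footer matches headings that typically open the "noisy" tail of an
// article. Which headings count is an assumption for illustration.
var footer = regexp.MustCompile(`(?mi)^==\s*(See also|References|Further reading|External links)\s*==`)

// stripFooter returns the wikitext up to (but not including) the first
// footer heading, or the whole text if no such heading exists.
func stripFooter(wikitext string) string {
	if loc := footer.FindStringIndex(wikitext); loc != nil {
		return wikitext[:loc[0]]
	}
	return wikitext
}

func main() {
	page := "[[Go (language)]] body text.\n== See also ==\n* [[Noise]]\n"
	fmt.Print(stripFooter(page))
}
```

Links that only appear in "See also" or "References" are dropped before extraction ever runs, rather than being filtered out of a separate links dump afterwards.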

Extracting links in Go takes ~160 CPU minutes and uses <8 GB of RAM.

The 14GB multistream page dump consists of many independent bzip2 streams, allowing it to be processed in parallel in <30 real minutes on a 4C/8T workstation, and potentially even less on a large VM instance.

This is based on #20, and includes all of its commits.

Fixes #19.

@jwngr jwngr mentioned this pull request Mar 3, 2018
@jwngr jwngr self-requested a review March 3, 2018 06:21
@jwngr jwngr self-assigned this Mar 3, 2018
rmmh added 3 commits March 2, 2018 22:29
Previously, it would leave partially-processed files (jamming up further
stages), or even continue executing (for Python).
latest-md5sums.txt is not kept up to date, so every download phase would
fail because the hashes don't match.
Aria2c is a command-line torrent downloader that is *much* faster than
individual Wikipedia HTTP mirrors (>10 MB/s for many files). It will be
used automatically if present in $PATH.
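The "use aria2c if present in $PATH" behavior could look roughly like the sketch below. The command names, flags, and fallback (wget over a single HTTP mirror) are assumptions for illustration, not taken from the PR.

```go
package main

import (
	"fmt"
	"os/exec"
)

// downloaderFor prefers aria2c (torrent, fast) when it is on $PATH and
// falls back to plain HTTP via wget otherwise. Flags are illustrative.
func downloaderFor(torrent, url string) *exec.Cmd {
	if path, err := exec.LookPath("aria2c"); err == nil {
		// --seed-time=0 exits as soon as the download completes.
		return exec.Command(path, "--seed-time=0", torrent)
	}
	// Fallback: resumable download from a single HTTP mirror.
	return exec.Command("wget", "-c", url)
}

func main() {
	cmd := downloaderFor(
		"enwiki-latest-pages-articles-multistream.torrent",
		"https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2",
	)
	fmt.Println("will run:", cmd.Args[0])
}
```

`exec.LookPath` is what makes the tool optional: the pipeline keeps working on machines without aria2c, just more slowly.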
@rmmh rmmh force-pushed the generate-from-pages branch from 7ae2158 to bcaf067 on March 3, 2018 06:31
@rmmh rmmh force-pushed the generate-from-pages branch from bcaf067 to 3dcf67a on March 3, 2018 06:33
@jwngr
Owner

jwngr commented Mar 10, 2018

Thanks for this PR! I really appreciate your excitement and effort. Unfortunately, I've never written Go, and I don't feel comfortable accepting so much critical Go code into my database build pipeline. I know how many edge cases I came across simply parsing the Wikipedia SQL files; I can only imagine how many there will be when parsing the page text itself. If a bug does come up in the Go code, I'll likely be unable to solve it without a lot of wasted effort. So I think I'm going to close this and maybe revisit #19 another time.

Thanks again!

@jwngr jwngr closed this Mar 10, 2018