How do I parse a Wikipedia dump file? #294
This library parses the wikitext only. You need to use another library to parse the XML file to get the wikitext. See e.g. https://stackoverflow.com/questions/16533153/parse-xml-dump-of-a-mediawiki-wiki
The approach at that link loads the entire file into memory; that won't be possible with a full dump.
Then you need to find a different parser.
Check out mwxml, a library designed for this specific task (parsing Wikipedia XML dumps).
The mwxml Dump class is an iterator which reads pages one at a time, so you can avoid loading the whole file at once.
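A minimal sketch of how the two libraries could be combined, assuming mwxml and mwparserfromhell are installed; the dump path and the "infobox" name check are placeholders, not part of either library:

```python
import mwxml
import mwparserfromhell

# Placeholder path; point it at an uncompressed pages-articles XML dump.
dump_path = "enwiki-latest-pages-articles.xml"

with open(dump_path, "rb") as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:              # pages are streamed one at a time
        for revision in page:      # a current dump has one revision per page
            if revision.text is None:
                continue
            wikicode = mwparserfromhell.parse(revision.text)
            for template in wikicode.filter_templates():
                name = str(template.name).strip().lower()
                # Crude infobox check; adjust to your needs.
                if name.startswith("infobox"):
                    print(page.title, str(template.name).strip())
```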
Thanks for the library!
I have the latest XML dump file, and I would like to use your library to parse the infoboxes from the dump. However, I don't see any function to stream the file. Could you share an example of how I could pass the content of a page to the mwparserfromhell.parse(text) function to extract any infobox?
If it helps, this is what I have so far:
iter_lines() is a function that uses ET.iterparse() to parse the XML incrementally; it returns a generator.
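The actual iter_lines() code isn't included in the thread, so the following is only a rough sketch of what such a streaming generator might look like, fed into mwparserfromhell to pull out infoboxes. The function name iter_pages, the dump filename, and the infobox-name check are hypothetical:

```python
import xml.etree.ElementTree as ET
import mwparserfromhell

def iter_pages(path):
    """Yield (title, wikitext) pairs by streaming the dump with ET.iterparse().

    Hypothetical reconstruction; the original iter_lines() was not posted.
    Tags are matched by suffix so the MediaWiki export namespace does not
    have to be hard-coded.
    """
    title = None
    for event, elem in ET.iterparse(path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # strip the XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "text":
            yield title, elem.text or ""
        elif tag == "page":
            elem.clear()                    # free memory for the finished page

for title, text in iter_pages("enwiki-latest-pages-articles.xml"):
    wikicode = mwparserfromhell.parse(text)
    for template in wikicode.filter_templates():
        if str(template.name).strip().lower().startswith("infobox"):
            print(title, str(template.name).strip())
```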