Add German Wiktionary extractor code to parse page and extract glosses #342
There is no authoritative list of languages in the German Wiktionary within the dump file itself. The list of languages is generated from the page "Hilfe:Sprachkürzel" and is probably incomplete. Until another source is found, this is the best we can do. The list will then need to be completed manually.
The POS subtitles have been collected from the occurrences of the {{Wortart|...}} template in the German Wiktionary. The ones referring to actual POS are listed here and mapped to the POS tags used by the en and fr extractors.
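To illustrate the idea of such a mapping, here is a minimal sketch. The dictionary entries, the name `POS_SUBTITLES`, the helper `pos_tags_from_heading` and the regex are assumptions made for illustration; the actual table in this pull request is longer and may use different keys or tags.

```python
import re

# Hypothetical excerpt of a subtitle-to-tag mapping; not the PR's actual table.
POS_SUBTITLES = {
    "Substantiv": "noun",
    "Verb": "verb",
    "Adjektiv": "adj",
    "Adverb": "adv",
}

# Captures the first argument of each {{Wortart|...}} template call.
WORTART_RE = re.compile(r"\{\{Wortart\|([^|}]+)")


def pos_tags_from_heading(heading: str) -> list[str]:
    """Return the POS tags for every known Wortart subtitle in a heading."""
    tags = []
    for name in WORTART_RE.findall(heading):
        tag = POS_SUBTITLES.get(name.strip())
        if tag is not None:  # subtitles that are not actual POS are skipped
            tags.append(tag)
    return tags


example_tags = pos_tags_from_heading("=== {{Wortart|Substantiv|Deutsch}}, {{n}} ===")
```

Subtitles that are missing from the mapping are silently skipped in this sketch; a real extractor would probably want to log them so the list can be completed.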
The page module parses a page from the German Wiktionary, extracts the language of each language entry and the POS of each POS section, and fixes the subsection hierarchy therein. The gloss module extracts the glosses from the relevant section. It's still quite crude and doesn't capture all subtleties of gloss entries.
Use #XXX for anything that isn't yet finished; it's just the convention in the codebase. debugs and warnings are... I don't actually use warnings myself. Debug is for things that could be correct but possibly aren't, and for printing out stuff you need to know for debugging; it's the biggest category by far. I guess warnings are for more explicit "this shouldn't be like this" stuff, possibly caused by something on the Wiktionary side that isn't our own bug. But it's vague, yeah. The sortid is just something I threw in there so that the log files have the same error kind next to each other when sorting, so it's just an arbitrary string. I usually put the file name, the line number (approximate, doesn't really matter) and recently also the ISO date to really prevent collisions.
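As a rough illustration of that sortid convention, the sketch below builds such a string from a file name, an approximate line number and a date. The helper `make_sortid`, the `/` separator and the compact `YYYYMMDD` date form are assumptions for this example; since the sortid is an arbitrary string, any stable and unique layout would do.

```python
from datetime import date


def make_sortid(module: str, line: int, day: date) -> str:
    """Build a sortid such as 'page/120/20230915' (hypothetical layout)."""
    # The date uses the ISO 8601 basic format (YYYYMMDD) to reduce collisions.
    return f"{module}/{line}/{day.strftime('%Y%m%d')}"


example_sortid = make_sortid("page", 120, date(2023, 9, 15))
```

In practice the string is usually typed by hand next to the debug call; the helper only makes the three parts explicit.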
I have moved the list of German form table templates to a different file. That makes it a little clearer, I think. It will be useful to have this comprehensive list of templates containing form data if we choose to extract it later, though there is an argument to be made that the English Wiktionary edition most likely already includes all this form data. So it deserves some investigation whether extracting form data from each Wiktionary edition is warranted.
Since there was no further comment, I decided to keep the test for @xxyzz Any objection to merging this pull request?
xxyzz is on vacation right now, but he'll be back at some point. I haven't kept up with the stuff happening in these extractors because I've been hitting my head on the wall trying to fix elusive, non-deterministic Lua bugs, but I had a small comment for one of the tests: when you use "gloss1" for each gloss in separate sections, you can't actually tell if the test fails in cases where the data from one section is accidentally contaminating another. This is unlikely, but possible if something goes spectacularly wrong. Otherwise what I saw looks fine; de.wiktionary has different testing needs, as do all other editions, so all we can do is trust that those who implement the extractors know those idiosyncrasies and differences best.
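To make the point about identical fixture values concrete, here is a small self-contained sketch. Both functions are toy stand-ins written for this example, not code from the extractor: `buggy_extract` deliberately leaks the first section's glosses into every entry.

```python
def expected_entries(sections):
    """What a correct extractor would return: each section keeps its own glosses."""
    return [{"pos": pos, "glosses": list(glosses)} for pos, glosses in sections]


def buggy_extract(sections):
    """Deliberately wrong: every entry receives the glosses of the first section."""
    first_glosses = sections[0][1]
    return [{"pos": pos, "glosses": list(first_glosses)} for pos, _ in sections]


# Fixture with the same gloss text in every section.
same = [("noun", ["gloss1"]), ("verb", ["gloss1"])]
# Fixture with a distinct gloss text per section.
distinct = [("noun", ["noun gloss"]), ("verb", ["verb gloss"])]

# With identical values the contamination is invisible to an equality check.
bug_hidden = buggy_extract(same) == expected_entries(same)
# With distinct values the same bug makes the comparison fail.
bug_caught = buggy_extract(distinct) != expected_entries(distinct)
```

Giving each section its own gloss text costs nothing and lets the existing assertion catch this kind of leak.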
@kristian-clausal Fair point. And that actually solves my philosophical problem with the test. 😄 While still including the
The commits do not seem to touch anything outside of the de.wiktionary extractor stuff, so as long as the tests pass I see no problem merging if you feel like you've reached a milestone.
@@ -0,0 +1,119 @@
[
"Deutsch Substantiv Übersicht",
I'd prefer not to include lots of hard-coded data at the beginning. I think this long template list is unnecessary for these reasons:
- most can be found by a simple `template_name.endswith("Übersicht")`
- and all these table templates won't be pre-expanded, so they don't need to be included in the `do_not_pre_expand` parameter. If they are pre-expanded, then the problem is that the `Wtp.analyze_templates()` function marks too many templates for pre-expansion. It should only be called by the English Wiktionary code, and in the future we should make the pre-expand rules configurable, for example only checking whether the template contains a list and ignoring other cases.
I'd recommend you also take a look at the newer French extractor code, which is simpler than the English extractor and doesn't use much hard-coded data or regex.
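The suffix check suggested above could look roughly like this. The function name `is_form_table_template` and the optional `extra_names` parameter are assumptions for illustration; the comment only says that most, not all, table templates end in "Übersicht", so a small explicit set may still be needed for the exceptions.

```python
def is_form_table_template(
    template_name: str, extra_names: frozenset[str] = frozenset()
) -> bool:
    """Guess whether a template is a form table, based on its name."""
    name = template_name.strip()
    # Most German form table templates end in "Übersicht"; the rest
    # would have to be listed explicitly (hypothetical extra_names set).
    return name.endswith("Übersicht") or name in extra_names


is_table = is_form_table_template("Deutsch Substantiv Übersicht")
```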
@@ -0,0 +1,72 @@
# Export German Wiktionary language data to JSON.
The languages JSON files are created by the code here: https://github.com/tatuylonen/wiktextract/tree/master/languages
This code should be moved there and called from the `get_data.py` file.
Hi there!
After the great restructuring you did over the summer to support and maintain different extractors for different Wiktionary editions, I thought to jump right in, try out this new framework and start building an extractor for the German Wiktionary.
I will try to keep each pull request small but complete. Is that in your interest?
General questions
I have modelled my code on the French extractor and tried to follow the general style of the code base. In this regard maybe two preliminary questions:
What is the `sortid` supposed to look like (especially the number at the end)?