
Add German Wiktionary extractor code to parse page and extract glosses #342

Merged
merged 8 commits into tatuylonen:master on Oct 5, 2023

Conversation

@empiriker
Contributor

empiriker commented Sep 20, 2023

Hi there!

After the great restructuring you did over the summer to support and maintain different extractors for different Wiktionary editions, I thought I'd jump right in, try out this new framework, and start building an extractor for the German Wiktionary.

I will try to keep each pull request small but complete. Does that suit you?

General questions

I have modelled my code on the French extractor and tried to follow the general style of the code base. In this regard, two preliminary questions:

  1. Do you still use the # XXX flag to mark todos/stuff that isn't properly extracted yet?
  2. Is there any guidance on when to log a warning vs. a debug statement? What is the sortid supposed to look like (especially the number at the end)?

There is no authoritative list of languages in the German Wiktionary
within the dump file itself. The list of languages is generated from
the page "Hilfe:Sprachkürzel" and is probably incomplete. Until another
source is found, this is the best we can do; the list will need to be
completed manually.
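
For illustration, the generated data might boil down to a simple name-to-code mapping like the sketch below. This is a hypothetical shape, not the PR's actual file; the entries and the lookup direction are assumptions.

```python
# Hypothetical sketch of the generated language data; the PR's actual
# structure may differ. Entries come from the page "Hilfe:Sprachkürzel"
# and the list is known to be incomplete.
LANGUAGE_CODES: dict[str, str] = {
    "Deutsch": "de",
    "Englisch": "en",
    "Französisch": "fr",
}

def get_lang_code(lang_name: str) -> str | None:
    # Returns None for languages missing from "Hilfe:Sprachkürzel";
    # these gaps are what must be completed manually.
    return LANGUAGE_CODES.get(lang_name)
```
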
The POS subtitles have been collected from the occurrences of the
{{Wortart|...}} template in the German Wiktionary. The ones referring to
actual POS are listed here and mapped to the POS tags used by the
en and fr extractors.
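
As a short illustrative excerpt (not the PR's full table, and the exact tag choices are assumptions):

```python
# Illustrative excerpt of a {{Wortart|...}} to POS-tag mapping; the real
# list is much longer and covers only values that denote an actual POS.
POS_MAP: dict[str, str] = {
    "Substantiv": "noun",
    "Verb": "verb",
    "Adjektiv": "adj",
    "Adverb": "adv",
}
```
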
The page module parses a page from the German Wiktionary, extracts
the language of each language entry and the POS of each POS section,
and fixes the subsection hierarchy therein.
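
In outline, that parse could look like the sketch below. This is an assumed shape, not the PR's code; extract_language_entry is a hypothetical helper.

```python
from wikitextprocessor import NodeKind, WikiNode

def parse_page_sketch(wxr, page_title: str, page_text: str) -> None:
    # Hypothetical outline of the traversal, not the PR's actual code.
    wxr.wtp.start_page(page_title)
    tree = wxr.wtp.parse(page_text)
    for node in tree.children:
        # Language entries sit under level-2 headings such as
        # "Wort ({{Sprache|Deutsch}})".
        if isinstance(node, WikiNode) and node.kind == NodeKind.LEVEL2:
            extract_language_entry(wxr, node)  # hypothetical helper
```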

The gloss module extracts the glosses from the relevant section. It's
still quite crude and doesn't capture all subtleties of gloss entries.
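
Conceptually, the gloss extraction amounts to something like this crude sketch (the function and its signature are assumptions; clean_node is wiktextract's wikitext-to-text helper):

```python
from wikitextprocessor import NodeKind, WikiNode
from wiktextract.page import clean_node

def extract_glosses_sketch(wxr, senses: list, list_node: WikiNode) -> None:
    # Crude sketch: each ":[1] ..." item under "Bedeutungen" becomes one
    # sense with a single gloss string; sub-senses are ignored here.
    for item in list_node.children:
        if isinstance(item, WikiNode) and item.kind == NodeKind.LIST_ITEM:
            gloss = clean_node(wxr, None, item.children)
            if gloss:
                senses.append({"glosses": [gloss]})
```
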
@kristian-clausal
Collaborator

Use #XXX for anything that isn't finished yet; it's just the convention in the codebase.

Debugs and warnings are... I don't actually use warnings myself. Debugs are for things that could be correct but possibly aren't, and for printing out stuff you need to know for debugging; that's the biggest category by far. I guess warnings are for more explicit "this shouldn't be like this" stuff, possibly caused by things on the Wiktionary side that aren't our own bugs. But it's vague, yeah.

The sortid is just something I threw in there so that the log files have the same error kind next to each other when sorted; it's just an arbitrary string. I usually put in the file name, the line number (approximate, it doesn't really matter), and recently also the ISO date to really prevent collisions.
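
So a typical call following that convention might look like this (illustrative values only):

```python
# Illustrative only: the sortid keeps identical error kinds adjacent when
# the log file is sorted (file path, approximate line, ISO date).
wxr.wtp.debug(
    "unexpected template in gloss section",
    sortid="extractor/de/gloss/42/20230920",
)
```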

@empiriker
Contributor Author

I have moved the list of German form table templates to a different file. That makes it a little clearer, I think.

I think it will be useful to have this comprehensive list of templates containing form data if we choose to extract it later. Though there is an argument to be made that the English Wiktionary edition most likely already includes all this form data, so it deserves some investigation whether extracting form data from each Wiktionary edition is warranted.

@empiriker
Contributor Author

Since there was no further comment, I decided to keep the test for parse_section() that makes the most sense to me.

@xxyzz Any objection to merging this pull request?

@kristian-clausal
Collaborator

xxyzz is on vacation right now, but he'll be back at some point. I haven't kept up with what's happening in these extractors because I've been hitting my head against the wall trying to fix elusive, non-deterministic Lua bugs, but I had a small comment on one of the tests: when you use "gloss1" for each gloss in separate sections, you can't actually tell whether the test fails in cases where the data from one section is accidentally contaminating another. This is unlikely, but possible if something goes spectacularly wrong. Otherwise, what I saw looks fine; de.wiktionary has different testing needs, as do all the other editions, so all we can do is trust that those who implement the extractors know those idiosyncrasies and differences best.
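
To make the concern concrete, here is a made-up fixture (not the actual test data) showing why distinct strings matter:

```python
# Made-up illustration, assuming page_data holds the entries produced from
# two sections of one page. With identical gloss strings, a bug that leaks
# section 1's gloss into section 2 would still pass:
bad_expected = [
    {"senses": [{"glosses": ["gloss1"]}]},
    {"senses": [{"glosses": ["gloss1"]}]},
]
# Distinct strings make the same contamination bug fail:
good_expected = [
    {"senses": [{"glosses": ["gloss1"]}]},
    {"senses": [{"glosses": ["gloss2"]}]},
]
```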

@empiriker
Contributor Author

empiriker commented Oct 5, 2023

@kristian-clausal Fair point. And that actually solves my philosophical problem with the test. 😄 While it still includes the extract_glosses function in its test scope, it at least tests something that unit testing extract_glosses alone will never cover: how its results get distributed to the different page_data items.

@kristian-clausal kristian-clausal merged commit ca9e913 into tatuylonen:master Oct 5, 2023
3 checks passed
@kristian-clausal
Collaborator

The commits do not seem to touch anything outside of the de.wiktionary extractor stuff, so as long as the tests pass I see no problem merging if you feel like you've reached a milestone.

wiktextract/extractor/de/page.py (resolved review thread)
@@ -0,0 +1,119 @@
[
"Deutsch Substantiv Übersicht",
Collaborator

I'd prefer not to include lots of hard-coded data at the beginning; I think this long template list is unnecessary, for these reasons:

  • most of these can be found with a simple template_name.endswith("Übersicht") check (see the sketch after this list)
  • these table templates won't be pre-expanded, so they don't need to be included in the do_not_pre_expand parameter. If they are pre-expanded, then the problem is that the Wtp.analyze_templates() function marks too many templates for pre-expansion; it should only be called by the English Wiktionary code, and in the future we should make the pre-expand rules configurable, e.g. only check whether the template contains a list and ignore other cases.
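
A sketch of the suggested check (the function name is hypothetical):

```python
def is_uebersicht_template(template_name: str) -> bool:
    # Hypothetical helper: most German form-table templates end in
    # "Übersicht" (e.g. "Deutsch Substantiv Übersicht"), so a suffix check
    # replaces the long hard-coded list; exceptions get handled explicitly.
    return template_name.endswith("Übersicht")
```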

I'd also recommend taking a look at the newer French extractor code, which is simpler than the English extractor and doesn't use much hard-coded data or regex.

@@ -0,0 +1,72 @@
# Export German Wiktionary language data to JSON.
Collaborator

The languages JSON files are created from the code here: https://github.com/tatuylonen/wiktextract/tree/master/languages

This code should be moved there and called from the get_data.py file.
