
Add German Wiktionary extractor code to parse page and extract glosses #342

Merged
merged 8 commits into tatuylonen:master on Oct 5, 2023

Conversation

@empiriker
Contributor

empiriker commented Sep 20, 2023

Hi there!

After the great restructuring you did over the summer to support and maintain different extractors for different Wiktionary editions, I thought I'd jump right in, try out this new framework, and start building an extractor for the German Wiktionary.

I will try to keep each pull request small but complete. Does that suit you?

General questions

I have modelled my code on the French extractor and tried to follow the general style of the code base. In this regard, two preliminary questions:

  1. Do you still use the # XXX flag to mark todos/stuff that isn't properly extracted yet?
  2. Is there any guidance on when to log a warning vs. a debug statement? What is the sortid supposed to look like (especially the number at the end)?

There is no authoritative list of languages in the German Wiktionary
within the dump file itself. The list of languages is generated from
the page "Hilfe:Sprachkürzel" and is probably incomplete. Until another
source is found, this is the best we can do; the list will need to be
completed manually.
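
For illustration, the generated data might boil down to a simple name-to-code mapping like the sketch below. This is a hypothetical shape, not the PR's actual file; the entries and the lookup direction are assumptions.

```python
# Hypothetical sketch of the generated language data; the PR's actual
# structure may differ. Entries come from the page "Hilfe:Sprachkürzel"
# and the list is known to be incomplete.
LANGUAGE_CODES: dict[str, str] = {
    "Deutsch": "de",
    "Englisch": "en",
    "Französisch": "fr",
}

def get_lang_code(lang_name: str) -> str | None:
    # Returns None for languages missing from "Hilfe:Sprachkürzel";
    # these gaps are what must be completed manually.
    return LANGUAGE_CODES.get(lang_name)
```
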
The POS subtitles have been collected from the occurrences of the
{{Wortart|...}} template in the German Wiktionary. The ones referring to
actual POS are listed here and mapped to the POS tags used by the
en and fr extractors.
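
As a short illustrative excerpt (not the PR's full table, and the exact tag choices are assumptions):

```python
# Illustrative excerpt of a {{Wortart|...}} to POS-tag mapping; the real
# list is much longer and covers only values that denote an actual POS.
POS_MAP: dict[str, str] = {
    "Substantiv": "noun",
    "Verb": "verb",
    "Adjektiv": "adj",
    "Adverb": "adv",
}
```
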
The page module parses a page from the German Wiktionary, extracts
the language of each language entry and the POS of each POS section,
and fixes the subsection hierarchy therein.
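
In outline, that parse could look like the sketch below. This is an assumed shape, not the PR's code; extract_language_entry is a hypothetical helper.

```python
from wikitextprocessor import NodeKind, WikiNode

def parse_page_sketch(wxr, page_title: str, page_text: str) -> None:
    # Hypothetical outline of the traversal, not the PR's actual code.
    wxr.wtp.start_page(page_title)
    tree = wxr.wtp.parse(page_text)
    for node in tree.children:
        # Language entries sit under level-2 headings such as
        # "Wort ({{Sprache|Deutsch}})".
        if isinstance(node, WikiNode) and node.kind == NodeKind.LEVEL2:
            extract_language_entry(wxr, node)  # hypothetical helper
```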

The gloss module extracts the glosses from the relevant section. It's
still quite crude and doesn't capture all subtleties of gloss entries.
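
Conceptually, the gloss extraction amounts to something like this crude sketch (the function and its signature are assumptions; clean_node is wiktextract's wikitext-to-text helper):

```python
from wikitextprocessor import NodeKind, WikiNode
from wiktextract.page import clean_node

def extract_glosses_sketch(wxr, senses: list, list_node: WikiNode) -> None:
    # Crude sketch: each ":[1] ..." item under "Bedeutungen" becomes one
    # sense with a single gloss string; sub-senses are ignored here.
    for item in list_node.children:
        if isinstance(item, WikiNode) and item.kind == NodeKind.LIST_ITEM:
            gloss = clean_node(wxr, None, item.children)
            if gloss:
                senses.append({"glosses": [gloss]})
```
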
@kristian-clausal
Collaborator

Use #XXX for anything that isn't finished yet; it's just the convention in the codebase.

Debugs and warnings are... I don't actually use warnings myself. Debugs are for things that could be correct but possibly aren't, and for printing out stuff you need to know for debugging; that's the biggest category by far. I guess warnings are for more explicit "this shouldn't be like this" stuff, possibly caused by things on the Wiktionary side that aren't our own bugs. But it's vague, yeah.

The sortid is just something I threw in there so that the log files have the same error kind next to each other when sorted; it's just an arbitrary string. I usually put in the file name, the line number (approximate, it doesn't really matter), and recently also the ISO date to really prevent collisions.
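
So a typical call following that convention might look like this (illustrative values only):

```python
# Illustrative only: the sortid keeps identical error kinds adjacent when
# the log file is sorted (file path, approximate line, ISO date).
wxr.wtp.debug(
    "unexpected template in gloss section",
    sortid="extractor/de/gloss/42/20230920",
)
```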

@empiriker
Contributor Author

I have moved the list of German form table templates to a different file. That makes it a little clearer, I think.

I think it will be useful to have this comprehensive list of templates containing form data if we choose to extract it later. Though there is an argument to be made that the English Wiktionary edition most likely already includes all this form data, so it deserves some investigation whether extracting form data from each Wiktionary edition is warranted.

@empiriker
Contributor Author

Since there was no further comment, I decided to keep the test for parse_section() that makes the most sense to me.

@xxyzz Any objection to merging this pull request?

@kristian-clausal
Collaborator

xxyzz is on vacation right now, but he'll be back at some point. I haven't kept up with what's happening in these extractors because I've been hitting my head against the wall trying to fix elusive, non-deterministic Lua bugs, but I had a small comment on one of the tests: when you use "gloss1" for each gloss in separate sections, you can't actually tell whether the test fails in cases where the data from one section is accidentally contaminating another. This is unlikely, but possible if something goes spectacularly wrong. Otherwise, what I saw looks fine; de.wiktionary has different testing needs, as do all the other editions, so all we can do is trust that those who implement the extractors know those idiosyncrasies and differences best.
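
To make the concern concrete, here is a made-up fixture (not the actual test data) showing why distinct strings matter:

```python
# Made-up illustration, assuming page_data holds the entries produced from
# two sections of one page. With identical gloss strings, a bug that leaks
# section 1's gloss into section 2 would still pass:
bad_expected = [
    {"senses": [{"glosses": ["gloss1"]}]},
    {"senses": [{"glosses": ["gloss1"]}]},
]
# Distinct strings make the same contamination bug fail:
good_expected = [
    {"senses": [{"glosses": ["gloss1"]}]},
    {"senses": [{"glosses": ["gloss2"]}]},
]
```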

@empiriker
Contributor Author

empiriker commented Oct 5, 2023

@kristian-clausal Fair point. And that actually solves my philosophical problem with the test. 😄 While it still includes the extract_glosses function in its test scope, it at least tests something that unit testing extract_glosses alone will never cover: how its results get distributed to the different page_data items.

@kristian-clausal kristian-clausal merged commit ca9e913 into tatuylonen:master Oct 5, 2023
3 checks passed
@kristian-clausal
Collaborator

The commits do not seem to touch anything outside of the de.wiktionary extractor stuff, so as long as the tests pass I see no problem merging if you feel like you've reached a milestone.

wiktextract/extractor/de/page.py (resolved review thread)
@@ -0,0 +1,119 @@
[
"Deutsch Substantiv Übersicht",
Collaborator

I'd prefer not to include lots of hard-coded data at the beginning; I think this long template list is unnecessary, for these reasons:

  • most of these can be found with a simple template_name.endswith("Übersicht") check (see the sketch after this list)
  • these table templates won't be pre-expanded, so they don't need to be included in the do_not_pre_expand parameter. If they are pre-expanded, then the problem is that the Wtp.analyze_templates() function marks too many templates for pre-expansion; it should only be called by the English Wiktionary code, and in the future we should make the pre-expand rules configurable, e.g. only check whether the template contains a list and ignore other cases.
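
A sketch of the suggested check (the function name is hypothetical):

```python
def is_uebersicht_template(template_name: str) -> bool:
    # Hypothetical helper: most German form-table templates end in
    # "Übersicht" (e.g. "Deutsch Substantiv Übersicht"), so a suffix check
    # replaces the long hard-coded list; exceptions get handled explicitly.
    return template_name.endswith("Übersicht")
```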

I'd also recommend taking a look at the newer French extractor code, which is simpler than the English extractor and doesn't use much hard-coded data or regex.

@@ -0,0 +1,72 @@
# Export German Wiktionary language data to JSON.
Collaborator

The languages JSON files are created from the code here: https://github.com/tatuylonen/wiktextract/tree/master/languages

This code should be moved there and called from the get_data.py file.
