Add pydantic models and parse_page for Spanish Wiktionary #392

empiriker · 2023-10-31T16:05:24Z

Hi there,

I started to add basic support for the Spanish Wiktionary and attempted to integrate with pydantic for data validation and schema generation.

Why I think pydantic might be a good addition to the project:

Allows creation of dynamic json schemas directly from code (also typescript interfaces)
Adds validation and (optionally) type casting to ensure consistency of our output
Can flexibly change and adapt to different Wiktionary editions (via creating or adapting pydantic classes via inheritance)
Replaces the current type annotations -> replaces something we already do with something that does even more

The potential downsides:

More boilerplate code
Performance impact (?)
Dealing with validation errors
Adds another dependency

At the moment, I have only setup the basic pydantic models and parse_page function for the Spanish Wiktionary. I will still need to build a little more to properly assess whether pydantic really fits the use case and is easy to work with.

For now, I will mark this as a draft and look forward to hear your feedback and answer potential questions you have regarding pydantic.

src/wiktextract/extractor/es/models.py

empiriker · 2023-10-31T16:09:26Z

src/wiktextract/extractor/es/models.py

+
+class BaseModelWrap(BaseModel):
+    class Config:
+        extra = "ignore"


If any keys are added to a pydantic class that are not defined in the class definition, they will be ignored.

This shouldn't be ignored, many classes in the current code already has __slots__. And LoggingExtraFieldsModel will also not needed.

I am not quite sure what's the significance of __slots__ in this context but let me explain my reasoning for configuring the base model to ignore extra fields.`

I think there are two goals in this context:

The pydantic models should describe the output accurately.

We want to capture as much data as possible.

In the German extractor, for example, I used the following pattern to extract the references to an example from a wikicode template:

for key, value in template_node.template_parameters.items(): if isinstance(key, str): reference_data[key.lower()] = clean_node(wxr, {}, value)

In this example, the expected keys aren't defined anywhere in our code but rely implicitly on the arguments of that particular template. In this particular case, that makes a lot of sense since the reference template is quite detailed and exactly what we would like to capture.

However, when defining the json schema or using pydantic without the option extra = "ignore", there is always the risk that our output has keys that aren't documented anywhere. That might not be a problem in and of itself but certainly makes this data a lot less valuable. We also wouldn't satisfy goal 1.

With the option extra = "ignore", only keys that are defined in the pydantic model will be output and all others suppressed. This does satisfy goal 1.

However, we might still like to capture these keys (goal 2). For this reason, I am logging all extra fields that our code tries to assign but aren't defined in the model. This will enable us to make a decision on whether or not to include this key in our model.

In short: The combination of ignoring extra fields while logging these in a debug statement will force us to keep the pydantic models (and thus our output documentation) in sync with what we do in our code. I think this is the main benefit of using pydantic over manually maintaining json schemas (as currently done for the French extractor).

The Config class is deprecated, ConfigDict or model class kwargs should be used: https://docs.pydantic.dev/latest/concepts/config/

The default value of extra is ignore, maybe the base model should use forbid.

I read the doc again and find extra only controls model initialization, it doesn't check undefined fields after the initialization. Since pydantic model classes are defined now, the extractor code shouldn't use dict like assignment now, they should use class attributes so the update and get methods could be removed. Update: I see... our old code like clean_code uses .get(), we should update clean_code then. But if we allow arbitrary fields to be added that kind of miss the point of using pydantic...

Isn't the JSON schema file generated from the defined model classes? How could they contain fields created dynamically at runtime?

As for your German data example, I think we should not check the fields(or generate JSON schema) not created in our code.

Thanks for the thorough review and pointing my in the right direction!

The default value of extra is ignore, maybe the base model should use forbid.

With the option forbid, we would get an exception every time there is an extra field. This would mean that we need to add error handling every time we instantiate a pydantic model. I think that would be extremely cumbersome and I didn't find a good way to catch this exception in an abstract way... There is an interesting PR suggesting a fourth option warn which would be ideal for our case.

Isn't the JSON schema file generated from the defined model classes? How could they contain fields created dynamically at runtime?

You're right. The JSON schema file will only contain fields defined in the model classes.

As for your German data example, I think we should not check the fields(or generate JSON schema) not created in our code.

The main question is thus: Do we want that our JSON schemas strictly describe our output, i.e. the output contains only keys that are described in our JSON schemas? Or do we not mind if the output contains extra fields?

My preference and also the PR's is the first option.

Sorry I missed this comment. If the code is initializing a pydantic model with undefined fields and get exception, that means the code and the model doesn't match and that's the whole point of validation, right? The only way we could get this exception is from our own mistake but not from the changes in Wiktionary.

~~OK, I guess it's fine to use the default ignore setting, since the undefined data are not added.~~

No, I think it's better to use forbid, the exception will be caught in page_handler() and added to the error log.

src/wiktextract/extractor/es/models.py

src/wiktextract/extractor/es/page.py

empiriker · 2023-10-31T16:15:03Z

I have added some code comments to highlight some of the features of pydantic. Don't hesitate to ask questions if things are unclear.

xxyzz · 2023-11-01T01:54:38Z

I think it's fine as long as pydantic and the validation code don't get in the way and don't hurt performance. I'm kind worried about overriding Python std library's update method. pydantic should be able to check what data are added to the WordEntry class, if it can't validate dict.update, maybe update shouldn't be used.

empiriker · 2023-11-23T13:09:22Z

Hi there! It took longer than expected to come back to this but the next few days should see some progress.

Let me comment on the outstanding pydantic questions first.

I think it's fine as long as pydantic and the validation code don't get in the way and don't hurt performance. I'm kind worried about overriding Python std library's update method. pydantic should be able to check what data are added to the WordEntry class, if it can't validate dict.update, maybe update shouldn't be used.

I am not overriding Python's update method, I am simply adding a new method to the pydantic data model. In the most recent version, I could simplify a b it since, indeed, by default pydantic already validates the input when using setattr.

I also had to add the methods get, __getitem__ and __setitem__ to the base model. This doesn't override anything but makes the pydantic classes work with the programming pattern we used when dealing with defaultdicts.

empiriker · 2023-11-23T14:12:04Z

I am now a bit stuck since there seems to be problem with wikitextprocessor's parser. In the parse tree, there is some gloss content missing that is present in the wikicode. I added a basic test file to illustrate this.

Here the same example from amor:

;1 {{sentimientos}}: {{plm|sentimiento}} [[afectivo]] de [[atracción]], [[unión]] y [[afinidad]] que se experimenta hacia una persona, animal o cosa

In the Spanish Wiktionary, senses are defined as list items starting with ;, followed by a number, then optionally a sense tag and then the actual gloss after :. However, the parser seems to ignore everything after the colon.

I am not too familiar with the parser itself. If you could point me to the right spot where to fix this, I would be grateful. Thanks!

Here bigger example for ayudar:

;1: {{plm|contribuir}} [[esfuerzo]] o [[recurso]]s para la [[realización]] de algo.
{{sinónimo|asistir|auxiliar|colaborar|favorecer|arrimar el hombro|echar una mano|tender la mano|echar un cable|echar un capote|ir a una|socorrer la plaza}}
{{antónimo|estorbar}}
:*'''Ejemplos:'''
::"este comedio llegaron quatro naos de Portugal, e mosén Charles rogó al capitán que le ''ayudase'' a tomarlas, por quanto heran henemigos de Françia, que ayudavan a los yngleses. " Díaz de Games, Gutierre (1994 [1449]) ''El Victorial''. Madrid: Taurus, p. 429

;2: Por antonomasia, [[cooperar]] a que alguno [[salir|salga]] de una [[situación]] [[dificultoso|dificultosa]]
{{sinónimo|amparar|asistir|auxiliar|socorrer}}
{{antónimo|abandonar|desamparar}}
:*'''Ejemplos:'''
::"Cómo el rey don Pedro se fue del reyno, buscando quien le ''ayudase'', por quanto la mayor parte de los cavalleros de Castilla se alçavan contra él. " Díaz de Games, Gutierre (1994 [1449]) ''El Victorial''. Madrid: Taurus, p. 223

Senses always seem to start as list items with ;. Each sense is then followed by several optional data types. Here synonyms, antonyms, examples.

Intuitively, we would parse this to one big list with two list items ;1: ... and ;2: .... Currently, however, all lines would be parsed as sibling nodes.

It's probably not feasible to change the parsing logic to reflect the intuitive ordering and I will have to deal with this in code. Yet, I would appreciate any suggestion on how best to deal with this.

The relevant documentation from Wiktionary is here: Page structure

Many thanks in advance!

xxyzz · 2023-11-24T02:26:49Z

Nodes after ":" are saved to the WikiNode.definition attribute. It's created at here: https://github.com/tatuylonen/wikitextprocessor/blob/de5b4c77014c7bb9e1f710a00c8779478e3ac5c3/src/wikitextprocessor/parser.py#L694-L705

empiriker · 2023-11-24T15:08:19Z

Thanks, @xxyzz. I haven't worked with the WikiNode.definition before. Always great to learn something new!

I have added gloss extraction now and so far am quite happy with using pydantics to build the Spanish extractor. If you agree with my approach, I would consider it ready to be merged.

I will keep working on extracting more lexical data. Either in this PR or the next one.

xxyzz · 2023-11-27T03:00:27Z

src/wiktextract/extractor/es/models.py

+
+class BaseModelWrap(BaseModel):
+    class Config:
+        extra = "ignore"


The Config class is deprecated, ConfigDict or model class kwargs should be used: https://docs.pydantic.dev/latest/concepts/config/

The default value of extra is ignore, maybe the base model should use forbid.

I read the doc again and find extra only controls model initialization, it doesn't check undefined fields after the initialization. Since pydantic model classes are defined now, the extractor code shouldn't use dict like assignment now, they should use class attributes so the update and get methods could be removed. Update: I see... our old code like clean_code uses .get(), we should update clean_code then. But if we allow arbitrary fields to be added that kind of miss the point of using pydantic...

Isn't the JSON schema file generated from the defined model classes? How could they contain fields created dynamically at runtime?

As for your German data example, I think we should not check the fields(or generate JSON schema) not created in our code.

xxyzz · 2023-11-27T03:02:46Z

src/wiktextract/extractor/es/models.py

+class LoggingExtraFieldsModel(BaseModelWrap):
+    @model_validator(mode="before")
+    def log_extra_fields(cls, values):
+        all_allowed_field_names = {key for key in cls.__fields__.keys()}


__fields__ is deprecated, model_fields_set should be used.

Maybe you'd like to make this debug log disabled by default, because there will be lots of debug messages spamming the log.

Thanks again! It's model_fields that we need here. No idea, why they don't mention this property in the doc but it's clearly the replacement for __fields__.

Maybe you'd like to make this debug log disabled by default, because there will be lots of debug messages spamming the log.

Interestingly, there shouldn't be a lot of spam from this logging call. At least how I see it used during development and maintaining of the extractors:

If we set a field manually that isn't defined in a model, that's clearly an oversight and the model should be updated. To not miss this, the log can be useful and would disappear afterwards.

for key, value in template_node.template_parameters.items(): if isinstance(key, str): reference_data[key.lower()] = clean_node(wxr, {}, value)

In a case like above, the code suggests that the decision has been made to indeed capture all the fields. Than it's just a matter of running the extractor code once and update the models accordingly. Again, the debug messages would disappear afterwards.

If only same keys from that template_node are supposed to be extracted, then this should be done explicitly in code. Again not that many/no debug messages.

Undefined field shouldn't be allowed to be assigned. In your template parameters case, we could defined that field in its parent model as dictionary or Any.

I would like to avoid not specifying the keys in a dictionary. IMO, the JSON schema is there to inform users what to expect. Extracting data into an unspecified dict is perhaps still better than not extracting but it's immensely yes useful to many applications without knowledge what's in the dict.

I feel like we are still not on the same page. Probably due to a lack of good examples. Perhaps looking at my WIP branch es-ahead will help:

Here I apply the template parameters case to parse a reference template. When I encounter an unknown key, I get a debug message and can update the pydantic model.

Obviously, there are other ways to achieve the same effect. I thought it intuitive to also delegate the task of notifying us about the ignored fields to pydantic, since (with the option extra="ignore) it checks for unspecified fields anyway. But if you think it's a bad idea, I can get rid of LoggingExtraFieldsModel.

I didn't know you would break the template arguments that much... I don't know how pydantic model could fit in this case, maybe it shouldn't be used for dynamically created data because that code will break every time when the template is changed. I'm not sure... You could experiment with the code and see what is best for your usage cases and also easy to maintain.

Other extractors mostly use a ref field for example source, maybe you could only define ref field in model and copy template arguments to a dictionary field which won't be checked. But that's just my suggestion, you could use other implementations.

I think the most benefit we could get from pydantic is the validation, so random things won't be added from somewhere no one is watching. If arbitrary keys are still be able to be added we're no better than before. Some fields like saving template arguments should be allowed to do that but not other models/fields.

So even though the code looks like it's adding data to undefined field but in reality it is not. If you use pydantic this way, the code will be much longer and require more efforts to maintain, and therefore it's worse then our current defaultdict approach. I think you have to make compromise to the JSON schema, I don't think you could use dynamic model creation. Don't use pydantic model for dynamically created data is the best solution IMO, otherwise you might need more complex code and it's not worth it.

And we already have a JSON data structure browser page on kaikki.org.

@xxyzz I am very grateful for your comments and that you're really thinking this through alongside with me but I have to admit that right now I am also a bit lost what your exact worries are. So I am hesitant to change something in the code right now without knowing exactly what I am addressing.

Perhaps, I can try to address your points in writing and you could specify what needs to be addressed/changed in my proposal?

Just to make sure we are on the same page, here are the three different options how a pydantic model deals with extra values:

allow - Allow any extra attributes.

forbid - Forbid any extra attributes.

ignore - Ignore any extra attributes.

With the default value of extra="ignore", your example does result in:

from pydantic import BaseModel, ConfigDict, model_validator class MyBaseModel(BaseModel): model_config = ConfigDict(validate_assignment=True) username: str m = MyBaseModel(username="a", password="b") # (A) no error, password gets ignored getattr(m, "password") # (B) AttributeError: 'MyBaseModel' object has no attribute 'password' setattr(m, "foo", "bar") # (C) pydantic_core._pydantic_core.ValidationError: 1 validation error for MyBaseModel foo print(getattr(m, "foo")) # (D) AttributeError: 'MyBaseModel' object has no attribute 'foo'

(A) Works exactly as intended:

password gets ignored (we only get keys in our output that is described in our schema/classes)

We do not get an error and the code just continues executing.

When inheriting from class LoggingExtraFieldsModel(BaseModelWrap), the extra field password would get logged and we could update the schema/class.

And we still get type validation on the defined field username.

(B) and (D) do not worry me:

Instance attributes should be accessed as attributes: m.username

That also makes for a better developer experience since my IDE tells me which attributes exist.

We could, of course, overwrite the getattr function to return None for undefined keys (like defaultdict) but I don't see a point in this.

If absolutely necessary, one could always use try/except.

(C) is in my eyes a bit problematic but solvable:

Indeed, I realize now that my previous example of add_template_params_to_reference() does only work since I defined the dict-like setter using an exception:

class BaseModelWrap(BaseModel): def __setitem__(self, item, value): try: setattr(self, item, value) except ValidationError: pass

Without redefining __setitem__, I would need to use try/catch locally.

This would then mimic the behaviour of extra arguments on class instantiation: Extra arguments would just be ignored but logged, and an update of an existing key with the wrong type would simply not be executed. IMO, the behaviour we want.

I hope that clear up the technical question marks.

I will now try to answer your concerns.

I don't know how pydantic model could fit in this case, maybe it shouldn't be used for dynamically created data because that code will break every time when the template is changed.

No. As I see it, nothing would break since unknown keys don't throw errors (case (A)) or would need to be caught anway (case (C)). If a template would change and introduce a new key, e.g. {template|new=foo}, we would simply not extract the new key and potentially get a debug message.

Maybe you could only define ref field in model and copy template arguments to a dictionary field which won't be checked.

I prefer making the structure explicit but, yes, that is a valid option that we can flexibly choose. For example, if you were to adopt pydantic for the French extractor you could type ref as an iterable type:

class Example(BaseClass): ref: dict = None

or

class Example(BaseClass): ref: Dict[str,str] = None

Or if you wanted to define some keys but not all you could overwrite the "extra" configuration:

class Reference(BaseClass): model_config = ConfigDict(extra="allow") title: str class Example(BaseClass): ref: Reference ref = Reference(title="Hamlet",author="Shakespeare") print(ref.model_dump()) # {"title":"Hamlet", "author":"Shakespeare"}

If you use pydantic this way, the code will be much longer and require more efforts to maintain, and therefore it's worse then our current defaultdict approach.

I fail to see it. It might a little bit more work initially but, IMO, is much more maintainable than manual JSON schemas.

Once some base classes are defined for the Spanish extractor, they can also be reused for other extractors via inheritance while still easily specifying where they differ, e.g.:

class PronunciationBase(BaseClass): ipa: str class SpanishPronunciation(PronunciationBase): syllabic: List[str] = Field(default=[], description="Syllabic transcription") class RussianPronunciation(PronunciationBase): stress: str = Field(None, description="Word with stress markings")

Or am I missing something?

dynamic model creation
I wasn't aware of that. Thanks! It's a cool feature and we could theoretically use it to output a dynamic schema at the end of each extraction run. But it would be a hassle with multithreading and add very little benefit in the end...

And we already have a JSON data structure browser page on kaikki.org.

Very useful indeed! But as I understand it, this is more exploratory in its nature? It still doesn't replace a proper and detailed JSON schema.

Perhaps lastly, let me explain my use case: I want to convert the JSON Schema for each extractor to a Typescript type (either from the schema or from pydantic directly (pydantic-to-typescript)) and then use these types to build a frontend, flexibly displaying lexical information extracted from wiktextract. Having these types during development will make things so much easier.

This is why I think our schema should be as detailed as possible. I am almost certain, I wouldn't use a field like ref: Dict[str,str] = None since I had no idea what's in there. But in

class Reference(BaseClass): title: str

I might at least use the title field.

I hope that cleared things up.

Where do you see blocking issues at the moment?

Our main disagreement is our priority. Our priority is getting more data from the dump file, but your priority is getting JSON schema. Your code can't adapt to changes on Wiktionary, it needs constant maintenance to sync models with templates to add new template arguments. Because you're using pydantic model on data that shouldn't be validated.

You can get better JSON schema by running through the generated jsonl file like what kaikki.org did. And if you're trying to use dynamic model creation, that'd be just like scanning the jsonl file, but more meaningless validation and slower.

Back to my example code, all 4 errors are bugs in our Python code and shouldn't be ignored. Set forbid, we get error log with one line code. You have to loose the validation on data you don't have control but have to add and harden the validation on data you do control. I think you're pursuing a goal that pydantic can't fulfill. But I haven't use pydantic before, I could be wrong.

I think you have a minor misunderstanding about the ref field in other extractors, it is a str, the plain text of the example sentence source. In the French extractor there are no template arguments dictionary but the English extractor has many. dict[str, str] or Any is exactly what I suggested before but you choose to use a pydantic model. It's even more easier if you don't translate the arguments to English but just copy them.

We could get this pr merged if you could update the typing, some are still using Dict. I think process_pos_block() could get index out of range error because page_data is empty.

Okay. Got it.

I think there is a way to have our cake and eat it when it comes to template arguments dictionaries. But let's deal with it when I get to it. If it won't work out then, I am fine with using dict[str,str] for these kind of fields.

I think you have a minor misunderstanding about the ref field in other extractors, it is a str, the plain text of the example sentence source.

I am aware of this. At the very beginning when I started contributing here, someone told me that you do not want to enforce a strict schema for all extractors but rather see what makes sense for each Wiktionary edition. So while modelling the new extractors on the existing ones, I didn't think it would hurt to be more specific. In the case of ref, why only extract the plain text if there is more structured data to have? In the German extractor, for example, I have stored the plain text with the raw_ref within a ref dictionary.

We could get this pr merged if you could update the typing, some are still using Dict. I think process_pos_block() could get index out of range error because page_data is empty.

Done. process_pos_block doesn't get an index out of range error since it appends a copy of base_data into page_data before using index access.

Sure, we're not using the same schema but we also don't want to much variations. In the case of quote template arguments, we could get the same result with more stable and clean code by copying them without validation and translation.

src/wiktextract/datautils.py

This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.

xxyzz · 2023-12-01T08:30:52Z

Thanks for your contribution!

kristian-clausal · 2023-12-01T09:59:19Z

Thanks for your contribution from me too!