[pt-PT] Improved .AFF files for both AO45 and AO90 #11

Merged: 2 commits, Jan 29, 2024

Conversation

marcoagpinto
Member

Heya @susanaboatto

The AFF changes added around 1600 verbal forms to pt-PT:

AO45:
3.PTPT_45_new_verbs.txt

AO90:
6.PTPT_90_new_verbs.txt

@marcoagpinto marcoagpinto added the enhancement New feature or request label Jan 23, 2024
@p-goulart
Collaborator

Is this all you need for pt-PT? Or should we be expecting more additions?

@p-goulart p-goulart changed the base branch from main to tagging/clitics January 29, 2024 08:51
@marcoagpinto
Member Author

Heya, @p-goulart

For now, this is all I have changed in the .aff.

I will add words to the .dic in the future, and I still need to compare the pt-PT wordlist with the pt-BR one, as I mentioned before.

@marcoagpinto
Member Author

This change adds around 1600 verb forms to pt-PT, so many valid words will no longer be flagged as typos while writing text.

@p-goulart
Collaborator

Sure, but the most important question is whether it outputs largely the same forms as those output by the PoS tagger. Adding forms with mo-l[oa]s? to one suffix flag may not add that much coverage.

@marcoagpinto
Member Author

I don't understand what you mean.

It adds the words I placed in the first comment:
AO45:
3.PTPT_45_new_verbs.txt

AO90:
6.PTPT_90_new_verbs.txt

@marcoagpinto
Member Author

They would appear as typos, and now that should no longer happen.

@p-goulart
Collaborator

Outputting new forms is good, but the important thing for the work we are doing now is making sure that pt-PT verb forms are the same as those output by the PoS tagger dictionary.

As we discussed in the past, we are changing our tagger to include enclitic pronouns as a part of the verb forms. The string ama-te will be a single verb form, ama-te, tagged V$some_tags:PP$some_tags.

The Hunspell .aff files must output the same forms. Otherwise there will be a discrepancy between the speller's verb forms and those of the tagger, which will cause inconsistencies in tagging and spellchecking.
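
The consistency requirement described above can be sketched as a set comparison. This is a minimal illustration only: the function name `compare_forms` and the toy word lists are assumptions made for the example, not the project's actual scripts or real output from either tool.

```python
def compare_forms(speller_forms, tagger_forms):
    """Return the forms each side has that the other lacks."""
    speller = set(speller_forms)
    tagger = set(tagger_forms)
    return {
        "missing_from_speller": sorted(tagger - speller),
        "missing_from_tagger": sorted(speller - tagger),
    }


# Hypothetical toy data, not real output from the speller or the tagger.
speller_forms = ["ama", "amas", "abolimo-la"]
tagger_forms = ["ama", "amas", "ama-te", "abolimo-la"]

report = compare_forms(speller_forms, tagger_forms)
print(report)
```

An empty list on both sides would mean the two sources agree on the forms they produce for the sampled verbs.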

@marcoagpinto
Member Author

abolimo-la
abolimo-las
abolimo-lo
abolimo-los

[Screenshot, 2024-01-29: Análise de Texto - LanguageTool]

I thought @susanaboatto was working on it?

For example, should abolimo-la appear in the future as VMIP1P0X:PP3FSA00?

@p-goulart
Collaborator

Yes, we are working on it, on the branch that I've just changed this PR to point to. Those two things must happen in parallel.

@marcoagpinto
Member Author

What shall I do then?

The words my patch adds are valid, but I don't know how to produce a tag like VMIP1P0X:PP3FSA00.

Susana is the right person to help with that.

@marcoagpinto
Member Author

In simple words,

abolimo-la
abolimo-las
abolimo-lo
abolimo-los

will no longer appear as typos, but they won't show tags such as VMIP1P0X:PP3FSA00.

@marcoagpinto
Member Author

@p-goulart
Will you take care of this?

Right now, I can't focus on more flags for the .aff.

@p-goulart
Collaborator

p-goulart commented Jan 29, 2024

The words here are fine in and of themselves, I'm just pointing out that they are not all we need.

If I run a simple unmunch test on a basic pt-PT verb like amar/XYPL, a number of expected forms are missing from the output, e.g.:

ama-te
ama-se
ame-se
ama-o

etc.

You don't need to do anything with the PoS tags. The only thing that is required is for the pt-PT speller scripts to output the same forms as the PoS tagger scripts.
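
One rough way to spot this kind of gap is to look at which enclitic endings appear at all among the generated forms. The sketch below assumes the unmunch output has already been read into a list of strings; the helper name `enclitic_endings` and the `generated` sample are hypothetical and do not reproduce the real output for any verb.

```python
def enclitic_endings(forms):
    """Collect whatever follows the first hyphen in each hyphenated form."""
    return {form.split("-", 1)[1] for form in forms if "-" in form}


# Hypothetical sample of generated forms (illustrative only).
generated = ["abolimos", "abolimo-la", "abolimo-las", "abolimo-lo", "abolimo-los"]
# Forms named in the comment above as missing.
expected = ["ama-te", "ama-se", "ame-se", "ama-o"]

# Endings that the expected forms need but the generated sample never produces.
missing_endings = sorted(enclitic_endings(expected) - enclitic_endings(generated))
print(missing_endings)
```

This only compares endings, so it is a coarse first check; a full comparison would still need to match complete forms verb by verb.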

@p-goulart p-goulart merged commit a7a53c1 into tagging/clitics Jan 29, 2024
@p-goulart p-goulart deleted the lt_marcoagpinto_20240123_1145 branch January 29, 2024 09:40
@marcoagpinto
Member Author

Ahhhhh....

They are missing?

ama-te
ama-se
ame-se
ama-o

I will work on it in a few days.

Thanks for letting me know.

@p-goulart
Collaborator

Yes. These forms currently work only incidentally, because (for example) both ama and te exist as separate words. But if the speller dictionary's inflector doesn't output them, they won't be considered correctly spelt, which is an issue, since the new tokeniser will recognise ama-te as a single token.

I will attach here a list of forms needed for a regular verb. (This doesn't include a bunch of irregular verbs that are simply not handled by the pt-PT .aff files.)
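
The difference between the two behaviours can be illustrated with a deliberately simplified model. Both functions and the two-word dictionary below are assumptions made for the example; the real tokeniser and speller are more involved than a plain hyphen split and a set lookup.

```python
def accepted_when_split(word, dictionary):
    """Simplified old behaviour: each hyphen-separated part is checked on its own."""
    return all(part in dictionary for part in word.split("-"))


def accepted_as_single_token(word, dictionary):
    """Simplified new behaviour: the whole string must be a known form."""
    return word in dictionary


# Hypothetical dictionary that lacks the combined form "ama-te".
dictionary = {"ama", "te"}

split_result = accepted_when_split("ama-te", dictionary)
single_result = accepted_as_single_token("ama-te", dictionary)
print(split_result, single_result)
```

Under these assumptions the word passes only while it is split at the hyphen, which is why the combined forms have to be generated by the .aff rules once ama-te is treated as one token.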

@marcoagpinto
Member Author

Thanks, that way I can focus on it better.

@p-goulart
Collaborator

verb-test-out.csv

@marcoagpinto
Member Author

Ahhhhhhh

@p-goulart
Collaborator

I can also attach here the files for other verbs. We'll need stuff like qui-lo, pu-lo, qué-lo, soubé-lo, etc.

@marcoagpinto
Member Author

Sure, I will add the rules bit by bit; I won't do them all at once.
