
Release Proper Nouns as Separate WordNet #1126

Open
jmccrae opened this issue Oct 25, 2024 · 11 comments

@jmccrae
Member

jmccrae commented Oct 25, 2024

I propose that, starting with the 2025 release, we remove all instance hyponyms from the main release. They would instead be published as a separate release that is closely linked to Wikidata. I will complete a manual linking between Wikidata and OEWN to enable this.

This issue tracks progress on that work.

Any advice on how best to do this would be welcome.

@goodmami
Member

> Any advice on how best to do this would be welcome.

Not sure about "best", but one option is a WN-LMF 1.1+ lexicon extension. We have also discussed creating ILIs that are not CILI for domain-specific identifiers.

@jmccrae
Member Author

jmccrae commented Nov 29, 2024

I propose that the 2025 release consists of three separate versions (XML files)

  • Open English WordNet (oewn:2025) - all common nouns, verbs, adjectives and adverbs (~100k synsets)
  • Open English Termnet Mini (oewn:2025-terms-mini) - roughly the same as the current release (~120k synsets)
  • Open English Termnet Full (oewn:2025-terms-full) - Adding entities from Wikidata such as people, locations, etc. (~20M synsets)

I like @goodmami's suggestion that we use lexicon extensions, but also I wonder if just releasing these as a single stand-alone file would not be easier.

For domain-specific identifiers, we will use Wikidata, as we already have some coverage of these Q-identifiers.
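
As a purely illustrative sketch of what this split could look like from a user's point of view in the Wn package (the oewn:2025* identifiers come from the proposal above and are not published packages):

```python
# Hypothetical sketch: the identifiers below are the ones proposed in this
# comment and do not exist as published packages.
import wn

wn.download('oewn:2025')               # common nouns, verbs, adjectives, adverbs
# wn.download('oewn:2025-terms-mini')  # adds proper nouns, roughly today's coverage
# wn.download('oewn:2025-terms-full')  # adds Wikidata entities (people, places, ...)

en = wn.Wordnet('oewn:2025')
print(len(en.synsets()))               # ~100k synsets under the proposal
```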

@jmccrae jmccrae added this to the 2025 Release milestone Nov 29, 2024
@arademaker
Member

Regarding Wikidata links, the current plan seems to be to map synsets to Q items. What about the lexical items from Wikidata? I believe this is a space where we can also contribute.

@goodmami
Member

For the ili, maybe we can just use the Q identifiers from Wikidata instead of assigning new CILI ids for each?

> I like @goodmami's suggestion that we use lexicon extensions, but also I wonder if just releasing these as a single stand-alone file would not be easier.

For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon.

For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions.

@jmccrae
Member Author

jmccrae commented Dec 2, 2024

> For the ili, maybe we can just use the Q identifiers from Wikidata instead of assigning new CILI ids for each?

Yes, we should probably start to use Q IDs, at least for any proper nouns.

> I like @goodmami's suggestion that we use lexicon extensions, but also I wonder if just releasing these as a single stand-alone file would not be easier.
>
> For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon.
>
> For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions.

I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller.

@goodmami
Member

goodmami commented Dec 3, 2024

> I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller.

Fair point. A reduced file size isn't the main point; in any case, the external entities do not get created in the Wn database, as they should already exist from the original lexicon. Their only purpose is to make the XML IDREFs resolve within the file.

A more important use case is when there is more than one additional dataset, for example if we break Wikidata up into separate files for people, places, etc., or if we add another source entirely. The more datasets there are, the less appealing it becomes to pre-compile single-file variants of OEWN for every combination.

It does seem convenient to have pre-compiled versions of the likely important combinations.

@fcbond
Member

fcbond commented Dec 4, 2024 via email

@jmccrae
Member Author

jmccrae commented Dec 4, 2024

> Hi, I think it is a good idea to use the Q IDs, and maybe even to allow more (like geonames identifiers).

One of the advantages of linking to Wikidata is that GeoNames and other databases are then linked from it. As most of the locations in WordNet can be mapped to Wikidata, I don't think we will need to link to two databases. For the few that aren't linked, I will probably try to create Wikidata pages with links to GeoNames (like Sealyham, which has about 20 houses).

> In practice, I am not sure how we would manage this. I don't think we can use the same field, as there will be some words that have both an ILI and a Q ID, and people should be able to access them with either. Maybe we have a small set (starting with ILI and Q IDs) and distinguish them with the first letter (i or q)? In the wordnet database, presumably they would then have separate fields, and I guess also in the schema?

Yes, it may be a good idea to add a wikidata field to the schema as we have an ILI field. This is essentially what we are doing in OEWN internally.
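
As a rough illustration of the "distinguish by the first letter" idea, a small helper along these lines could tell the two identifier kinds apart; this is only a sketch of the proposal, not an agreed format.

```python
# Sketch of the proposed first-letter scheme; identifier formats are
# illustrative (CILI ids look like "i35545", Wikidata ids like "Q42").
def identifier_kind(ident: str) -> str:
    """Classify an interlingual identifier by its leading letter."""
    head, rest = ident[:1].lower(), ident[1:]
    if head == 'i' and rest.isdigit():
        return 'cili'
    if head == 'q' and rest.isdigit():
        return 'wikidata'
    return 'unknown'

assert identifier_kind('i35545') == 'cili'
assert identifier_kind('Q42') == 'wikidata'
```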

@fcbond
Member

fcbond commented Dec 4, 2024

I want to link to GeoNames so that we can access locations not in WordNet :-). So for your goal of removing instances this is fine, but for the longer-term goal of covering everything everywhere, I would like to plan ahead.
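
On the Wikidata-to-GeoNames hop mentioned above, here is a sketch of pulling a GeoNames identifier from a linked Wikidata item via the public wbgetclaims API (P1566 is Wikidata's GeoNames ID property); whether a given item actually carries that link is up to Wikidata.

```python
# Sketch: fetch the GeoNames ID (property P1566) attached to a Wikidata item.
import requests

def geonames_id(qid):
    """Return the GeoNames ID linked from a Wikidata item, or None."""
    resp = requests.get(
        'https://www.wikidata.org/w/api.php',
        params={'action': 'wbgetclaims', 'entity': qid,
                'property': 'P1566', 'format': 'json'},
        timeout=10,
    )
    claims = resp.json().get('claims', {}).get('P1566', [])
    for claim in claims:
        datavalue = claim.get('mainsnak', {}).get('datavalue')
        if datavalue:
            return datavalue['value']
    return None
```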

@jmccrae
Member Author

jmccrae commented Dec 4, 2024 via email

@goodmami
Member

goodmami commented Dec 5, 2024

> @goodmami How hard would it be to create a merged DB by loading an extension and then dumping the combined lexicon? [...]

Hmm, probably not as hard as exporting an extension (which seemed non-trivial when I thought about it a few years ago: goodmami/wn#103), but it would still require some code changes.
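
For reference, a rough sketch of the merge-and-dump workflow being discussed; as noted, exporting the combined lexicon from an extension is not currently supported by Wn (goodmami/wn#103), so treat this as pseudocode for the intended behaviour, with hypothetical file names.

```python
# Pseudocode sketch only: Wn does not yet support dumping a lexicon merged
# with its extension, and the file names below are hypothetical.
import wn

wn.add('oewn-2025.xml')          # base lexicon
wn.add('oewn-terms-full.xml')    # extension holding the Wikidata-derived entities

# If a merged export were supported, something like this could write a single
# stand-alone WN-LMF file combining both:
wn.export(wn.lexicons(lexicon='oewn oewn-terms-full'), 'oewn-merged.xml')
```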
