
Release Proper Nouns as Separate WordNet #1126

Open
jmccrae opened this issue Oct 25, 2024 · 15 comments

@jmccrae
Member

jmccrae commented Oct 25, 2024

I propose that, starting with the 2025 release, we remove all the instance hyponyms from the release. These would be made available in a separate release that is closely connected to Wikidata. I will complete a manual linking between Wikidata and OEWN to enable this.

This issue tracks progress on that work.

Any advice on how best to do this would be welcome.

@goodmami
Member

Any advice on how best to do this would be welcome.

Not sure about "best", but one option is a WN-LMF 1.1+ lexicon extension. We have also discussed creating ILIs that are not CILI for domain-specific identifiers.
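
For context, a lexicon extension in WN-LMF 1.1 is declared with a LexiconExtension element whose Extends child names the base lexicon it builds on. A minimal sketch (the lexicon ID, label and version numbers here are illustrative, not actual release values):

<!-- inside the usual <LexicalResource> root -->
<LexiconExtension id="oewn-terms" label="Open English Termnet" language="en"
                  email="..." license="https://creativecommons.org/licenses/by/4.0/"
                  version="2025" url="https://en-word.net">
  <Extends id="oewn" version="2025"/>
  <!-- External* stubs for base-lexicon items referenced here, then the new entries and synsets -->
</LexiconExtension>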

@jmccrae
Member Author

jmccrae commented Nov 29, 2024

I propose that the 2025 release consists of three separate versions (XML files):

  • Open English Wordnet (oewn:2025) - all common nouns, verbs, adjectives and adverbs (~100k synsets)
  • Open English Termnet Mini (oewn:2025-terms-mini) - roughly the same as the current release (~120k synsets)
  • Open English Termnet Full (oewn:2025-terms-full) - adding entities from Wikidata such as people, locations, etc. (~20M synsets)

I like @goodmami's suggestion that we use lexicon extensions, but I also wonder whether just releasing these as a single stand-alone file would not be easier.

For domain-specific identifiers, we will use Wikidata, as we already have some coverage of these Q-identifiers.

jmccrae added this to the 2025 Release milestone Nov 29, 2024
@arademaker
Member

Regarding Wikidata links, the current plan seems to be to map synsets to Q items. What about the lexical items from Wikidata? I believe this is a space where we can also contribute.

@goodmami
Member

For the ili, maybe we can just use the Q identifiers from Wikidata instead of assigning new CILI ids for each?

I like @goodmami's suggestion that we use lexicon extensions, but I also wonder whether just releasing these as a single stand-alone file would not be easier.

For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon.

For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions.
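
To make the "external entities" concrete: they are stub elements (ExternalSynset, ExternalLexicalEntry, ExternalSense, ExternalLemma) that simply repeat the IDs of items already defined in the base lexicon so that IDREFs within the extension file resolve; they carry no content of their own. A rough sketch with placeholder IDs, assuming a new proper-noun synset attached to the base hierarchy:

<!-- stub repeating the ID of a synset that lives in the base lexicon -->
<ExternalSynset id="oewn-00000000-n"/>

<!-- new synset defined by the extension, linked under that base synset (other attributes omitted) -->
<Synset id="oewn-terms-q0000000-n" ili="" partOfSpeech="n">
  <Definition>...</Definition>
  <SynsetRelation relType="instance_hypernym" target="oewn-00000000-n"/>
</Synset>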

@jmccrae
Member Author

jmccrae commented Dec 2, 2024

For the ili, maybe we can just use the Q identifiers from Wikidata instead of assigning new CILI ids for each?

Yes, we should probably start to use Q IDs, at least for any proper nouns.

I like @goodmami's suggestion that we use lexicon extensions, but I also wonder whether just releasing these as a single stand-alone file would not be easier.

For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon.

For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions.

I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller.

@goodmami
Member

goodmami commented Dec 3, 2024

I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller.

Fair point. A reduced file size isn't the main benefit, but in any case the external entities do not get created in the Wn database, as they should already exist from the original lexicon; their only purpose is so the XML IDREFs resolve within the file.

A more crucial use case is if we have more than one additional dataset, for instance if we wanted to break Wikidata up into separate files for people, places, etc., or if we add another source entirely. The more of these there are, the less appealing it becomes to compile single-file variants of OEWN with and without each of them for every combination.

It does seem convenient to have pre-compiled versions of the likely important combinations.

@fcbond
Member

fcbond commented Dec 4, 2024 via email

@jmccrae
Member Author

jmccrae commented Dec 4, 2024

Hi, I think it is a good idea to use the Q IDs, and maybe even to allow more (like geonames identifiers).

One of the advantages of linking to Wikidata is that Geonames and other databases are then linked from it. As most of the locations in WordNet can be mapped to Wikidata, I don't think we will need to link to two databases. For the few that aren't linked, I will probably try to create Wikidata pages with links to Geonames (like Sealyham, which has about 20 houses).

In practice, I am not sure how we would manage this. I don't think we can use the same field, as there will be some words that have both an ILI and a Q ID, and people should be able to access them with either. Maybe we have a small set (starting with ILIs and Q IDs) and distinguish them with the first letter (i or q)? In the wordnet database, presumably they would then have separate fields, and I guess also in the schema?

Yes, it may be a good idea to add a wikidata field to the schema as we have an ILI field. This is essentially what we are doing in OEWN internally.
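
As a rough sketch of how that might surface in the XML (the wikidata attribute below is hypothetical and not part of the current schema; synset IDs, ILIs and other required attributes are placeholders or omitted):

<!-- option 1: a dedicated attribute alongside ili, which would need a schema change -->
<Synset id="oewn-00000000-n" ili="i00000" wikidata="Q60" partOfSpeech="n"/>

<!-- option 2: no schema change, carrying the Q ID in the existing Dublin Core metadata -->
<Synset id="oewn-00000000-n" ili="i00000" dc:source="https://www.wikidata.org/entity/Q60" partOfSpeech="n"/>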

@fcbond
Member

fcbond commented Dec 4, 2024

I want to link to Geonames so that we can access locations not in wordnet :-). So for your goal of removing instances it is fine, but for the longer-term goal of covering everything everywhere, I would like to plan ahead.

@jmccrae
Member Author

jmccrae commented Dec 4, 2024 via email

@goodmami
Member

goodmami commented Dec 5, 2024

@goodmami How hard would it be to create a merged DB by loading an extension and then dumping the combined lexicon? [...]

Hmm, probably not as hard as exporting an extension (which seemed non-trivial when I thought about it a few years ago: goodmami/wn#103), but it would still require some code changes.

@rob-ross

rob-ross commented Jan 10, 2025

Has anyone thought about adding a new attribute to explicitly mark proper nouns, adjectives and adverbs? E.g.:

<LexicalEntry id="oewn-Japanese-a">
  <Lemma writtenForm="Japanese" partOfSpeech="a" isProper="True">
    <Pronunciation>ˌdʒæpəˈniːz</Pronunciation>
  </Lemma>

...

I'm a software dev, and experience teaches me it's better to be explicit when possible as opposed to implicitly determining something from some other attribute, such as "proper nouns are Lemmas without hyphens that are capitalized". Except for the exceptions... :)

This attribute could also go in the LexicalEntry tag. A simpler approach would be to introduce a new partOfSpeech character for proper nouns, but I suppose this would require a lot of changes to existing code so it's probably not feasible? You could probably automate back-filling the new attribute with the correct value and handle the exceptions with some manual analysis.

Then again, I'm guessing the universe of possible proper nouns is larger than the current word count, so I do see value in moving proper words to a separate "expansion pack." But if you could tag proper words in the main xml file, tools could use this to filter them out if not wanted.

Just my suggestion. :)

  • Rob

@jmccrae
Member Author

jmccrae commented Jan 13, 2025

I agree that 'proper noun' should probably be its own category; we cannot use capitalization as there are difficult cases like 'A-bomb'. We can either do this with an extra property as proposed above (which is similar to how we already model postpositive adjectives), or another option is to create a new lexfile for proper nouns.
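
For readers unfamiliar with the postpositive-adjective precedent: that is currently recorded as an optional adjposition attribute on the Sense element, so a proper-noun marker could take a similar per-sense or per-lemma form. A sketch with placeholder IDs:

<!-- existing mechanism: adjective position flagged on the Sense -->
<Sense id="oewn-example-sense" synset="oewn-00000000-a" adjposition="ip"/>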

@rob-ross

rob-ross commented Jan 15, 2025

There are "core" proper nouns that are used often enough to justify keeping them in the standard xml file. E.g. "English." We can use tools such as Google's Ngrams to determine frequency of use and keep the "most often used" proper nouns.

Edit: the OP topic is for removing "instance hyponyms." (I'm new to the terminology but learning.) So we're not talking about removing ALL proper nouns right, just a subset of them? Would "English" be considered an instance hyponym of "language?" If so, I'd revise the above to state "there are core instance hyponyms that should be kept in the main xml file."

@jmccrae
Member Author

jmccrae commented Jan 17, 2025

There are "core" proper nouns that are used often enough to justify keeping them in the standard xml file. E.g. "English." We can use tools such as Google's Ngrams to determine frequency of use and keep the "most often used" proper nouns.

Frequency is difficult to use as a criterion here. Firstly, it is hard to decide where to draw the line. Secondly, many terms have different senses. For example, 'Smith' is the lemma of quite a large number of terms:

https://en-word.net/lemma/Smith

Most of these are probably too obscure to be included.

My opinion is that the cleanest approach is just to remove proper nouns entirely from the core resource.

Edit: the OP topic is for removing "instance hyponyms." (I'm new to the terminology but learning.) So we're not talking about removing ALL proper nouns right, just a subset of them? Would "English" be considered an instance hyponym of "language?" If so, I'd revise the above to state "there are core instance hyponyms that should be kept in the main xml file."

I see "instance hyponyms" and "proper nouns" as the same thing

Instance Hyponym: A relation between two concepts where concept A (instance_hyponym) is a type of concept B (instance_hypernym), and where A is an individual entity
Proper Noun: A proper noun is a noun that identifies a single entity

This is not currently the case in the model, with many proper nouns being non-instance hyponyms (such as 'English'). I think we should change the relation types in this case to instance hypernymy.
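
Concretely, that change is just a different relType on the existing SynsetRelation; both values are already defined in WN-LMF (synset IDs below are placeholders):

<!-- current modelling: 'English' as an ordinary subclass of 'language' -->
<SynsetRelation relType="hypernym" target="oewn-00000000-n"/>

<!-- proposed: 'English' as a named individual -->
<SynsetRelation relType="instance_hypernym" target="oewn-00000000-n"/>

The inverse link on the 'language' synset would likewise change from hyponym to instance_hyponym.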
