
Release Proper Nouns as Separate WordNet #1126

Open
jmccrae opened this issue Oct 25, 2024 · 15 comments

@jmccrae
Member

jmccrae commented Oct 25, 2024

I propose that, starting with the 2025 release, we remove all the instance hyponyms from the release. These would be made available in a separate release that is closely connected to Wikidata. I will complete a manual linking between Wikidata and OEWN to enable this.

This issue tracks progress on that work.

Any advice on how best to do this would be welcome.

@goodmami
Member

Any advice on how best to do this would be welcome.

Not sure about "best", but one option is a WN-LMF 1.1+ lexicon extension. We have also discussed creating ILIs that are not CILI for domain-specific identifiers.
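
For context, a lexicon extension in WN-LMF 1.1 is declared with a LexiconExtension element whose Extends child names the base lexicon it builds on. A minimal sketch (the lexicon ID, label and version numbers here are illustrative, not actual release values):

<!-- inside the usual <LexicalResource> root -->
<LexiconExtension id="oewn-terms" label="Open English Termnet" language="en"
                  email="..." license="https://creativecommons.org/licenses/by/4.0/"
                  version="2025" url="https://en-word.net">
  <Extends id="oewn" version="2025"/>
  <!-- External* stubs for base-lexicon items referenced here, then the new entries and synsets -->
</LexiconExtension>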

@jmccrae
Member Author

jmccrae commented Nov 29, 2024

I propose that the 2025 release consists of three separate versions (XML files):

  • Open English Wordnet (oewn:2025) - all common nouns, verbs, adjectives and adverbs (~100k synsets)
  • Open English Termnet Mini (oewn:2025-terms-mini) - roughly the same as the current release (~120k synsets)
  • Open English Termnet Full (oewn:2025-terms-full) - adding entities from Wikidata such as people, locations, etc. (~20M synsets)

I like @goodmami's suggestion that we use lexicon extensions, but I also wonder whether just releasing these as a single stand-alone file would not be easier.

For domain-specific identifiers, we will use Wikidata, as we already have some coverage of these Q-identifiers.

jmccrae added this to the 2025 Release milestone Nov 29, 2024
@arademaker
Member

Regarding Wikidata links, the current plan seems to be to map synsets to Q items. What about the lexical items from Wikidata? I believe this is a space where we can also contribute.

@goodmami
Member

For the ili, maybe we can just use the Q identifiers from Wikidata instead of assigning new CILI ids for each?

I like @goodmami's suggestion that we use lexicon extensions, but I also wonder whether just releasing these as a single stand-alone file would not be easier.

For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon.

For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions.
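
To make the "external entities" concrete: they are stub elements (ExternalSynset, ExternalLexicalEntry, ExternalSense, ExternalLemma) that simply repeat the IDs of items already defined in the base lexicon so that IDREFs within the extension file resolve; they carry no content of their own. A rough sketch with placeholder IDs, assuming a new proper-noun synset attached to the base hierarchy:

<!-- stub repeating the ID of a synset that lives in the base lexicon -->
<ExternalSynset id="oewn-00000000-n"/>

<!-- new synset defined by the extension, linked under that base synset (other attributes omitted) -->
<Synset id="oewn-terms-q0000000-n" ili="" partOfSpeech="n">
  <Definition>...</Definition>
  <SynsetRelation relType="instance_hypernym" target="oewn-00000000-n"/>
</Synset>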

@jmccrae
Member Author

jmccrae commented Dec 2, 2024

For the ili, maybe we can just use the Q identifiers from Wikidata instead of assigning new CILI ids for each?

Yes, we should probably start to use Q IDs, at least for any proper nouns.

I like @goodmami's suggestion that we use lexicon extensions, but I also wonder whether just releasing these as a single stand-alone file would not be easier.

For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon.

For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions.

I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller.

@goodmami
Member

goodmami commented Dec 3, 2024

I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller.

Fair point. A reduced file size isn't the main benefit, but in any case the external entities do not get created in the Wn database, as they should already exist from the original lexicon; their only purpose is so the XML IDREFs resolve within the file.

A more crucial use case is if we have more than one additional dataset, for instance if we wanted to break Wikidata up into separate files for people, places, etc., or if we add another source entirely. The more of these there are, the less appealing it becomes to compile single-file variants of OEWN with and without each of them for every combination.

It does seem convenient to have pre-compiled versions of the likely important combinations.

@fcbond
Member

fcbond commented Dec 4, 2024 via email

@jmccrae
Member Author

jmccrae commented Dec 4, 2024

Hi, I think it is a good idea to use the Q IDs, and maybe even to allow more (like geonames identifiers).

One of the advantages of linking to Wikidata is that Geonames and other databases are then linked from it. As most of the locations in WordNet can be mapped to Wikidata, I don't think we will need to link to two databases. For the few that aren't linked, I will probably try to create Wikidata pages with links to Geonames (like Sealyham, which has about 20 houses).

In practice, I am not sure how we would manage this. I don't think we can use the same field, as there will be some words that have both an ILI and a Q ID, and people should be able to access them with either. Maybe we have a small set (starting with ILIs and Q IDs) and distinguish them with the first letter (i or q)? In the wordnet database, presumably they would then have separate fields, and I guess also in the schema?

Yes, it may be a good idea to add a wikidata field to the schema as we have an ILI field. This is essentially what we are doing in OEWN internally.
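
As a rough sketch of how that might surface in the XML (the wikidata attribute below is hypothetical and not part of the current schema; synset IDs, ILIs and other required attributes are placeholders or omitted):

<!-- option 1: a dedicated attribute alongside ili, which would need a schema change -->
<Synset id="oewn-00000000-n" ili="i00000" wikidata="Q60" partOfSpeech="n"/>

<!-- option 2: no schema change, carrying the Q ID in the existing Dublin Core metadata -->
<Synset id="oewn-00000000-n" ili="i00000" dc:source="https://www.wikidata.org/entity/Q60" partOfSpeech="n"/>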

@fcbond
Member

fcbond commented Dec 4, 2024

I want to link to Geonames so that we can access locations not in wordnet :-). So for your goal of removing instances it is fine, but for the longer-term goal of covering everything everywhere, I would like to plan ahead.

@jmccrae
Member Author

jmccrae commented Dec 4, 2024 via email

@goodmami
Member

goodmami commented Dec 5, 2024

@goodmami How hard would it be to create a merged DB by loading an extension and then dumping the combined lexicon? [...]

Hmm, probably not as hard as exporting an extension (which seemed non-trivial when I thought about it a few years ago: goodmami/wn#103), but it would still require some code changes.

@rob-ross

rob-ross commented Jan 10, 2025

Has anyone thought about adding a new attribute to explicitly mark proper nouns, adjectives and adverbs? E.g.:

<LexicalEntry id="oewn-Japanese-a">
  <Lemma writtenForm="Japanese" partOfSpeech="a" isProper="True">
    <Pronunciation>ˌdʒæpəˈniːz</Pronunciation>
  </Lemma>

...

I'm a software dev, and experience teaches me it's better to be explicit when possible as opposed to implicitly determining something from some other attribute, such as "proper nouns are Lemmas without hyphens that are capitalized". Except for the exceptions... :)

This attribute could also go in the LexicalEntry tag. A simpler approach would be to introduce a new partOfSpeech character for proper nouns, but I suppose this would require a lot of changes to existing code so it's probably not feasible? You could probably automate back-filling the new attribute with the correct value and handle the exceptions with some manual analysis.

Then again, I'm guessing the universe of possible proper nouns is larger than the current word count, so I do see value in moving proper words to a separate "expansion pack." But if you could tag proper words in the main xml file, tools could use this to filter them out if not wanted.

Just my suggestion. :)

  • Rob

@jmccrae
Member Author

jmccrae commented Jan 13, 2025

I agree that 'proper noun' should probably be its own category; we cannot use capitalization as there are difficult cases like 'A-bomb'. We can either do this with an extra property as proposed above (which is similar to how we already model postpositive adjectives), or another option is to create a new lexfile for proper nouns.
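
For readers unfamiliar with the postpositive-adjective precedent: that is currently recorded as an optional adjposition attribute on the Sense element, so a proper-noun marker could take a similar per-sense or per-lemma form. A sketch with placeholder IDs:

<!-- existing mechanism: adjective position flagged on the Sense -->
<Sense id="oewn-example-sense" synset="oewn-00000000-a" adjposition="ip"/>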

@rob-ross

rob-ross commented Jan 15, 2025

There are "core" proper nouns that are used often enough to justify keeping them in the standard xml file. E.g. "English." We can use tools such as Google's Ngrams to determine frequency of use and keep the "most often used" proper nouns.

Edit: the OP topic is for removing "instance hyponyms." (I'm new to the terminology but learning.) So we're not talking about removing ALL proper nouns right, just a subset of them? Would "English" be considered an instance hyponym of "language?" If so, I'd revise the above to state "there are core instance hyponyms that should be kept in the main xml file."

@jmccrae
Member Author

jmccrae commented Jan 17, 2025

There are "core" proper nouns that are used often enough to justify keeping them in the standard xml file. E.g. "English." We can use tools such as Google's Ngrams to determine frequency of use and keep the "most often used" proper nouns.

Frequency is difficult to use as a criterion here. Firstly, it is hard to decide where to draw the line. Secondly, many terms have different senses. For example, 'Smith' is the lemma of quite a large number of terms:

https://en-word.net/lemma/Smith

Most of these are probably too obscure to be included.

My opinion is that the cleanest approach is just to remove proper nouns entirely from the core resource.

Edit: the OP topic is for removing "instance hyponyms." (I'm new to the terminology but learning.) So we're not talking about removing ALL proper nouns right, just a subset of them? Would "English" be considered an instance hyponym of "language?" If so, I'd revise the above to state "there are core instance hyponyms that should be kept in the main xml file."

I see "instance hyponyms" and "proper nouns" as the same thing

Instance Hyponym: A relation between two concepts where concept A (instance_hyponym) is a type of concept B (instance_hypernym), and where A is an individual entity
Proper Noun: A proper noun is a noun that identifies a single entity

This is not currently the case in the model, with many proper nouns being non-instance hyponyms (such as 'English'). I think we should change the relation types in this case to instance hypernymy.
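
Concretely, that change is just a different relType on the existing SynsetRelation; both values are already defined in WN-LMF (synset IDs below are placeholders):

<!-- current modelling: 'English' as an ordinary subclass of 'language' -->
<SynsetRelation relType="hypernym" target="oewn-00000000-n"/>

<!-- proposed: 'English' as a named individual -->
<SynsetRelation relType="instance_hypernym" target="oewn-00000000-n"/>

The inverse link on the 'language' synset would likewise change from hyponym to instance_hyponym.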
