Release Proper Nouns as Separate WordNet #1126
Comments
Not sure about "best", but one option is a WN-LMF 1.1+ lexicon extension. We have also discussed creating ILIs that are not CILI for domain-specific identifiers. |
I propose that the 2025 release consists of three separate versions (XML files)
I like @goodmami's suggestion that we use lexicon extensions, but also I wonder if just releasing these as a single stand-alone file would not be easier. For domain-specific identifiers, we will use Wikidata, as we already have some coverage of these Q-identifiers. |
Regarding Wikidata links, the current plan seems to be to map synsets to Q items. What about the lexical items from Wikidata? I believe this is a space where we can also contribute. |
For the
For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon. For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions. |
Yes, we should probably start to use Q IDs, at least for any proper nouns.
I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller. |
Fair point. A reduced file size isn't the main point, but in any case the external entities do not get created in the Wn database as they should already exist from the original lexicon; their only purpose is so the XML IDREFs work within the file. A more crucial use-case is if we have more than one additional dataset. This could be if we wanted to break up Wikidata into separate files for people, places, etc., or if we have another source entirely. The more there are, the less appealing it is to compile with/without single-file variants of OEWN for all combinations. It does seem convenient to have pre-compiled versions of the likely important combinations. |
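To make the combinatorial point above concrete, here is a small sketch that counts how many pre-compiled single-file variants would exist if every combination of add-on datasets were shipped. The dataset names are placeholders ("people" and "places" echo the examples above, "species" is an assumed third one); nothing here reflects an actual release plan.

```python
from itertools import combinations

# Hypothetical add-on datasets, used only for illustration.
addons = ["people", "places", "species"]


def all_variants(names):
    """Return every subset of add-ons that could be merged with the base lexicon."""
    variants = []
    for size in range(len(names) + 1):
        variants.extend(combinations(names, size))
    return variants


variants = all_variants(addons)
print(len(variants))  # 2 ** 3 = 8, counting the bare base lexicon as one variant
```

Each further dataset doubles the count, which is why shipping separate extensions plus a few likely combinations scales better than compiling every variant.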
Hi,
I think it is a good idea to use the Q IDs, and maybe even to allow more
(like geonames identifiers). In practice, I am not sure how we would
manage this.
I don't think we can use the same field, as there will be some words that
have both an ILI and a Q ID, and people should be able to access them with
either. Maybe we have a small set (start with ILI and Q IDs) and
distinguish them with the first letter (i or q)? In the wordnet database,
presumably they would then have separate fields, and I guess also in the
schema? So you can have an ili and a Qid? I worry that if we end up with
too many, then it will be messy, but I guess in practice we don't expect
many more? The candidates I would think of are geonames and the species
database, but I don't know if they are completely subsumed by wikidata,
... A quick look online suggests that wikidata links around 20% of
geonames, so maybe still worth having as separate.
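The idea of telling identifiers apart by their first letter while storing them in separate fields could look roughly like the sketch below. The class and function names are hypothetical, the example identifiers are made up, and this is not how Wn or the WN-LMF schema currently handles it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SynsetLinks:
    """External identifiers for one synset, kept in separate fields (hypothetical layout)."""
    ili: Optional[str] = None
    wikidata: Optional[str] = None


def add_identifier(links: SynsetLinks, identifier: str) -> SynsetLinks:
    """Route an identifier to the right field based on its first letter (i or q)."""
    first = identifier[:1].lower()
    if first == "i":
        links.ili = identifier
    elif first == "q":
        links.wikidata = identifier
    else:
        raise ValueError(f"unrecognised identifier scheme: {identifier!r}")
    return links


# Made-up identifiers: a synset can carry both an ILI and a Q ID.
links = add_identifier(add_identifier(SynsetLinks(), "i12345"), "Q67890")
print(links)
```

A purely numeric scheme such as GeoNames identifiers would not fit the first-letter convention and would need its own prefix or field, which is one way the "messy if there are too many" concern shows up in practice.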
When we built the geonames wordnet, we anticipated people using subsets of
it (an overall common, and then one or more region specific DBs) as
otherwise it is very big.
@goodmami How hard would it be to create a merged DB by loading an
extension and then dumping the combined lexicon? Then we could easily have
separate lexicons and also create a few commonly used combinations, as you
suggested.
--
Francis Bond <https://fcbond.github.io/>
|
One of the advantages of linking to Wikidata is that Geonames and other databases are then linked from this. As most of the locations in WordNet can be mapped to Wikidata, I don't think we will need to link to two databases. For the few that aren't linked, I will probably try to create Wikidata pages with links to Geonames (like Sealyham which has about 20 houses).
Yes, it may be a good idea to add a wikidata field to the schema as we have an ILI field. This is essentially what we are doing in OEWN internally. |
I want to link to geonames so that we can access locations not in wordnet :-). So for your goal of removing instances, it is fine, but for the longer goal of covering everything everywhere, I would like to plan ahead. |
There are very few instances that are in GeoNames and not Wikidata and I
think the easier way to capture these is to modify Wikidata, as anything
that is in both Geonames and OEWN should meet their notability requirements
<https://www.wikidata.org/wiki/Wikidata:Notability>.
In fact the list of such entities is entirely islands that are more
commonly referred to as geo-political entities:
Anguilla (island), Bermuda (island), Montserrat (island), British West
Indies (islands), Guadeloupe (islands), Tuvalu (islands), Faroe islands
(islands), Philippine islands (islands), Seychelles (islands).
Regards,
John
|
Hmm, probably not as hard as exporting an extension (which seemed non-trivial when I thought about it a few years ago: goodmami/wn#103), but it would still require some code changes. |
Has anyone thought about adding a new attribute to explicitly mark proper nouns, adjectives and adverbs? E.g.:
... I'm a software dev, and experience teaches me it's better to be explicit when possible as opposed to implicitly determining something from some other attribute, such as "proper nouns are Lemmas without hyphens that are capitalized". Except for the exceptions... :) This attribute could also go in the LexicalEntry tag. A simpler approach would be to introduce a new partOfSpeech character for proper nouns, but I suppose this would require a lot of changes to existing code so it's probably not feasible? You could probably automate back-filling the new attribute with the correct value and handle the exceptions with some manual analysis. Then again, I'm guessing the universe of possible proper nouns is larger than the current word count, so I do see value in moving proper words to a separate "expansion pack." But if you could tag proper words in the main xml file, tools could use this to filter them out if not wanted. Just my suggestion. :)
|
I agree that 'proper noun' should probably be its own category. We cannot use capitalization, as there are difficult cases like 'A-bomb'. We can either do this with an extra property as proposed above (which is similar to how we already model postpositive adjectives), or create a new lexfile for proper nouns. |
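A quick sketch of why a capitalization heuristic falls short compared with an explicit flag. The `explicit` table is hand-written for illustration and assumes 'A-bomb' is a common noun and 'English' a proper noun, as the thread suggests; the function name is hypothetical.

```python
def looks_proper(lemma: str) -> bool:
    """Naive heuristic: any lemma starting with an uppercase letter counts as proper."""
    return lemma[:1].isupper()


# Assumed, hand-assigned flags standing in for an explicit attribute on each entry.
explicit = {"A-bomb": False, "English": True, "language": False}

# Lemmas where the heuristic disagrees with the explicit flag.
mismatches = [
    lemma for lemma, is_proper in explicit.items()
    if looks_proper(lemma) != is_proper
]
print(mismatches)  # ['A-bomb']
```

With an explicit property or a dedicated lexfile, tools can filter on the stored value and cases like this need no special handling.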
There are "core" proper nouns that are used often enough to justify keeping them in the standard xml file. E.g. "English." We can use tools such as Google's Ngrams to determine frequency of use and keep the "most often used" proper nouns. Edit: the OP topic is for removing "instance hyponyms." (I'm new to the terminology but learning.) So we're not talking about removing ALL proper nouns right, just a subset of them? Would "English" be considered an instance hyponym of "language?" If so, I'd revise the above to state "there are core instance hyponyms that should be kept in the main xml file." |
Frequency is difficult to use as a criterion here. Firstly, it is hard to decide where to draw the line. Secondly, many terms have different senses. For example, 'Smith' is the lemma of quite a large number of terms: https://en-word.net/lemma/Smith Most of these are probably too obscure to be included. In my opinion, the cleanest option is simply to remove proper nouns entirely from the core resource.
I see "instance hyponyms" and "proper nouns" as the same thing.
Instance Hyponym: A relation between two concepts where concept A (instance_hyponym) is a type of concept B (instance_hypernym), and where A is an individual entity.
This is not currently the case in the model, with many proper nouns being non-instance hyponyms (such as 'English'). I think we should change the relation types in this case to instance hypernymy. |
I propose that starting with the 2025 release, we remove all the instance hyponyms from the release. These would be made in a separate release that is closely connected to Wikidata. I will complete a manual linking between Wikidata and OEWN to enable this.
This issue tracks progress on this work.
Any advice on how best to do this would be welcome.
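As a rough illustration of the split proposed here, the sketch below partitions the synsets of a WN-LMF-style fragment into those that have an `instance_hypernym` relation and those that do not. The sample lexicon, its IDs and the function name are invented for the example, and this is not the project's actual build process.

```python
import xml.etree.ElementTree as ET

# Tiny made-up fragment in the style of WN-LMF; all IDs are placeholders.
SAMPLE = """
<Lexicon id="demo" label="Demo" language="en" email="demo@example.org"
         license="https://example.org/license" version="0.1">
  <Synset id="demo-1-n" ili="" partOfSpeech="n">
    <SynsetRelation relType="hypernym" target="demo-0-n"/>
  </Synset>
  <Synset id="demo-2-n" ili="" partOfSpeech="n">
    <SynsetRelation relType="instance_hypernym" target="demo-1-n"/>
  </Synset>
</Lexicon>
"""


def split_instances(lexicon):
    """Return (instance synset IDs, remaining synset IDs) for one Lexicon element."""
    instances, others = [], []
    for synset in lexicon.findall("Synset"):
        rel_types = {rel.get("relType") for rel in synset.findall("SynsetRelation")}
        if "instance_hypernym" in rel_types:
            instances.append(synset.get("id"))
        else:
            others.append(synset.get("id"))
    return instances, others


lexicon = ET.fromstring(SAMPLE)
instances, others = split_instances(lexicon)
print(instances)  # ['demo-2-n']
print(others)     # ['demo-1-n']
```

A real split would also have to move the matching lexical entries and senses, and deal with relations in the remaining file that point at removed synsets.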