Release Proper Nouns as Separate WordNet #1126
Comments
Not sure about "best", but one option is a WN-LMF 1.1+ lexicon extension. We have also discussed creating ILIs that are not CILI for domain-specific identifiers. |
I propose that the 2025 release consists of three separate versions (XML files)
I like @goodmami's suggestion that we use lexicon extensions, but also I wonder if just releasing these as a single stand-alone file would not be easier. For domain-specific identifiers, we will use Wikidata, as we already have some coverage of these Q-identifiers. |
Regarding Wikidata links, the current plan seems to be to map synsets to Q items. What about the lexical items from Wikidata? I believe this is a space where we can also contribute. |
For the
For lexicographers, I think the only non-trivial part of producing an extension is setting up external entities if you need to link things to the original lexicon. For users, however, one downside is that currently Wn is, as far as I know, the only software package that supports lexicon extensions. |
Yes, we should probably start to use Q IDs, at least for any proper nouns.
I suspect that technically it would be easier for users to have one big file; the extension modelling is quite tricky and doesn't make the file much smaller. |
Fair point. A reduced file size isn't the main point, but in any case the external entities do not get created in the Wn database as they should already exist from the original lexicon; their only purpose is so the XML IDREFs work within the file. A more crucial use-case is if we have more than one additional dataset. This could be if we wanted to break up Wikidata into separate files for people, places, etc., or if we have another source entirely. The more there are, the less appealing it is to compile with/without single-file variants of OEWN for all combinations. It does seem convenient to have pre-compiled versions of the likely important combinations. |
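For readers who have not seen a lexicon extension, here is a minimal sketch of what one might look like when generated from Python. The element and attribute names follow my reading of WN-LMF 1.1 and should be checked against the DTD; the extension id `oewn-names`, the synset ids, the email address, and the base lexicon version are all hypothetical placeholders. The point it illustrates is that `ExternalSynset` carries only an id, so that the `target` IDREF on the new synset's relation resolves within the file.

```python
import xml.etree.ElementTree as ET


def build_extension(base_id, base_version, names):
    """Return a <LexicalResource> element holding one lexicon extension.

    names: list of (lemma, new_synset_id, base_hypernym_id) tuples.
    """
    ext_id = "oewn-names"  # hypothetical extension id
    root = ET.Element("LexicalResource")
    ext = ET.SubElement(root, "LexiconExtension", {
        "id": ext_id,
        "label": "Proper nouns (illustrative)",
        "language": "en",
        "email": "maintainer@example.org",  # placeholder
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "version": "0.1",
    })
    # Declare which lexicon (and version) this file extends.
    ET.SubElement(ext, "Extends", {"id": base_id, "version": base_version})

    # New lexical entries for the proper nouns.
    for n, (lemma, synset_id, _) in enumerate(names):
        entry_id = f"{ext_id}-entry-{n}"
        entry = ET.SubElement(ext, "LexicalEntry", {"id": entry_id})
        ET.SubElement(entry, "Lemma", {"writtenForm": lemma, "partOfSpeech": "n"})
        ET.SubElement(entry, "Sense", {"id": f"{entry_id}-sense", "synset": synset_id})

    # External synsets: stubs for synsets that already exist in the base
    # lexicon, present only so the IDREFs below resolve.
    for base_synset in sorted({hyper for _, _, hyper in names}):
        ET.SubElement(ext, "ExternalSynset", {"id": base_synset})

    # New synsets, each linked to its class in the base lexicon.
    for _, synset_id, hyper in names:
        synset = ET.SubElement(
            ext, "Synset", {"id": synset_id, "ili": "", "partOfSpeech": "n"}
        )
        ET.SubElement(
            synset, "SynsetRelation",
            {"relType": "instance_hypernym", "target": hyper},
        )
    return root


# Hypothetical ids throughout; "oewn-00000001-n" stands in for a real base synset.
doc = build_extension(
    "oewn", "2024", [("London", "oewn-names-Q84-n", "oewn-00000001-n")]
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Splitting people, places, and so on into separate files would then amount to calling such a builder once per dataset, each with its own extension id.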
Hi,
I think it is a good idea to use the Q IDs, and maybe even to allow more
(like geonames identifiers). In practice, I am not sure how we would
manage this.
I don't think we can use the same field, as there will be some words that
have both an ILI and a Q ID, and people should be able to access them with
either. Maybe we have a small set (start with ILI and Q IDs) and
distinguish them with the first letter (i or q)? In the wordnet database,
presumably they would then have separate fields, and I guess also in the
schema? So you can have an ili and a Qid? I worry that if we end up with
too many, then it will be messy, but I guess in practice we don't expect
many more? The candidates I would think of are geonames and the species
database, but I don't know if they are completely subsumed by wikidata,
... A quick look online suggests that wikidata links around 20% of
geonames, so maybe still worth having as separate.
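As a rough sketch of the "separate fields, distinguished by first letter" idea, assuming nothing about how Wn or the schema would actually store it: each record carries an optional ILI and an optional Q ID, and a single lookup dispatches on the first character. The class names, the synset ids, and the ILI value below are hypothetical; only `Q84` (London) is a real Wikidata identifier.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SynsetRecord:
    synset_id: str
    ili: Optional[str] = None       # e.g. "i00001" (hypothetical value)
    wikidata: Optional[str] = None  # e.g. "Q84"


class IdentifierIndex:
    """Look up a synset by either its ILI or its Wikidata Q ID."""

    def __init__(self, records: Iterable[SynsetRecord]) -> None:
        self._by_ili: Dict[str, SynsetRecord] = {}
        self._by_qid: Dict[str, SynsetRecord] = {}
        for record in records:
            # A record may have one identifier, both, or neither.
            if record.ili:
                self._by_ili[record.ili] = record
            if record.wikidata:
                self._by_qid[record.wikidata] = record

    def lookup(self, identifier: str) -> Optional[SynsetRecord]:
        key = identifier.strip()
        prefix = key[:1].lower()
        if prefix == "i":
            return self._by_ili.get("i" + key[1:])
        if prefix == "q":
            return self._by_qid.get("Q" + key[1:])
        raise ValueError(f"unrecognised identifier scheme: {identifier!r}")


# Toy data: one synset with both identifiers, one with a Q ID only.
index = IdentifierIndex([
    SynsetRecord("example-city-n", ili="i00001", wikidata="Q515"),
    SynsetRecord("example-london-n", wikidata="Q84"),
])
```

Adding a third scheme (GeoNames, say) would mean another field and another prefix, which is where the worry about this getting messy comes from.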
When we built the geonames wordnet, we anticipated people using subsets of
it (an overall common, and then one or more region specific DBs) as
otherwise it is very big.
@goodmami How hard would it be to create a merged DB by loading an
extension and then dumping the combined lexicon? Then we could easily have
separate lexicons and also create a few commonly used combinations, as you
suggested.
--
Francis Bond <https://fcbond.github.io/>
|
One of the advantages of linking to Wikidata is that Geonames and other databases are then linked from this. As most of the locations in WordNet can be mapped to Wikidata, I don't think we will need to link to two databases. For the few that aren't linked, I will probably try to create Wikidata pages with links to Geonames (like Sealyham which has about 20 houses).
Yes, it may be a good idea to add a wikidata field to the schema as we have an ILI field. This is essentially what we are doing in OEWN internally. |
I want to link to geonames so that we can access locations not in wordnet :-). So for your goal of removing instances, it is fine, but for the longer goal of covering everything everywhere, I would like to plan ahead. |
There are very few instances that are in GeoNames and not Wikidata and I
think the easier way to capture these is to modify Wikidata, as anything
that is in both Geonames and OEWN should meet their notability requirements
<https://www.wikidata.org/wiki/Wikidata:Notability>.
In fact the list of such entities is entirely islands that are more
commonly referred to as geo-political entities:
Anguilla (island), Bermuda (island), Montserrat (island), British West
Indies (islands), Guadeloupe (islands), Tuvalu (islands), Faroe Islands
(islands), Philippine Islands (islands), Seychelles (islands).
Regards,
John
|
Hmm, probably not as hard as exporting an extension (which seemed non-trivial when I thought about it a few years ago: goodmami/wn#103), but it would still require some code changes. |
I propose that, starting with the 2025 release, we remove all the instance hyponyms from the main release. These would be published in a separate release that is closely connected to Wikidata. I will complete a manual linking between Wikidata and OEWN to enable this.
This issue tracks progress on that work.
Any advice on how best to do this would be welcome.
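To make the proposed split concrete, here is a minimal sketch over toy data, not the actual OEWN build code. It assumes synsets are held in a plain dict keyed by id, with relations as `(relation_type, target_id)` pairs; that layout and the ids are hypothetical. A synset counts as an instance if it has an `instance_hypernym` relation, and relations pointing at removed instances are dropped from the remaining synsets so the main release has no dangling references.

```python
def split_instances(synsets):
    """Partition synsets into (core, names).

    synsets: dict mapping synset id -> {"relations": [(rel_type, target), ...]}
    core:    synsets with no instance_hypernym, minus links to removed synsets
    names:   synsets that are instances (they have an instance_hypernym)
    """
    instance_ids = {
        sid
        for sid, synset in synsets.items()
        if any(rel == "instance_hypernym" for rel, _ in synset["relations"])
    }

    core = {}
    for sid, synset in synsets.items():
        if sid in instance_ids:
            continue
        # Drop relations whose target is moving to the separate release.
        kept = [
            (rel, target)
            for rel, target in synset["relations"]
            if target not in instance_ids
        ]
        core[sid] = {**synset, "relations": kept}

    names = {sid: synsets[sid] for sid in instance_ids}
    return core, names


# Toy data with hypothetical ids.
toy = {
    "city": {"relations": [("hypernym", "municipality"),
                           ("instance_hyponym", "london")]},
    "municipality": {"relations": [("hyponym", "city")]},
    "london": {"relations": [("instance_hypernym", "city")]},
}
core, names = split_instances(toy)
```

The `names` half keeps its links back into the core, which is what a separate release (or extension) would need in order to reattach the instances.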