Wikidata sync script doesn't appear to search for all discrepancies #9972
Comments
I filtered out most of the discrepancies by hand a few days ago. I don't remember the exact numbers, but I think there were around 60 orphaned uses of P8253. IMO it'd still be a good idea to eventually work the checks I described into one of the npm scripts.
But in this case, if the Wikidata reference isn't in the NSI data file, the script wouldn't have a point of reference to find and check the liO entry, so it wouldn't be able to remove the custom property or update the data folder, as it couldn't find the Wikidata item in the first place. The only way to find orphaned Wikidata entries that have the NSI property attached to them would be to look for the property across the entire Wikidata database. It would be good if the script could also factor in changes on our side, though, such as moving a Wikidata item to a different category and updating Wikidata accordingly.
Indeed, and that's where https://www.wikidata.org/wiki/Special:WhatLinksHere/Property:P8253 comes into play. This link is a "clean" version of the results; a raw list that's more tailored for developers and scripts is available through the MediaWiki API (e.g., https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&list=backlinks&formatversion=2&bltitle=Property%3AP8253, which then needs to be run in a loop to get all links). The results can then be stored and modified by the NSI script in whatever way is needed for the script to compare QID usage.
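For reference, a minimal sketch of that loop (not from the original comment; it assumes Node 18+ for the built-in `fetch()` and uses the same `list=backlinks` parameters as the ApiSandbox link above):

```js
// Minimal sketch (assumption: Node 18+ so fetch() is built in). It walks the
// MediaWiki backlinks API with continuation to collect every page that links
// to Property:P8253, mirroring the ApiSandbox query linked above.
const API = 'https://www.wikidata.org/w/api.php';

async function fetchBacklinks(title = 'Property:P8253') {
  const qids = [];
  let blcontinue;
  do {
    const params = new URLSearchParams({
      action: 'query',
      format: 'json',
      formatversion: '2',
      list: 'backlinks',
      bltitle: title,
      bllimit: 'max'                       // up to 500 results per request
    });
    if (blcontinue) params.set('blcontinue', blcontinue);

    const res = await fetch(`${API}?${params}`);
    const data = await res.json();

    for (const link of data.query.backlinks) {
      if (/^Q\d+$/.test(link.title)) qids.push(link.title);   // keep only items (QIDs)
    }
    blcontinue = data.continue?.blcontinue;  // undefined once all pages are fetched
  } while (blcontinue);

  return qids;
}
```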
Very good point. After considering this, I think the most a script could do from our end is find orphaned entries on Wikidata (via the above method) and either 1) remove the property from the Wikidata page or 2) warn about their presence in the same way that the script warns about other problems, like deleted Wikidata items.
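A rough sketch of option 2, assuming the NSI-side QIDs and the backlink list have already been gathered (the names and wiring here are illustrative, not the actual script):

```js
// Sketch of the "warn" option: report Wikidata items that link to P8253 but
// have no corresponding entry on the NSI side. `nsiQids` (e.g. the QIDs found
// in dist/wikidata.json) and `backlinkQids` (from the backlinks query above)
// are assumed to have been collected already.
function warnOrphanedItems(nsiQids, backlinkQids) {
  const known = new Set(nsiQids);
  const orphans = backlinkQids.filter(qid => !known.has(qid));
  for (const qid of orphans) {
    console.warn(`Orphaned P8253 link: https://www.wikidata.org/wiki/${qid} is not referenced by any NSI entry`);
  }
  return orphans;
}
```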
I was working on attaching Wikidata IDs to transit networks when I came across something odd: the Wikidata item (Q55597931) for the French network liO already had the property P8253 (OSM Name Suggestion Index ID) correctly set with liO's current NSI IDs, even though the `network:wikidata` tag is not present in the tags for liO in `data\transit\route\bus.json`, nor is Q55597931 present in `dist\wikidata.json`. It turns out that the property was manually added to the Wikidata item back in March, and the fact that `npm run wikidata` 1) does not remove this custom property addition that is essentially isolated from the NSI, nor 2) add the Wikidata QID to the relevant item in the `data` folder upon discovering that it's missing, made me curious as to what else is slipping through the cracks.

The problems
As a test, I gathered every QID in `dist\wikidata.json` and compared that list with the first 500 QIDs that link to Wikidata property P8253. Even in just the first 500 IDs (out of several thousand), there was one item that linked to P8253 without having an entry in `dist\wikidata.json`: Q125054. I looked at the item in question, and indeed, the property is present on the item with links to long-obsolete NSI IDs dating back to a time when Aldi had a consolidated entry in the NSI. (The property and IDs were added using the normal method in May 2023, but apparently never removed when the Aldi IDs were split and pointed to other Wikidata items.)

Another kind of discrepancy is one I encountered myself a few weeks back when I adjusted the entries for the bus networks operated by the Rochester-Genesee Regional Transportation Authority (RGRTA). The Wikidata QID for RGRTA was moved from `network:wikidata` to `operator:wikidata` as part of the changes, and while `npm run wikidata` did correctly add the bus network NSI IDs and P8253 to the new Wikidata pages for each individual network, it did not remove the NSI IDs and P8253 from RGRTA's Wikidata item. The IDs on the RGRTA item ultimately had to be manually removed once the code changes went live, as OSM's iD editor somehow got confused by having the same NSI ID for the bus networks on two separate Wikidata items: RGRTA and RTS/RTS Genesee/etc. Until I removed the NSI IDs from the RGRTA Wikidata item, the iD editor kept cycling through the old tag presets from the previous release and the new presets from the then-current release.

Similar code changes are awaiting deployment for the Syracuse-based Centro bus operator and its networks, and as of now I expect to have to manually remove the NSI IDs and P8253 from Centro's Wikidata page when the current code is released.
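As an aside, leftovers of this kind (Q125054's obsolete Aldi IDs, or the RGRTA IDs that should have moved) could in principle be detected by comparing an item's P8253 values against the NSI IDs that currently exist. A hedged sketch, assuming a `Set` of current NSI IDs is available from the data files:

```js
// Sketch: list P8253 values on a given Wikidata item that no longer match a
// current NSI ID (the Q125054 / RGRTA situation described above).
// `currentNsiIds` is assumed to be a Set of every id present in the NSI data.
async function findStaleNsiIds(qid, currentNsiIds) {
  const params = new URLSearchParams({
    action: 'wbgetclaims',
    format: 'json',
    entity: qid,
    property: 'P8253'
  });
  const res = await fetch(`https://www.wikidata.org/w/api.php?${params}`);
  const data = await res.json();

  const values = (data.claims?.P8253 ?? [])
    .map(claim => claim.mainsnak?.datavalue?.value)
    .filter(Boolean);

  return values.filter(id => !currentNsiIds.has(id));
}
```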
Possible resolution methods
Although the obsolete and duplicate links to NSI IDs on Wikidata could be manually filtered out by comparing the QIDs in `dist\wikidata.json` to those listed in the "What links here" page for Wikidata property P8253, I feel that this is a task better suited for a script. There are over 20,000 QIDs in `dist\wikidata.json` and several thousand links on Wikidata to P8253, making the comparison a time-consuming task if done by hand. (EDIT: there are about 20,800 QIDs in the former and 20,250 links to the latter.) A complete list of links to P8253 can be retrieved by script using the MediaWiki API, although I can't remember the precise code since it's been many years since I've used that API. (I was an administrator on the English Wikipedia for a few years under a different username until I retired and exercised my right to vanish.)

I would suggest adding the above check to `npm run wikidata` after it reads the QIDs present in the `data` folder, unless this would break things in the iD editor for NSI entries that are edited to point from one Wikidata item to another between releases of the NSI. If that is the case, then I would suggest making the check part of `npm run dist`.
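Purely as an illustration of where such a check could hook in (the actual build scripts may be organized quite differently), the sketches above could be combined into a single step, for example:

```js
// Hypothetical wiring only — the real NSI build scripts may differ. It just
// shows how the pieces sketched earlier in this thread could be combined into
// one check that runs after the QIDs are read.
async function checkWikidataDiscrepancies(nsiQids, currentNsiIds) {
  const backlinkQids = await fetchBacklinks('Property:P8253');
  const orphans = warnOrphanedItems(nsiQids, backlinkQids);
  for (const qid of orphans) {
    const stale = await findStaleNsiIds(qid, currentNsiIds);
    if (stale.length) {
      console.warn(`  ${qid} still carries NSI IDs: ${stale.join(', ')}`);
    }
  }
}
```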