ADM1 and ADM2 shape names contain ? (UTF-8 encoding issue?) #22
Comments
@rohith4444 When you get back, can you take a look at this? @2238154 Agreed, this must be an encoding issue, but I wonder if it is specific to our most recent data. When you say "the files that can be downloaded directly from the geoboundaries website", do you mean the GUI at geoboundaries.org? Those files should now be identical to the files retrieved through the R API, so I'm not sure why you would be seeing different shapeIDs. Separately, one thing I'm working on is giving users the ability to specify a version of geoBoundaries to "lock" the files they get; right now, the R API returns the most recent boundaries on each call. Coming soon, hopefully.
@DanRunfola the encoding problem might be related to how the shapefiles for the CGAZ dataset are encoded. Inspecting the .dbf shows some encoding issues there, and I think they can be fixed upstream. I'm not sure whether the GeoPackage or GeoJSON files have the same issue.
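For anyone who wants to repeat the .dbf inspection mentioned above, here is a minimal sketch. It assumes a standard dBase header (language driver ID at byte 29, record count at bytes 4-7) and an optional `.cpg` sidecar file next to the table. The function name `dbf_encoding_hints` is made up for illustration and is not part of rgeoboundaries or any geoBoundaries tooling.

```python
import struct
from pathlib import Path


def dbf_encoding_hints(dbf_path):
    """Return the encoding hints stored with a shapefile's .dbf table."""
    dbf_path = Path(dbf_path)
    with open(dbf_path, "rb") as fh:
        header = fh.read(32)
    if len(header) < 32:
        raise ValueError("file too short to be a dBase table")

    # Byte 29 of the dBase header is the language driver ID (LDID);
    # 0 means no code page was recorded in the table itself.
    ldid = header[29]

    # Bytes 4-7 hold the record count as a little-endian unsigned 32-bit int.
    (n_records,) = struct.unpack("<I", header[4:8])

    # A sidecar .cpg file, when present, names the code page as plain text.
    cpg_path = dbf_path.with_suffix(".cpg")
    cpg = None
    if cpg_path.exists():
        cpg = cpg_path.read_text(encoding="ascii", errors="replace").strip()

    return {"ldid": ldid, "records": n_records, "cpg": cpg}
```

An LDID of 0 together with a missing `.cpg` file would mean that readers have to guess the code page, which is one plausible way for names to be decoded incorrectly. Translating a non-zero LDID into a code page name requires a lookup table that this sketch does not include.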
Hi @DanRunfola, by "geoboundaries website" I am indeed referring to the files that can be downloaded from geoboundaries.org: not the CGAZ dataset, but the global or single-country files under the Download Data tab. Regarding the shapeID difference I mentioned, the website files have a numeric string, whereas the rgeoboundaries files have a string composed of both numbers and letters. I have attached a picture for easy visualization of this.
@2238154 whoa! So, the image on the right shows our (now very old) ID scheme, representing layers from geoBoundaries 3.0. The IDs on the left are our current scheme: hashes based on the geographies, which are (nearly) guaranteed to be unique and, more importantly, don't change unless the underlying data change (the problem with the old approach was that if we, for example, pushed from gb 3.0 to 4.0, our IDs changed even if the underlying data didn't). We're working on better ID join systems now, but those are a little ways off (e.g., interoperability with UN P-Codes, among other considerations). That all aside, I'm surprised you're able to get the data on the right at all any longer - the API that would have returned it was deprecated a while ago. You can technically still get it directly from our archive (https://github.com/wmgeolab/geoBoundariesArchive_3_0_0/tree/7c8dbc599e312d9204e450aecfa66c204b8cf9b8), but rgeoboundaries shouldn't work with that release any longer. To isolate the potential issue here, and just to confirm: are you seeing this issue with the shapes that have the "CUB-ADM-3_0_0-B1" ID pattern? If so, then I think the answer is just going to be to upgrade to our more modern releases. Some other issues for your awareness: our older CGAZ release had an administrative hierarchy field; our current release doesn't retain that, but it will likely be reintroduced soon. Other than that, most things are fairly similar, and just generally better (e.g., we fixed many geometry and encoding errors).
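To illustrate why a hash-based ID stays stable across releases when the geometry itself does not change, here is a small sketch. It is not the actual geoBoundaries ID algorithm, which this thread does not describe. The function name `geometry_id`, the rounding precision, the choice of SHA-256 and the 24-character truncation are all assumptions made for the example.

```python
import hashlib
import json


def geometry_id(geometry, precision=6):
    """Derive a deterministic ID from a GeoJSON-like geometry dict.

    Illustrative only: the real geoBoundaries hashing scheme may differ.
    """
    def rounded(value):
        # Walk nested coordinate arrays and round every number, so that
        # tiny floating-point noise does not change the resulting ID.
        if isinstance(value, (list, tuple)):
            return [rounded(v) for v in value]
        return round(float(value), precision)

    canonical = json.dumps(
        {"type": geometry["type"],
         "coordinates": rounded(geometry["coordinates"])},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:24]
```

Because the ID depends only on the geometry, re-publishing the same polygon under a new release number yields the same ID, while any edit to the coordinates yields a different one.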
@DanRunfola Thank you for your active engagement in resolving the encoding issue! I have reinstalled the package to upgrade to the more modern release as follows: The files have now successfully adopted the numeric shapeID pattern "67162791B11100068414031". However, the encoding issue that causes names to contain the character "?" persists in 151 ADM1 regions and 2142 ADM2 regions. In any case, for my intended use of geoboundaries I will need the administrative hierarchy field spanning admin levels 0-2, so in the meantime I will patiently await its reintroduction alongside any other optimizations. Is there an approximate timeframe for the reintroduction? My colleagues and I use these shapefiles for epidemiological data mapping. As we have already collected data and mapped it to the older versions with the "CUB-ADM-3_0_0-B1" ID pattern, we would like to ask whether it would be feasible to make available a mapping between the new and older shapeIDs. This would greatly facilitate the process of data migration.
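Until an official mapping exists, one rough way to approximate an old-to-new shapeID crosswalk is to pair units by centroid within the same country and admin level. The sketch below assumes each unit has already been reduced to a dict with hypothetical keys `shapeID`, `iso`, `level` and `centroid` (a longitude/latitude pair). Neither the keys nor the function `build_crosswalk` come from geoBoundaries, and the 25 km threshold is arbitrary.

```python
import math


def build_crosswalk(old_units, new_units, max_km=25.0):
    """Pair old and new shapeIDs by nearest centroid within the same
    country and admin level. Returns (crosswalk, unmatched_old_ids)."""
    def haversine_km(a, b):
        # Great-circle distance between two (lon, lat) points in degrees.
        lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2)
             * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(h))

    crosswalk, unmatched = {}, []
    for old in old_units:
        candidates = [n for n in new_units
                      if n["iso"] == old["iso"] and n["level"] == old["level"]]
        best = min(candidates,
                   key=lambda n: haversine_km(old["centroid"], n["centroid"]),
                   default=None)
        if best is not None and haversine_km(old["centroid"], best["centroid"]) <= max_km:
            crosswalk[old["shapeID"]] = best["shapeID"]
        else:
            # No plausible partner: leave it for manual review.
            unmatched.append(old["shapeID"])
    return crosswalk, unmatched
```

Centroid matching will mispair units whose boundaries were redrawn between releases, so anything in the unmatched list, and ideally a sample of the matches too, should be checked by hand.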
Hi @2238154 - Two followups, and apologies on the delay here, missed this in my inbox.
Hello,
I have accessed the SSCGS data across all admin levels for all countries and noticed that 126 ADM1 regions contain one or more ? characters in their names (in some cases the whole name consists of question marks).
These are specifically found in countries: AZE, CZE, DZA, HRV, HUN, JOR, MLT, MNE, POL, RUS, SVK, TKM, TUR and YEM.
The same issue also exists in 3551 ADM2 regions of countries: AZE, BIH, CZE, DZA, FIN, GNB, HRV, HUN, HTI, IRL, JOR, JPN, KAZ, LTU, MAR, MLT, MNE, NOR, POL, RUS, SOM, SVK, TUR and YEM.
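A count like the ones above can be reproduced with a few lines of code once the attribute table has been loaded into a list of records. The sketch below assumes each record is a dict with `shapeGroup` (the country ISO code) and `shapeName` keys. Those field names are an assumption about the attribute table, and `count_damaged_names` is a hypothetical helper.

```python
from collections import Counter


def count_damaged_names(records, marker="?"):
    """Count, per country code, the names that contain the marker character."""
    counts = Counter(
        r["shapeGroup"] for r in records if marker in r["shapeName"]
    )
    # Sort by country code so the summary is easy to compare between runs.
    return dict(sorted(counts.items()))
```

Passing `marker="\ufffd"` instead would look for the Unicode replacement character, which some readers insert in place of a literal question mark.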
I checked whether this was also the case in the HPSCU, HPSCGS and SSCU data, and sadly the same issue exists, although I have not checked whether exactly the same regions have their names modified.
I haven't been able to fix this by enabling UTF-8 encoding, and the issue is not present in the files that can be downloaded directly from the geoboundaries website. I considered using those files to substitute the modified names with the correct ones, but as the shapeID differs between the website files and the rgeoboundaries files, this is not possible.
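One common way for literal question marks to end up in names is a write step that encodes text with a codec lacking some characters and substitutes "?" for each of them. Whether that is what happened to these files is an assumption, but it would explain why enabling UTF-8 on the reading side does not help: the original characters are already gone. A short sketch, with a hypothetical helper name:

```python
def write_with_codec(name, codec):
    """Simulate a writer limited to `codec`: every character the codec
    cannot represent is replaced with '?' and is lost for good."""
    return name.encode(codec, errors="replace").decode(codec)


damaged = write_with_codec("Hradec Králové", "ascii")
print(damaged)  # Hradec Kr?lov?

# Reading the damaged text back as UTF-8 cannot restore the accents,
# because '?' is itself a valid character in UTF-8.
print(damaged.encode("utf-8").decode("utf-8") == damaged)  # True
```

A name written entirely in a script the codec does not cover, such as Arabic text encoded as Latin-1, comes out as nothing but question marks, which matches the worst cases described in this issue.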
For my purposes the rgeoboundaries files are excellent, as unlike the website files they tell me the parent ADM1 and ADM0 region of each ADM2 region.
If this issue could be looked into and corrected I would be very grateful.