ADM1 and ADM2 shape names contain ? (UTF-8 encoding issue?) #22

Open
2238154 opened this issue Nov 1, 2023 · 6 comments

Comments

@2238154

2238154 commented Nov 1, 2023

Hello,

I have accessed the SSCGS data across all admin levels for all countries and have noticed that 126 ADM1 regions contain one or more ? characters within their names (in some cases the whole name consists of question marks).

These are specifically found in countries: AZE, CZE, DZA, HRV, HUN, JOR, MLT, MNE, POL, RUS, SVK, TKM, TUR and YEM.

The same issue also exists in 3551 ADM2 regions of countries: AZE, BIH, CZE, DZA, FIN, GNB, HRV, HUN, HTI, IRL, JOR, JPN, KAZ, LTU, MAR, MLT, MNE, NOR, POL, RUS, SOM, SVK, TUR and YEM.

I checked whether this was also the case in the HPSCU, HPSCGS and SSCU data, and sadly the same issue exists, although I have not checked whether it's exactly the same regions that have their names modified.

I haven't been able to fix this by enabling UTF-8 encoding, and the issue is not present in the files that can be downloaded directly from the geoboundaries website. I considered using those files to substitute the modified names with the correct ones, but as the shapeID differs between the website files and the rgeoboundaries files this is not possible.

For my purposes the rgeoboundaries files are excellent, as they tell me which ADM1 and ADM0 regions are the parents of each ADM2 region, unlike the website files.

If this issue could be looked into and corrected I would be very grateful.
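
For anyone wanting to reproduce counts like the ones above, here is a minimal sketch in base R. It assumes the attribute table has `shapeGroup` (country code) and `shapeName` columns; the small data frame is a made-up stand-in, not real geoBoundaries output.

```r
# Hypothetical stand-in for the attribute table returned by gb_adm1();
# the column names and every value here are assumptions for illustration.
adm1 <- data.frame(
  shapeGroup = c("HRV", "HRV", "POL", "FRA"),
  shapeName  = c("Zagreb", "?ibensko-kninska", "??dzkie", "Bretagne"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches "?" as a literal character rather than a regex quantifier.
has_qmark <- grepl("?", adm1$shapeName, fixed = TRUE)

n_affected         <- sum(has_qmark)                             # number of affected regions
affected_countries <- sort(unique(adm1$shapeGroup[has_qmark]))   # which countries are hit
per_country        <- table(adm1$shapeGroup[has_qmark])          # affected regions per country
```

Running the same three lines on each admin level and data type would show whether the affected regions overlap between SSCGS, HPSCU, HPSCGS and SSCU.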

@DanRunfola
Member

@rohith4444 When you get back, can you take a look at this?

@2238154 Agreed this must be an encoding issue, but I wonder if this is something with our most recent data. When you say "the files that can be downloaded directly from the geoboundaries website", do you mean the GUI from geoboundaries.org? Those files should be identical to the files that can be retrieved through the R API now, so I'm not sure why you would be seeing different shapeIDs.

Separately, one thing I'm working on is giving users the ability to specify a version of geoBoundaries to "lock" the files they get - right now, the R API returns the most recent boundaries on each call. Coming soon, hopefully.

@dickoa
Collaborator

dickoa commented Nov 2, 2023

@DanRunfola the encoding problem might be related to the encoding of the shapefiles for the CGAZ dataset. You can see (by inspecting the dbf) that we have some encoding issues; I think it can be fixed upstream. I'm not sure if we have the same issue with GeoPackage or GeoJSON.
I'm also excited about a robust data versioning system. Let me know when we can start thinking about including it in the package.
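
As an illustration of why an encoding problem in the dbf would need an upstream fix, the base R sketch below shows one common way a literal "?" ends up in a name: a lossy conversion to a character set that cannot represent the original letters. This is an assumed mechanism for illustration only; the thread does not confirm that it is what happened to the CGAZ files.

```r
# "\u0160ibenik" is "Šibenik" written with an escape so the script stays ASCII-safe.
name_utf8 <- "\u0160ibenik"

# Converting to a character set that lacks the letter, with "?" as the substitute,
# replaces the unrepresentable bytes with question marks.
lossy <- iconv(name_utf8, from = "UTF-8", to = "ASCII", sub = "?")

# Converting back cannot recover the original: the "?" is now real data.
restored <- iconv(lossy, from = "ASCII", to = "UTF-8")

contains_qmark   <- grepl("?", lossy, fixed = TRUE)
is_recoverable   <- identical(restored, name_utf8)
```

If the names were damaged in this way before the files were published, changing the encoding setting on the reading side cannot bring the letters back, which would be consistent with the reporter's experience.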

@2238154
Author

2238154 commented Nov 3, 2023

Hi @DanRunfola, by the geoboundaries website I am indeed referring to the files that can be downloaded from geoboundaries.org: not the CGAZ dataset, but the global or single-country files under the Download Data tab.

Regarding the shapeID difference I mentioned, the difference is essentially that the website files have a numeric string, whereas the rgeoboundaries files have a string composed of both numbers and letters. I have attached a picture for easy visualization of this.

[Image: shapeID_difference]

@DanRunfola
Member

@2238154 whoa! So, the image on the right is our (very old, now) ID scheme, representing layers from geoBoundaries 3.0. The IDs on the left are our current scheme: hashes based on the geographies, which are (nearly) guaranteed to be unique and, more importantly, don't change unless the underlying data change. The problem with the old approach was that if we, for example, pushed from gb 3.0 to 4.0, our IDs changed even if the underlying data didn't. We're working on better ID join systems now, but those are a little ways off (e.g., interoperability with UN P-Codes, among other considerations).

That all aside, I'm surprised you're able to get the data on the right at all any longer - the API that would have returned it was deprecated a bit ago. You can technically still get it directly from our archive (https://github.com/wmgeolab/geoBoundariesArchive_3_0_0/tree/7c8dbc599e312d9204e450aecfa66c204b8cf9b8), but the rgeoboundaries package shouldn't work any longer with that release.

To isolate the potential issue here, and just to confirm: are you seeing this issue with the shapes that have the "CUB-ADM-3_0_0-B1" ID pattern? If so, then I think the answer is just going to be a need to upgrade to our more modern releases.

Some other issues for your awareness: our older CGAZ release had an administrative hierarchy field; our current release doesn't retain that, but it will likely be reintroduced soon. Other than that, most things are fairly similar, and just generally better (e.g., we fixed many geometry and encoding errors).
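
To make the idea of geometry-derived IDs concrete, here is a toy sketch in base R. It is not the geoBoundaries hashing scheme (the thread does not describe it); MD5, the coordinate formatting, and the function name `geometry_id` are all arbitrary choices for illustration. The point is only that an ID computed from the coordinates stays the same across releases as long as the geometry does.

```r
# Toy geometry-derived ID: hash a fixed-precision text form of the coordinates.
# NOT the actual geoBoundaries algorithm; MD5 is an arbitrary choice for this sketch.
geometry_id <- function(coords) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # Serialise with fixed precision so identical geometries give identical text.
  writeLines(sprintf("%.6f,%.6f", coords[, 1], coords[, 2]), tmp)
  unname(tools::md5sum(tmp))
}

square <- cbind(x = c(0, 1, 1, 0, 0), y = c(0, 0, 1, 1, 0))
edited <- square
edited[2, "x"] <- 1.5                    # one vertex moved

id_release_a <- geometry_id(square)      # geometry as published in one release
id_release_b <- geometry_id(square)      # same geometry, republished later
id_edited    <- geometry_id(edited)      # geometry actually changed
```

Here `id_release_a` and `id_release_b` are identical, while `id_edited` differs, which is the property that a release-numbered ID such as "CUB-ADM-3_0_0-B1" lacks.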

@2238154
Author

2238154 commented Nov 7, 2023

@DanRunfola Thank you for your active engagement in resolving the encoding issue!
Yes, the name issue was in the files with the "CUB-ADM-3_0_0-B1" ID pattern.

I have reinstalled the package to upgrade to the more modern release as follows:
remotes::install_github("wmgeolab/rgeoboundaries", force = TRUE)
library(rgeoboundaries)
test1 <- gb_adm1(type = "sscgs", quiet = TRUE)
test2 <- gb_adm2(type = "sscgs", quiet = TRUE)

The files have now successfully adopted the numeric shapeID pattern "67162791B11100068414031". However, the encoding issue causing names to contain the character "?" persists in 151 ADM1 regions and 2142 ADM2 regions.

In any case, for my intended use of geoboundaries I will require the administrative hierarchy field spanning admin levels 0-2, so in the meantime I will patiently await its reintroduction alongside any other optimizations. Is there an approximate timeframe for the reintroduction? My colleagues and I use these shapefiles for epidemiological data mapping. As we have already collected data and mapped it to the older versions with the "CUB-ADM-3_0_0-B1" ID pattern, we would like to ask whether it would be feasible to make available a mapping between the new and older shapeIDs. This would greatly facilitate the process of data migration.

@DanRunfola
Member

Hi @2238154 - Two followups, and apologies on the delay here, missed this in my inbox.

  1. I would imagine that the hierarchies will be introduced with our 7.0 (next) release, which normally lands in the summer months. We'll see how the semester goes, though - if I get a block of time to hammer this out, it may be sooner.
  2. For crosswalking IDs - this is a much broader challenge we're working on, and not just within geoBoundaries (i.e., we would like to let users crosswalk to any other data sources). The current plan is to offer tabular, joinable files that provide columns of the current gB ID, alongside join-codes for past gB major releases (e.g., 3.0), as well as a few other commonly used ID systems (e.g., ISO codes, maybe P-Codes). Probably a similar timeframe on this (7.0 with the possibility of sooner).
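
Once such a tabular crosswalk exists, migrating data keyed on old IDs could look roughly like the base R sketch below. The crosswalk file does not exist yet, so the column names (`shapeID_v3`, `shapeID_current`) and every ID value are hypothetical placeholders.

```r
# Hypothetical crosswalk table; column names and ID values are invented
# for illustration and do not come from any published geoBoundaries file.
crosswalk <- data.frame(
  shapeID_v3      = c("CUB-ADM-3_0_0-B1", "CUB-ADM-3_0_0-B2", "CUB-ADM-3_0_0-B3"),
  shapeID_current = c("newA", "newB", "newC"),
  stringsAsFactors = FALSE
)

# Existing project data keyed on the old 3.0-style IDs (also invented).
cases <- data.frame(
  shapeID_v3 = c("CUB-ADM-3_0_0-B1", "CUB-ADM-3_0_0-B3", "CUB-ADM-3_0_0-B9"),
  n_cases    = c(12L, 7L, 3L),
  stringsAsFactors = FALSE
)

# Left join: keep every row of the project data, attach the current ID where known.
migrated <- merge(cases, crosswalk, by = "shapeID_v3", all.x = TRUE)

# Rows with no current ID need manual review (e.g., units that were split or merged).
unmatched <- migrated[is.na(migrated$shapeID_current), ]
```

Keeping the unmatched rows visible, rather than dropping them with an inner join, makes it easier to spot regions whose boundaries changed between releases.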
