-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update WDPA processing #12
base: main
Are you sure you want to change the base?
Conversation
At this point we probably have to accept that with the protected areas data, the process can never be reproducible (we can't store and re-distribute a static dataset, and they update their dataset every few months), so this PR is really just to provide a smoother method to keep up-to-date. To be even more clever, we could scrape the XML of the online datastore: "http://d1gam3xoknrgr2.cloudfront.net/" to get the URL to the latest dataset. But this is a bit of a hassle... |
code snippet on how XML scraping might work to give available date options: import requests
import xmltodict
import re
search_string = "current/WDPA_([\w]+)_Public_shp.zip"
r = requests.get("http://d1gam3xoknrgr2.cloudfront.net/")
doc = xmltodict.parse(r.content)
contents = doc['ListBucketResult']['Contents']
date_options = set()
for content in contents:
search_result = re.search(search_string, content["Key"])
if search_result is not None:
date_options.update(search_result.groups()) This would currently return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but needs a few more changes. Unfortunately, this dataset is really annoying. Thanks for the update.
Sorry, Friday afternoon fever - lots of errors left over in this implementation that I should have spotted sooner (you know, buy doing some simple checks like running it locally...)! I'll make sure it works locally and then let you know when it's ready to look at again :) |
#11 anyone? |
@timtroendle This is now fixed up and works locally. The one thing that is a bit annoying is the error catching: merging the files produces thousands of warnings and occasional at the moment (which don't obviously affect information in columns of interest to us). These then stop snakemake in its tracks (it's in 'bash strict mode'). So, I switch off error catching with |
Hmm I am not sure to understand. Are you saying that ogrmerge throws errors and that leads Snakemake to crash? That would not be good. |
It does throw errors, but not ones that are easy to debug, the output looks like this:
The workflow continues unhindered, however, so whatever issue ogrmerge is having doesn't seem to be a problem. However, I could imagine updating this to run in python, such that geopandas handles the merging. Then it would be easier to keep the relevant columns and to debug any issues? |
I'd feel more comfortable that way. I do not really understand what's going on here. The merging only, without anything else, is a three-liner or so. Plus, it would allow us to move from shape files to geopackage. |
@timtroendle I've updated this to process the shapefiles purely in python. I've pythonified a bunch of other steps too, which takes advantage of needing to open these shapefiles anyway and having the functionality to clean and limit the scope of the GeoDataFrame using methods in |
- Use snakemake 'scripts' to run python script - Rely on renewablepotentialslib for shape helper functions
@timtroendle is there anything holding this up, or shall I merge? |
config/schema.yaml
Outdated
@@ -8,6 +8,9 @@ properties: | |||
type: number | |||
enum: [2006, 2010, 2013, 2016, 2021] | |||
description: Indicates the reference NUTS year | |||
wdpa-version: | |||
type: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could check for the form MMMYYYY
, but maybe that's just too restrictive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, who knows how they'll choose to do versioning in future...
original_crs = points.crs | ||
# convert points to circles | ||
points_in_metres = points.to_crs("epsg:3035") | ||
points_in_metres.geometry = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call points_in_metres.buffer(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update_features | ||
update_features, | ||
estimate_polygons_from_points, | ||
_radius_meter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to test this private function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why testing private functions is any less necessary than public ones. Maybe this one in particular doesn't need testing, but there's no harm in it, right?
lib/tests/test_shape_utils.py
Outdated
} | ||
point_gdf = gpd.GeoDataFrame( | ||
# x = longitude, y = latitude | ||
geometry=gpd.points_from_xy([i[1] for i in points.values()], [i[0] for i in points.values()]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is geometry=gpd.points_from_xy(zip(*points.values()))
,. no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't quite, because coordinates are usually communicated as [lat, lon]
, but [x, y]
is equivalent to [lon, lat]
. Anyway, I'll update to be less verbose
rules/data-preprocessing.smk
Outdated
params: | ||
version = config["parameters"]["wdpa-version"] | ||
output: temp("build/raw-wdpa") | ||
conda: "../envs/default.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conda env is useless here.
rules/data-preprocessing.smk
Outdated
@@ -295,15 +299,8 @@ rule protected_areas_in_europe: | |||
bounds = "{x_min},{y_min},{x_max},{y_max}".format(**config["scope"]["bounds"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bounds are unused now.
rules/data-preprocessing.smk
Outdated
@@ -268,25 +286,11 @@ rule slope_in_europe: | |||
""" | |||
|
|||
|
|||
rule protected_areas_points_to_circles: | |||
message: "Estimate shape of protected areas available as points only." | |||
rule protected_areas_in_europe_rasterised: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming scheme within this repository was such that xx_in_europe
indicated a rasterised dataset with same bounds and resolution. I acknowledge this is not a great naming scheme, but I'd stick to it or update it everywhere.
rules/data-preprocessing.smk
Outdated
unzip {input} *.zip -d {output} | ||
unzip -o {output}/WDPA_{params.version}_Public_shp_0.zip -d {output}/WDPA_to_merge_0 | ||
unzip -o {output}/WDPA_{params.version}_Public_shp_1.zip -d {output}/WDPA_to_merge_1 | ||
unzip -o {output}/WDPA_{params.version}_Public_shp_2.zip -d {output}/WDPA_to_merge_2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach seems brittle: what if the dataset will contain 4 zips in the future? Can we make this a loop instead?
|
||
@pytest.fixture | ||
def estimated_polygons(self, points): | ||
return estimate_polygons_from_points(points, "REP_AREA") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This belongs into the test.
https://docs.pytest.org/en/stable/fixture.html#what-fixtures-are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I would say this is part of the arranging process. As given in the example of what fixtures are, the calls to the class being tested are all part of fixtures. This is the same process I'm undertaking here. I then only assert in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see it that way. estimate_polygons_from_points
is the behaviour/functionality you are testing, so it belongs into act
rather than arrange
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, ok. I can remove it, but it makes sense as a fixture as it has three tests associated with the same function call. Better to do that once than thrice...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I checked the pytest doc a little longer and it seems we are both partly right according to their definitions:
In pytest, “fixtures” are functions you define that serve this purpose. But they don’t have to be limited to just the arrange steps. They can provide the act step, as well, and this can be a powerful technique for designing more complex tests, especially given how pytest’s fixture system works. But we’ll get into that further down.
See this example: https://docs.pytest.org/en/stable/fixture.html#fixtures-can-be-requested-more-than-once-per-test-return-values-are-cached
lib/tests/test_shape_utils.py
Outdated
assert np.isclose(calculated_radius, radius) | ||
|
||
def test_estimate_polygons_area(self, points, estimated_polygons): | ||
assert estimated_polygons.crs == points.crs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually there's one assert
per test. This assert
here seems unnecessary: you are testing this in the test below!
Public
in string) and structures (three separate shapes that need merging)