
Update WDPA processing #12

Open: wants to merge 14 commits into main

Conversation

brynpickering (Member)

  • Add WDPA-date config param
  • Add schema
  • Handle the latest batch of WDPA file names (incl. "Public" in the string) and structures (three separate shapefiles that need merging)

brynpickering (Member Author)

At this point we probably have to accept that, with the protected areas data, the process can never be reproducible (we can't store and re-distribute a static dataset, and they update their dataset every few months), so this PR is really just there to provide a smoother method for keeping up to date.

To be even more clever, we could scrape the XML of the online datastore: "http://d1gam3xoknrgr2.cloudfront.net/" to get the URL to the latest dataset. But this is a bit of a hassle...

brynpickering (Member Author)

Code snippet for how the XML scraping might work, listing the available date options:

import re

import requests
import xmltodict

# keys in the bucket listing look like "current/WDPA_<MonYYYY>_Public_shp.zip"
search_string = r"current/WDPA_(\w+)_Public_shp\.zip"
r = requests.get("http://d1gam3xoknrgr2.cloudfront.net/")
doc = xmltodict.parse(r.content)
contents = doc["ListBucketResult"]["Contents"]
date_options = set()
for content in contents:
    search_result = re.search(search_string, content["Key"])
    if search_result is not None:
        date_options.update(search_result.groups())

This would currently return {'Mar2021'}
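
If the newest version were ever needed automatically, the options could be compared by parsing the month-year strings. A minimal sketch, assuming the MonYYYY format holds:

import datetime

# pick the most recent option by parsing e.g. "Mar2021" with the "%b%Y" format
latest = max(date_options, key=lambda option: datetime.datetime.strptime(option, "%b%Y"))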

timtroendle (Member) left a comment

Looks good, but needs a few more changes. Unfortunately, this dataset is really annoying. Thanks for the update.

Review threads on rules/data-preprocessing.smk (8) and config/schema.yaml (1), all resolved.
brynpickering (Member Author)

Sorry, Friday afternoon fever: lots of errors left over in this implementation that I should have spotted sooner (you know, by doing some simple checks like running it locally...)!

I'll make sure it works locally and then let you know when it's ready to look at again :)

timtroendle (Member)

#11 anyone?

brynpickering (Member Author)

@timtroendle This is now fixed up and works locally.

The one thing that is a bit annoying is the error catching: merging the files currently produces thousands of warnings and the occasional error (none of which obviously affect the columns of interest to us). These then stop Snakemake in its tracks (it's in 'bash strict mode'). So, I switch off error catching for ogrmerge and rely on the fact that, if there is a problem, it will show up as the outputs not being successfully generated. Do you think this is the cleanest way of doing it?

timtroendle (Member)

Hmm, I am not sure I understand. Are you saying that ogrmerge throws errors and that this leads Snakemake to crash? That would not be good.

brynpickering (Member Author)

It does throw errors, but not ones that are easy to debug; the output looks like this:

Warning 1: Value 555510160 of field WDPAID of feature 115190 not successfully written. Possibly due to too larger number with respect to field width
Warning 1: Value 555510162 of field WDPAID of feature 115191 not successfully written. Possibly due to too larger number with respect to field width
[... 998 more lines of much the same ...]
More than 1000 errors or warnings have been reported. No more will be reported from now.

The workflow continues unhindered, so whatever issue ogrmerge is having doesn't seem to be a problem. That said, I could imagine updating this to run in Python, with geopandas handling the merging. Then it would perhaps be easier to keep only the relevant columns and to debug any issues.

timtroendle (Member)

However, I could imagine updating this to run in python, such that geopandas handles the merging.

I'd feel more comfortable that way. I do not really understand what's going on here. The merging alone, without anything else, is a three-liner or so. Plus, it would allow us to move from shapefiles to GeoPackage.
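
For illustration, a minimal sketch of what that merge could look like in geopandas, writing straight to GeoPackage. The file paths here are assumptions, not the actual layout of the WDPA archives:

import geopandas as gpd
import pandas as pd

# hypothetical paths to the shapefiles inside the three unzipped archives
paths = [f"build/wdpa/WDPA_to_merge_{i}/WDPA_Mar2021_Public_shp-polygons.shp" for i in range(3)]
gdfs = [gpd.read_file(path) for path in paths]
# concatenate into one frame and write it out as a single GeoPackage layer
merged = gpd.GeoDataFrame(pd.concat(gdfs, ignore_index=True), crs=gdfs[0].crs)
merged.to_file("build/wdpa/protected-areas.gpkg", driver="GPKG")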

brynpickering (Member Author)

@timtroendle I've updated this to process the shapefiles purely in Python. I've pythonified a bunch of other steps too, taking advantage of the fact that we need to open these shapefiles anyway and can reuse the methods in administrative_borders.py to clean the GeoDataFrame and limit its scope.

brynpickering added 4 commits April 8, 2021 09:45
brynpickering (Member Author)

@timtroendle is there anything holding this up, or shall I merge?

@@ -8,6 +8,9 @@ properties:
    type: number
    enum: [2006, 2010, 2013, 2016, 2021]
    description: Indicates the reference NUTS year
  wdpa-version:
    type: string

timtroendle (Member)

This could check for the form MMMYYYY, but maybe that's just too restrictive.

brynpickering (Member Author)

yeah, who knows how they'll choose to do versioning in future...
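
For reference, a sketch of what an MMMYYYY check could look like via a JSON-Schema pattern, validated here with the jsonschema library purely for illustration; the schema excerpt is an assumption, not the PR's actual schema:

import jsonschema

# hypothetical wdpa-version entry with an MMMYYYY pattern constraint
wdpa_version_schema = {
    "type": "string",
    "pattern": "^[A-Z][a-z]{2}\\d{4}$",  # e.g. "Mar2021"
}
jsonschema.validate("Mar2021", wdpa_version_schema)  # passes silently
# jsonschema.validate("March 2021", wdpa_version_schema) would raise a ValidationError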

original_crs = points.crs
# convert points to circles
points_in_metres = points.to_crs("epsg:3035")
points_in_metres.geometry = [

timtroendle (Member)

Why not call points_in_metres.buffer(...)?
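
For illustration, a sketch of the buffer-based alternative. The function and helper names mirror those used in the tests, but this is an assumption-laden illustration: REP_AREA is taken to be the reported area in km², and recent geopandas versions are assumed to accept an array of per-row buffer distances:

import geopandas as gpd
import numpy as np

def _radius_meter(area_sq_km):
    # radius of a circle with the given area, assuming the area is reported in km^2
    return np.sqrt(area_sq_km * 1e6 / np.pi)

def estimate_polygons_from_points(points: gpd.GeoDataFrame, area_column: str) -> gpd.GeoDataFrame:
    original_crs = points.crs
    points_in_metres = points.to_crs("epsg:3035")  # metre-based CRS, so buffer distances are in metres
    # buffer with one distance per row instead of building geometries in a list comprehension
    points_in_metres.geometry = points_in_metres.buffer(_radius_meter(points_in_metres[area_column]))
    return points_in_metres.to_crs(original_crs)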


update_features
update_features,
estimate_polygons_from_points,
_radius_meter

timtroendle (Member)

Is it necessary to test this private function?

brynpickering (Member Author)

I don't see why testing private functions is any less necessary than public ones. Maybe this one in particular doesn't need testing, but there's no harm in it, right?

}
point_gdf = gpd.GeoDataFrame(
    # x = longitude, y = latitude
    geometry=gpd.points_from_xy([i[1] for i in points.values()], [i[0] for i in points.values()]),

timtroendle (Member)

This is geometry=gpd.points_from_xy(zip(*points.values())), no?

brynpickering (Member Author)

It isn't quite, because coordinates are usually communicated as [lat, lon], whereas [x, y] is equivalent to [lon, lat]. Anyway, I'll update it to be less verbose.
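
A sketch of the less verbose form, assuming (as in the test fixture) that points maps names to (lat, lon) tuples; the sample values are purely illustrative:

import geopandas as gpd

# hypothetical input: name -> (lat, lon)
points = {"somewhere": (51.5, -0.1), "elsewhere": (48.9, 2.4)}

lats, lons = zip(*points.values())
point_gdf = gpd.GeoDataFrame(
    geometry=gpd.points_from_xy(x=lons, y=lats),  # x = longitude, y = latitude
    crs="epsg:4326",
)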

params:
    version = config["parameters"]["wdpa-version"]
output: temp("build/raw-wdpa")
conda: "../envs/default.yaml"

timtroendle (Member)

This conda env is useless here.

@@ -295,15 +299,8 @@ rule protected_areas_in_europe:
bounds = "{x_min},{y_min},{x_max},{y_max}".format(**config["scope"]["bounds"])

timtroendle (Member)

Bounds are unused now.

@@ -268,25 +286,11 @@ rule slope_in_europe:
"""


rule protected_areas_points_to_circles:
message: "Estimate shape of protected areas available as points only."
rule protected_areas_in_europe_rasterised:

timtroendle (Member)

The naming scheme within this repository was such that xx_in_europe indicated a rasterised dataset with the same bounds and resolution. I acknowledge this is not a great naming scheme, but I'd stick to it or update it everywhere.

unzip {input} *.zip -d {output}
unzip -o {output}/WDPA_{params.version}_Public_shp_0.zip -d {output}/WDPA_to_merge_0
unzip -o {output}/WDPA_{params.version}_Public_shp_1.zip -d {output}/WDPA_to_merge_1
unzip -o {output}/WDPA_{params.version}_Public_shp_2.zip -d {output}/WDPA_to_merge_2

timtroendle (Member)

This approach seems brittle: what if the dataset contains 4 zips in the future? Can we make this a loop instead?
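
One way to avoid hard-coding the count would be to loop over whatever inner archives are present. Sketched here in Python rather than in the rule's shell block; the paths mirror the rule above, but this is an illustration, not the PR's implementation:

import zipfile
from glob import glob

version = "Mar2021"  # would come from config["parameters"]["wdpa-version"]
inner_zips = sorted(glob(f"build/raw-wdpa/WDPA_{version}_Public_shp_*.zip"))
for i, inner_zip in enumerate(inner_zips):
    # extract each inner archive into its own directory, however many there are
    with zipfile.ZipFile(inner_zip) as archive:
        archive.extractall(f"build/raw-wdpa/WDPA_to_merge_{i}")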


@pytest.fixture
def estimated_polygons(self, points):
    return estimate_polygons_from_points(points, "REP_AREA")

brynpickering (Member Author)

Hmm, I would say this is part of the arranging process. In the pytest example of what fixtures are, the calls to the class being tested are all made in fixtures. That's the same process I'm following here; I then only assert in the test.

timtroendle (Member)

I don't see it that way. estimate_polygons_from_points is the behaviour/functionality you are testing, so it belongs in act rather than arrange.

brynpickering (Member Author)

Hmm, ok. I can remove it, but it makes sense as a fixture since three tests are associated with the same function call. Better to do that once than thrice...

timtroendle (Member)

Ok, I checked the pytest doc a little longer and it seems we are both partly right according to their definitions:

In pytest, “fixtures” are functions you define that serve this purpose. But they don’t have to be limited to just the arrange steps. They can provide the act step, as well, and this can be a powerful technique for designing more complex tests, especially given how pytest’s fixture system works. But we’ll get into that further down.

See this example: https://docs.pytest.org/en/stable/fixture.html#fixtures-can-be-requested-more-than-once-per-test-return-values-are-cached
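
For context, a minimal illustration of the behaviour described in that linked section, with purely illustrative names: within a single test, a fixture's return value is computed once and cached, even if it is requested again.

import pytest

@pytest.fixture
def expensive_setup():
    # evaluated at most once per test, even if requested more than once
    return object()

def test_fixture_value_is_cached(expensive_setup, request):
    requested_again = request.getfixturevalue("expensive_setup")
    assert requested_again is expensive_setup  # the same cached object within this test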

    assert np.isclose(calculated_radius, radius)

def test_estimate_polygons_area(self, points, estimated_polygons):
    assert estimated_polygons.crs == points.crs

timtroendle (Member)

Usually there's one assert per test. This assert here seems unnecessary: you are testing this in the test below!
