Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/region support #326

Open
wants to merge 33 commits into
base: develop
Choose a base branch
from

Conversation

blychs
Copy link
Collaborator

@blychs blychs commented Dec 31, 2024

Adds new region support in the form of a specific utility (util/region_select.py). The docs are updated accordingly.
This includes:
Old capabilities, but using xarray instead of pandas (and the .where method instead of a query). The query is still and option in the utility, but due to how I change the driver, I don't see how it would really be required by the code. xarray has a "query" method, but I couldn't make it work and I believe it requires the use of a dimension to query along, which I don't think we want.

New, advanced region support.
auto-region now includes auto-region:custom, allowing the user to provide a lonlat box. It is currently somewhat limited, though, and it cannot cross the antimeridian. The box has to be provided in the new keyword domain_info.

New, advanced region support with regionmask. These require the use of the new keyword domain_info. The new capabilities include:

  • Defining one or more polygons with custom:auto_polygon in domain_type. Holes in the polygon are permitted by regionmask, but I have not added that capability yet (I am not sure about the best way to do it, since it would require turning things into another dictionary, I believe).
  • Using regionmask's defined_regions method with custom:defined_regions. These need to be defined, once again, in domain_info.
  • Using a shapefile/geojson with custom:custom_file. The path or URL of the file need to be defined in domain_info. There are a few undesirable thing about how I did the automatic download, though, and if you have better suggestions I'd be happy to change the code:
      - I am not using pooch, but downloading the files manually. The reason for this is that I was not able to find a way to tell pooch to use the content-disposition keyword of a URL when downloading, leading to errors when the URL did not end with the name of the file. I'd rather not make the user provide that, since it's not always certain. I'm sure that there must be a way, but I could not find it.
      - The code does not check if the file exists, and just overwrites it. If you ask for multiple tasks using the same domain, it downloads it again every time. This is silly, and could be avoided by using pooch, which would also test the checksum. Once again, I could not find an easy way to avoid this when the URL does not end with the appropriate file name, which happens quite often (for example, in my tests).

If you have any solutions for this, I would appreciate them. Otherwise, I'd suggest moving forward with this.
Please check also the changes in the docs. My English is far from perfect and some proofreading (and corrections!) by people with a better English than me would be great.

I am uploading here the yaml file I used to test all of this, so that it can get tested again. I tested it against surface data. Although I didn't do a complete test against TEMPO (I didn't want to add changes to the driver until I fix the merge conflicts, hopefully tomorrow) I did test the individual functions and plotted the results. Testing against AEROMMA data for aircraft would be great. Otherwise, I can try it with ASIA-AQ. Since GitHub does not allow me
I also provide here my functions for testing individual options. I did not build proper unit tests, since just asserting whether the types of data are correct or not completely NaN does not seem as useful as looking at the plots.

You will see that I did not plot the multiboxplot nor the scorecards with the regions defined with regionmask. There is probably a way to do it (by adding the mask to the dataset and not only selecting the region, possibly in a copy to avoid changing it), but it seemed a little bit confusing.
test_regionmask_tool.zip

Cheers,
Pablo

Edit: I requested some reviewers that seemed reasonable, but feel free to change that.

        This commit has advantages and disadvantages:

        Advantages
        ----------
        - It does not require pooch, and only uses the standard library.
        - It can deal with URL's not ending with the filename.

        Disadvantages
        -------------
        - It downloads the files locally instead of into the cache (may be
          actually good).
        - It does not add any checksum to the name, risking overwriting files.
	It hangs if latitude and longitude are not coordinates
	-The code was originally looking for
	 data["domain_type"].cf == "domain_name"
	 instead of data["domain_type"] == "domain_name".
	-auto-region:custom_box was wirtten as auto-region:custom.
	Better tests are provided in the PR.
@blychs
Copy link
Collaborator Author

blychs commented Dec 31, 2024

The test that is failing is the Copyright notice, which I wrote according to #321

docs/appendix/yaml.rst Outdated Show resolved Hide resolved
docs/appendix/yaml.rst Outdated Show resolved Hide resolved
@blychs
Copy link
Collaborator Author

blychs commented Feb 4, 2025

Hi all, I would like to leverage on this PR for xarray_plots. Therefore, the sooner we can get this merged, the better. Let me know any edits that should be done.
I would like to finalize xarray_plots as soon as possible. I've been checking and think that I might be able to use this to get rid of pandas for surface and satplots. I haven't looked at aircraftplots yet.

Copy link
Collaborator

@rschwant rschwant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I think all the original capabilities are still possible and the new options are useful too. The naming works for me. I just had one technical question and then I'm happy to approve if Zach @zmoon you also agree?

melodies_monet/driver.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@rschwant rschwant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple small updates, I think and this is ready.

docs/users_guide/region_selection.rst Outdated Show resolved Hide resolved
docs/users_guide/region_selection.rst Outdated Show resolved Hide resolved
melodies_monet/util/region_select.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@rschwant rschwant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for making these updates!

blychs and others added 2 commits February 10, 2025 09:42
correct spelling and typos

Co-authored-by: Zachary Moon <[email protected]>
Add raise_for_status()

Co-authored-by: Zachary Moon <[email protected]>
blychs and others added 4 commits February 13, 2025 10:40
Make the content_disposition more robust if missing

Co-authored-by: Zachary Moon <[email protected]>
boolean type hint to bool

Co-authored-by: Zachary Moon <[email protected]>
Add space in docs

Co-authored-by: Zachary Moon <[email protected]>
@zmoon
Copy link
Collaborator

zmoon commented Feb 13, 2025

@blychs did you have any thoughts about these pooch comments from @DWesl?

It doesn't look like pooch has a way to set the file name after the download has started, but you might be able to guess from the URL. Unfortunately, there's enough variation in how the URL encodes information about the resource it points to that a universal method of composing a simple filename obviously related to the URL is probably not possible.

Pooch's default involves a hash of the URL, would using just the hash without the guess at the filename work? It's not as obviously related to the URL if you want to find the downloaded file in pooch's cache, but the code will be able to reconstruct it easily. It might be worth adding some part of the domain/hostname in the hash to get a little association between URL and cached file name and to reduce the chance of filename collisions.

@blychs
Copy link
Collaborator Author

blychs commented Feb 13, 2025

@blychs did you have any thoughts about these pooch comments from @DWesl?

It doesn't look like pooch has a way to set the file name after the download has started, but you might be able to guess from the URL. Unfortunately, there's enough variation in how the URL encodes information about the resource it points to that a universal method of composing a simple filename obviously related to the URL is probably not possible.
Pooch's default involves a hash of the URL, would using just the hash without the guess at the filename work? It's not as obviously related to the URL if you want to find the downloaded file in pooch's cache, but the code will be able to reconstruct it easily. It might be worth adding some part of the domain/hostname in the hash to get a little association between URL and cached file name and to reduce the chance of filename collisions.

I just answered under that topic. I think the idea is quite nice, but I don't find an easy way to make it work, because I believe that we need to have the extension of the file. Might be wrong though, or there might be a way to infer it.

@DWesl do you have any input on this? If not, I would be happy to see this PR merged, but if there is an easy fix, then I'm also happy to move forward with it

@zmoon
Copy link
Collaborator

zmoon commented Feb 13, 2025

@blychs If you can merge upstream develop into your branch the file header check should pass, if you add an additional # line (see the recent change).

@blychs
Copy link
Collaborator Author

blychs commented Feb 13, 2025

Done

@DWesl
Copy link

DWesl commented Feb 18, 2025

I just answered under that topic. I think the idea is quite nice, but I don't find an easy way to make it work, because I believe that we need to have the extension of the file. Might be wrong though, or there might be a way to infer it.

That's a pooch error then? In that case, parsing through fmt=shp vs. format=shapefile vs. name=file.shp in the parameters vs an actual filename before the parameters would be a never-ending pain.

You could create and update a mapping of URL to filename if you wanted to get back some of the caching pooch does, if you take steps to avoid filename collisions. (If you're not able to persist the mapping between script invocations, you could set up an atexit handler to delete all the files when Python exits).

If you are already using this as a fallback filename:

if fname is None:
    fname = url.rsplit("/", 1)[-1]

changing to have fname = os.path.basename(urllib.parse.urlparse(url).path) might produce a normal filename in more cases, but it might also get rid of information you want to keep . I don't know how much of a concern weird non-descriptive filenames would be, but os.path.basename(url) might be a bit more clear about what you're doing if you want to keep the current approach. (Pooch's default would then be something like "-".join([hashlib.sha1(url.encode('utf-8')).hexdigest(), os.path.basename(urllib.parse.urlparse(url).path)]) or "-".join([hashlib.sha1(url.encode('utf-8')).hexdigest(), os.path.basename(url)]), which are less likely to overwrite other files from the same source, but if you always re-download the file that's less relevant)

You could also do this assignment unconditionally before the Content-Disposition check and overwrite it if the response indicates a preferred filename so fname always has a string value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants