Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend ACCESS-OM3 filename regex #178

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Conversation

dougiesquire
Copy link
Collaborator

Closes #176. Decided just to add a quick fix for this. However, this whole process needs to be revisited soon as it clearly isn't scaling to the range of ACCESS outputs.

Copy link

codecov bot commented Jul 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.33%. Comparing base (6be33d8) to head (06c298e).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #178      +/-   ##
==========================================
+ Coverage   96.98%   97.33%   +0.35%     
==========================================
  Files           9        9              
  Lines         631      902     +271     
==========================================
+ Hits          612      878     +266     
- Misses         19       24       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@anton-seaice anton-seaice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this won't work if the filename patterns change during the experiment? e.g. half the results use ...1900-XX-XX.nc and the other half use 1900-01-daily.nc ?

),
),
(
"access-om3.cice.h.1900-01-daily",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add "access-om3.cice.h.daily.1900-01" ?

@dougiesquire
Copy link
Collaborator Author

I think this won't work if the filename patterns change during the experiment? e.g. half the results use ...1900-XX-XX.nc and the other half use 1900-01-daily.nc ?

I'm not sure what you mean by "won't work" here. They would get distinct fileids and so be considered distinct datasets, which I think is the correct behaviour for the way things are set up currently.

@dougiesquire
Copy link
Collaborator Author

I think we're now heading (back) towards using CMIP6-like vocab for the frequency

E.g. access-om3.cice.h.1day.mean.1900-01.nc

So let's put a pin in the PR until the naming is resolved.

@dougiesquire
Copy link
Collaborator Author

dougiesquire commented Jul 31, 2024

Okay, I think we've finialised a format - see here. I'll update this PR.

@anton-seaice
Copy link
Collaborator

I'm not sure what you mean by "won't work" here. They would get distinct fileids and so be considered distinct datasets, which I think is the correct behaviour for the way things are set up currently.

My concern is that the archive script which turns "access-om3.cice.h.1900-01-01" into "access-om3.cice.h.1900-01-daily" might fail, and then get fixed. Meaning the catalog ends up with two fileids when ideally we want them to be considered the same dataset.

@dougiesquire
Copy link
Collaborator Author

dougiesquire commented Jul 31, 2024

My concern is that the archive script which turns "access-om3.cice.h.1900-01-01" into "access-om3.cice.h.1900-01-daily" might fail, and then get fixed. Meaning the catalog ends up with two fileids when ideally we want them to be considered the same dataset.

Right, this sort of issue can't really be handled with the way dataset grouping is done at the moment. The fix for this currently would be to go back and concatenate the files that were missed. I think in your specific example that's actually the best solution anyway. But we will be revisiting how the grouping is done soon to make it more robust.

@anton-seaice
Copy link
Collaborator

I think with access-om3.cice.1day.mean.1900-01.nc from COSIMA/access-om3#190 (comment) we might not need to make a change ?

@dougiesquire
Copy link
Collaborator Author

@marc-white I got halfway through this and then got distracted. The issue is that the regex description for ACCESS-OM3 is overly prescriptive and doesn't work with, for example, annual files. Would you be able to take this over? (It may be easier to just close this and start again, given your recent changes)

This comment contains a summary of what ACCESS-OM3 output should look like in the not-too-distant future. @anton-seaice can also help answer any questions.

@marc-white
Copy link
Collaborator

Yep, happy to take over.

@minghangli-uni
Copy link

Hi @marc-white, I wanted to check in regarding the status of this PR. I am currently having issues with the regex description for ACCESS-OM3, as it doesn’t seem to work with annual files.

@marc-white
Copy link
Collaborator

@minghangli-uni this got buried under other things, but I'll take a look now.

@marc-white
Copy link
Collaborator

So is this where the filename formats got to? The original issue is pretty convoluted... COSIMA/access-om3#190 (comment)

@marc-white
Copy link
Collaborator

@minghangli-uni could you give me some sample filenames that you're trying to work with?

@minghangli-uni
Copy link

Sure, you can find some in /scratch/tm70/ml0072/access-om3/archive/longerexpt5_rr_mean_snap3-perturb-5873b4db/output000

@marc-white
Copy link
Collaborator

@minghangli-uni I'm a bit confused, which are the annual files in there you're having issues with? I can only see what appear to be monthly files, with a smattering of others...

@minghangli-uni
Copy link

minghangli-uni commented Oct 15, 2024

I think the problem I am having is related to the annual filenames. here

    PATTERNS = [
        rf"[^\.]*\.{PATTERNS_HELPERS['om3_components']}\..*({PATTERNS_HELPERS['ymds']}|{PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']})$",  # ACCESS-OM3
    ]

It now doesnt support something like ['y']?

@marc-white
Copy link
Collaborator

@minghangli-uni ah yes, I see now. @dougiesquire has already made that update in this pull request, although I can see a couple of other things missing when I compare to your sample directory that I'll need to update.

@marc-white
Copy link
Collaborator

@minghangli-uni @dougiesquire @anton-seaice is the GMOM_JRA filename pattern something we need to keep supporting, or should everything under ACCESS-OM3 be changed? I saw @dougiesquire had changed all the filename parsing tests away from GMOM_JRA, but I'll also need to update some test data within the package if it's really going away.

@anton-seaice
Copy link
Collaborator

All of our current configs use the names 'access-om3' but I wouldn't rely on that too much. It would be nice to be able to change that name without having to change any code in the catalog builders.

@marc-white marc-white marked this pull request as draft October 15, 2024 22:58
@marc-white marc-white self-assigned this Oct 15, 2024
@marc-white
Copy link
Collaborator

@anton-seaice it's more that the current regex updates for ACCESS-OM3 have broken one of the GMOM_JRA patterns, so I need to know if I need to rectify that, or if it's OK to dispose of support for that pattern (plus I'll need to grab some new test data for the catalog creation).

@anton-seaice
Copy link
Collaborator

Which pattern isn't working ? Its probably fine to dispose of it but also nice if users can make modifications to filenames if they want without breaking things

@marc-white
Copy link
Collaborator

marc-white commented Oct 16, 2024

I managed to work out the issue with the regular expressions, and it seems to now be in a state where it works with both the new access-om3 and older GMOM_JRA patterns.

During this, I ran into an interesting regex conundrum that sucked up a lot of my day, which I wondered if anyone had seen before...

The current AccessOM3Builder in main uses an OR-style block to match the date group part of a filename:

PATTERNS = [
        rf"[^\.]*\.{PATTERNS_HELPERS['om3_components']}\..*({PATTERNS_HELPERS['ymds']}|{PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']})$",  # ACCESS-OM3
    ]

(look at the group in parenthesis that references the 'ymds', 'ymd', etc. keys of PATTERNS_HELPERS, with | in between each option).

The re documentation says that | searches left to right, taking the first match it finds. So far, so good - we want to see if we can match a Y-M-D-S pattern, if that doesn't work we try the simpler Y-M-D pattern, and so on.

However, if I added a Y only pattern to the end of the | chain, this would happen:

In [42]: b = re.search(AccessOm3Builder.PATTERNS[0], "GMOM_JRA_WD.ww3.hi.1900-01-03-00000.nc")

In [43]: b.group(0)
Out[43]: 'GMOM_JRA_WD.ww3.hi.1900_01_03_00000.nc'

In [44]: b.group(1)
Out[44]: '0000'

It's almost like the | search suddenly decided to deliberately search for the minimum match possible, and gave me back the last year match it found. Removing the Y pattern gets me back to the result I was expecting:

In [32]: b = re.search(AccessOm3Builder.PATTERNS[0], "GMOM_JRA_WD.ww3.hi.1900-01-03-00000.nc")

In [33]: b.group(1)
Out[33]: '1900-01-03-00000'

Any thoughts from anyone (but particularly @dougiesquire )?

@dougiesquire
Copy link
Collaborator Author

Any thoughts from anyone (but particularly @dougiesquire )?

I agree that's odd. Unfortunately nothing springs to mind sorry. I'm definitely not a regex expert.

@anton-seaice
Copy link
Collaborator

Thats pretty odd, I guess you could test it by changing the order of patterns in AccessOm3Builder.PATTERNS[0] ?

@marc-white
Copy link
Collaborator

Thats pretty odd, I guess you could test it by changing the order of patterns in AccessOm3Builder.PATTERNS[0] ?

Yep, tried that. It always seemed to lock on to the "Y" pattern, regardless of where I put it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ACCESS-OM3 filename patterns are too prescriptive
4 participants