RegEx support of Datasets packages #81

Open
JohnMrziglod opened this issue Jan 3, 2018 · 1 comment

[from @gerritholl]

If I understand generate_filename correctly, the typhon.spareice.datasets approach assumes that the filename can be calculated using only the placeholders in the template. This is not the case for most real datasets. For example, many filenames contain orbit numbers or the string of a downlink station. That means it is necessary to include a regular expression. I'm not sure this is possible with the typhon.spareice.datasets approach but if it isn't, that would be a major limitation.

You are right. So far generate_filename only uses temporal placeholders. I thought about implementing user-defined placeholders but I have not had the time to do it. What do you need them for? Do you want to keep the information from the original filenames and create new filenames with it? A kind of filename conversion? Could you give me a more detailed example? How do you use typhon.Datasets for this?

@JohnMrziglod JohnMrziglod added the discussion Conversation about feature ideas label Jan 3, 2018
@JohnMrziglod JohnMrziglod added this to General in Merge competing Dataset approaches via automation Jan 3, 2018
@JohnMrziglod JohnMrziglod moved this from General to Feature + Support Discussion in Merge competing Dataset approaches Jan 3, 2018

JohnMrziglod commented Jan 10, 2018

As a regex example, a HIRS filename looks like 'NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz'. I describe that with the regex r"(L?\d*\.)?NSS.HIR[XS].(?P<satcode>.{2})\.D(?P<year>\d{2})(?P<doy>\d{3})\.S(?P<hour>\d{2})(?P<minute>\d{2})\.E(?P<hour_end>\d{2})(?P<minute_end>\d{2})\.B(?P<B>\d{7})\.(?P<station>.{2})\.gz". Of those fields, B and station are present in the filename but cannot be predicted from the start time. In the case of FCDR_HIRS, I both read and write data, so I have both the re approach and a template-based approach:

stored_name = ("FIDUCEO_FCDR_L1C_HIRS{version:d}_{satname:s}_"
               "{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}_"
               "{year_end:04d}{month_end:02d}{day_end:02d}{hour_end:02d}{minute_end:02d}{second_end:02d}_"
               "{fcdr_type:s}_v{data_version:s}_fv{format_version:s}.nc")
write_subdir = "{fcdr_type:s}/{satname:s}/{year:04d}/{month:02d}/{day:02d}"
stored_re = (r"FIDUCEO_FCDR_L1C_HIRS(?P<version>[2-4])_"
             r"(?P<satname>.{6})_"
             r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
             r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})_"
             r"(?P<year_end>\d{4})(?P<month_end>\d{2})(?P<day_end>\d{2})"
             r"(?P<hour_end>\d{2})(?P<minute_end>\d{2})(?P<second_end>\d{2})_"
             r"(?P<fcdr_type>[a-zA-Z]*)_"
             r"v(?P<data_version>.+)_"
             r"fv(?P<format_version>.+)\.nc")

My file finder uses the regular expression, but the writing part uses the template. This is a duplication; ideally, one definition should be enough.
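One way to avoid the duplication is to derive the regular expression from the format template. The following is only a rough sketch using the standard library, not how typhon does it: the helper name template_to_regex, the overrides argument, and the rule that maps format specs to patterns are all assumptions made for illustration, and the field values in the example are made up.

```python
import re
from string import Formatter


def template_to_regex(template, overrides=None):
    """Build a regex with named groups from a str.format template.

    Hypothetical helper: integer specs such as "04d" become \\d{4},
    everything else becomes a lazy ".+?". `overrides` maps a field
    name to a hand-written pattern where the default is too loose.
    """
    overrides = overrides or {}
    parts = []
    seen = set()
    for literal, field, spec, _conv in Formatter().parse(template):
        parts.append(re.escape(literal))
        if field is None:
            continue
        if field in seen:
            # A repeated field must match the same text again.
            parts.append(f"(?P={field})")
            continue
        seen.add(field)
        if field in overrides:
            pattern = overrides[field]
        else:
            int_spec = re.fullmatch(r"0?(\d*)d", spec)
            if int_spec is None:
                pattern = ".+?"
            elif int_spec.group(1):
                pattern = rf"\d{{{int_spec.group(1)}}}"
            else:
                pattern = r"\d+"
        parts.append(f"(?P<{field}>{pattern})")
    return re.compile("".join(parts))


stored_name = ("FIDUCEO_FCDR_L1C_HIRS{version:d}_{satname:s}_"
               "{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}_"
               "{year_end:04d}{month_end:02d}{day_end:02d}{hour_end:02d}{minute_end:02d}{second_end:02d}_"
               "{fcdr_type:s}_v{data_version:s}_fv{format_version:s}.nc")

# The template is the single definition; the regex is generated from it.
stored_re = template_to_regex(stored_name, overrides={"version": "[2-4]"})

# Made-up field values, used only to exercise the round trip.
example = stored_name.format(
    version=3, satname="metopa",
    year=2010, month=1, day=1, hour=0, minute=0, second=0,
    year_end=2010, month_end=1, day_end=1, hour_end=1, minute_end=45,
    second_end=0, fcdr_type="easy", data_version="0.8", format_version="0.7")
fields = stored_re.fullmatch(example).groupdict()
print(fields["satname"], fields["year"], fields["data_version"])
```

The matched values come back as strings, so they would need converting to int before being fed back into the template's integer fields.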

@gerritholl spareice.datasets now partly supports this feature. A user can define regular expressions and use them as placeholders in filenames (currently only in the basename, not in the directory name). Try this example (you need a file named NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz):

from typhon.spareice.datasets import Dataset
# Raw strings keep the backslash in \d from being read as a string escape.
placeholder = {
    "satcode": r"(.{2})",
    "B": r"(\d{7})",
    "station": r"(.{2})"
}
dataset = Dataset(
    "NSS.HIR[XS].{satcode}.D{year2}{doy}.S{hour}{minute}.E{end_hour}{end_minute}.B{B}.{station}.gz",
    placeholder=placeholder,
)
file_info = dataset.find_file("1999-05-08")
print(file_info)

This prints:

.../NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz
  Start: 1999-05-07 06:32:00
  End: 1999-05-07 08:20:00
  Attributes:
    satcode: NJ
    B: 2241718
    station: WI
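For comparison, the same fields can be pulled out with the standard library alone, using the regex quoted earlier in this thread. This is only an illustrative sketch and says nothing about how spareice.datasets parses names internally; the function name parse_hirs_name is made up here.

```python
import re
from datetime import datetime

# Regex as quoted earlier in this thread, split over several lines.
HIRS_RE = re.compile(
    r"(L?\d*\.)?NSS.HIR[XS].(?P<satcode>.{2})"
    r"\.D(?P<year>\d{2})(?P<doy>\d{3})"
    r"\.S(?P<hour>\d{2})(?P<minute>\d{2})"
    r"\.E(?P<hour_end>\d{2})(?P<minute_end>\d{2})"
    r"\.B(?P<B>\d{7})\.(?P<station>.{2})\.gz"
)


def parse_hirs_name(name):
    """Return (named fields, start time) for a HIRS filename."""
    match = HIRS_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a HIRS filename: {name!r}")
    fields = match.groupdict()
    # %y is the two-digit year, %j the day of the year.
    start = datetime.strptime(
        "{year} {doy} {hour} {minute}".format(**fields), "%y %j %H %M")
    return fields, start


fields, start = parse_hirs_name("NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz")
print(start)
print(fields["satcode"], fields["B"], fields["station"])
```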

file_info holds information about the file. You can access the parsed placeholders via file_info.attr and use them to generate filenames for other datasets:

other_dataset = Dataset("dummy_file_{year}{month}{day}_{satcode}_B{B}_{station}.dat")
other_dataset.generate_filename("1999-05-08", fill=file_info.attr)
#  '.../dummy_file_19990508_NJ_B2241718_WI.dat'
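Conceptually, the fill step merges the temporal fields with the attributes parsed from the other file. A minimal standard-library sketch of that idea follows; fill_template is a hypothetical helper, not part of typhon, and it only handles the year, month and day placeholders.

```python
from datetime import datetime


def fill_template(template, timestamp, fill=None):
    """Fill temporal placeholders from `timestamp`, the rest from `fill`."""
    fields = {
        "year": f"{timestamp:%Y}",
        "month": f"{timestamp:%m}",
        "day": f"{timestamp:%d}",
    }
    # Attributes parsed from another filename supply the remaining fields.
    fields.update(fill or {})
    return template.format(**fields)


attr = {"satcode": "NJ", "B": "2241718", "station": "WI"}
name = fill_template(
    "dummy_file_{year}{month}{day}_{satcode}_B{B}_{station}.dat",
    datetime(1999, 5, 8),
    fill=attr,
)
print(name)
```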
