-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data Seeding Scripts For Analysis Ready Dataset #53
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part one of the review. Will follow up with more later.
def get_pressure_levels_arg(pressure_levels_group: str): | ||
return PRESSURE_LEVELS_GROUPS[pressure_levels_group] | ||
|
||
class LoadTemporalDataForDateDoFn(beam.DoFn): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this a PTransform instead?
# Make sure we print the date as part of the error for easier debugging | ||
# if something goes wrong. Note "from e" will also raise the details of the | ||
# original exception. | ||
raise Exception(f"Error loading {year}-{month}-{day}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we do this?
src/arco_era5/update.py
Outdated
init_date: str | ||
|
||
def expand(self, pcoll: beam.PCollection) -> beam.PCollection: | ||
return pcoll | beam.Map(update, target=self.target, init_date=self.init_date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to make this MapTuple
?
src/arco_era5/update.py
Outdated
|
||
def update(offset_ds: Tuple[int, xr.Dataset, str], target: str, init_date: str): | ||
"""Generate region slice and update zarr array directly""" | ||
key, ds = offset_ds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is called with MapTuple
, then you wouldn't have to unpack this.
src/arco_era5/update.py
Outdated
date = datetime.datetime.strptime(init_date, '%Y-%m-%d') + datetime.timedelta(days=offset / HOURS_PER_DAY) | ||
date_str = date.strftime('%Y-%m-%d') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we parse a date string into a datetime, then write it back to a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for logs. So we can see the logs like below.
Started 10m_u_component_of_wind for 2014-06-29
Done 10m_u_component_of_wind for 2014-06-29
3a1329b
to
3d6adc5
Compare
Closing this PR as I've raised a separate PR including these changes and for CO data back fill scripts. Here #58 |
Here are some changes in the current script to initialize the
zarr
store from some specific date and a script to seed the data later on in the dataset.Change Includes
netcdf_to_zarr
script.--init_date
,--from_init_date
and--only_initialize_store
.--temp_location
from arguments as the pipeline is ignoring it while running onDataflowRunner
.zarr
array itself without involving thexarray
layer and chunking scheme.source_data.py
fromnetcdf_to_zarr.py
as it's reused in the data seeding script.netcdf_to_zarr
script can now be used in three different ways.start_date
toend_date
and write chunks. (Current flow)init_date
toend_date
and write chunks fromstart_date
toend_date
. The other values which are falling beyond the range ofstart_date
andend_date
will remainnans
. (Required--from_init_date
and optional--init_date
.)update_data.py
. (Required--only_initialize_store
and optional--init_date
)Some defaults
init_date
will be1900-01-01
. Can be changed via--init_date
arg.--only_initialize_store
which will only create stores and not write data.