
Data Seeding Scripts For Analysis Ready Dataset #53

Closed
DarshanSP19 wants to merge 3 commits

Conversation

DarshanSP19
Collaborator

This PR changes the current script so the Zarr store can be initialized from a specific date, and adds a script to seed data into the dataset later on.

Changes include

  • Added new arguments to the netcdf_to_zarr script: --init_date, --from_init_date and --only_initialize_store.
  • Removed --temp_location from the arguments, as the pipeline ignores it when running on DataflowRunner.
  • Added a script that seeds data into the zarr array directly, without involving the xarray layer and its chunking scheme (see the sketch after this list).
  • Moved some functions from netcdf_to_zarr.py to source_data.py so they can be reused in the data seeding script.
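
A minimal sketch of that direct-write idea (the store path, input file, and offset are hypothetical; the real logic lives in the new seeding script):

import xarray as xr
import zarr

store = zarr.open('gs://example-bucket/era5.zarr', mode='a')  # hypothetical store
ds = xr.open_dataset('example_day.nc')                        # hypothetical input

offset = 24  # hour index of this slice relative to the store's init_date
for name in ds.data_vars:
    # Write straight into the backing zarr arrays, bypassing xarray's
    # to_zarr() and its chunking scheme. Assumes time is the leading dimension.
    store[name][offset:offset + ds.sizes['time']] = ds[name].values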

The netcdf_to_zarr script can now be used in three different ways (example invocations follow the list):

  • Initialize stores from start_date to end_date and write chunks (the current flow).
  • Initialize stores from init_date to end_date, but write chunks only from start_date to end_date; values falling outside the range [start_date, end_date] remain NaN. (Requires --from_init_date; --init_date is optional.)
  • Only initialize the store without seeding data for now; seeding can be done later via a separate script, update_data.py. (Requires --only_initialize_store; --init_date is optional.)
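
For illustration, the three modes might be invoked like this (dates are hypothetical, and <other args> stands for the script's unchanged required flags):

  1) python netcdf_to_zarr.py <other args> --start_date 2020-01-01 --end_date 2020-12-31
  2) python netcdf_to_zarr.py <other args> --init_date 1979-01-01 --from_init_date --start_date 2020-01-01 --end_date 2020-12-31
  3) python netcdf_to_zarr.py <other args> --only_initialize_store --init_date 1900-01-01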

Some defaults

  • By default the script runs exactly as before.
  • The default init_date for initialization is 1900-01-01; it can be changed via the --init_date argument.
  • By default the script initializes the stores and then seeds the data; pass --only_initialize_store to only create the stores without writing any data. A sketch of these flag declarations follows.
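
A sketch of how the three new flags might be declared, assuming the script uses argparse (the real parser may differ):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--init_date', default='1900-01-01',
                    help='Date from which to initialize the zarr store.')
parser.add_argument('--from_init_date', action='store_true',
                    help='Initialize from init_date, but only write chunks '
                         'between start_date and end_date.')
parser.add_argument('--only_initialize_store', action='store_true',
                    help='Only create the stores; do not seed any data.')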

@DarshanSP19 self-assigned this on Sep 14, 2023
@alxmrs (Collaborator) left a comment:


Part one of the review. Will follow up with more later.

src/arco_era5/source_data.py (comment resolved)
def get_pressure_levels_arg(pressure_levels_group: str):
return PRESSURE_LEVELS_GROUPS[pressure_levels_group]

class LoadTemporalDataForDateDoFn(beam.DoFn):

Please add docstring.


Can we make this a PTransform instead?
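
A sketch of what that might look like (the constructor arguments are hypothetical and simply pass through to the existing DoFn), with a docstring as requested above:

import apache_beam as beam

class LoadTemporalDataForDate(beam.PTransform):
    """Loads temporal ERA5 data for each date in the input PCollection."""

    def __init__(self, *dofn_args, **dofn_kwargs):
        super().__init__()
        self._dofn = LoadTemporalDataForDateDoFn(*dofn_args, **dofn_kwargs)

    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        return pcoll | beam.ParDo(self._dofn)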

src/arco_era5/source_data.py (outdated comment, resolved)
src/arco_era5/source_data.py (outdated comment, resolved)
# Make sure we print the date as part of the error for easier debugging
# if something goes wrong. Note "from e" will also raise the details of the
# original exception.
raise Exception(f"Error loading {year}-{month}-{day}") from e

Why do we do this?
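
For context, `from e` chains the original exception into the traceback, so both errors are reported; a minimal sketch (the failing loader is hypothetical):

def load_chunk(year, month, day):
    raise ValueError('corrupt NetCDF input')  # hypothetical failure

try:
    load_chunk(2014, 6, 29)
except Exception as e:
    # The traceback shows the original ValueError first, then
    # "The above exception was the direct cause of the following exception:",
    # then this dated wrapper.
    raise Exception('Error loading 2014-6-29') from e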

src/arco_era5/update.py (outdated comment, resolved)
init_date: str

def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
return pcoll | beam.Map(update, target=self.target, init_date=self.init_date)

Do you want to make this MapTuple?


def update(offset_ds: Tuple[int, xr.Dataset, str], target: str, init_date: str):
"""Generate region slice and update zarr array directly"""
key, ds = offset_ds

If this is called with MapTuple, then you wouldn't have to unpack this.
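
A sketch of the MapTuple variant (the adjusted signature is hypothetical): beam.MapTuple unpacks each (key, ds) element into positional arguments, so the manual unpacking disappears:

import apache_beam as beam

def update(key, ds, target, init_date):
    """Generate region slice and update zarr array directly."""
    ...  # same body as before, minus `key, ds = offset_ds`

# In the PTransform's expand():
# return pcoll | beam.MapTuple(update, target=self.target, init_date=self.init_date)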

Comment on lines 17 to 18
date = datetime.datetime.strptime(init_date, '%Y-%m-%d') + datetime.timedelta(days=offset / HOURS_PER_DAY)
date_str = date.strftime('%Y-%m-%d')

Why do we parse a date string into a datetime, then write it back to a string?

@DarshanSP19 (Author) replied on Sep 15, 2023:


This is only for logs, so we can see log lines like the ones below.

Started 10m_u_component_of_wind for 2014-06-29
Done 10m_u_component_of_wind for 2014-06-29
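
A minimal worked example of that offset-to-date conversion for the logs, assuming init_date '1900-01-01' and a 24-hour offset:

import datetime

HOURS_PER_DAY = 24
init_date = '1900-01-01'
offset = 24  # hypothetical: 24 hours past init_date

date = datetime.datetime.strptime(init_date, '%Y-%m-%d') \
    + datetime.timedelta(days=offset / HOURS_PER_DAY)
print(date.strftime('%Y-%m-%d'))  # -> 1900-01-02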

src/arco_era5/update.py (outdated comment, resolved)
@DarshanSP19 (Author):

Closing this PR, as I've raised a separate PR (#58) that includes these changes along with the CO data backfill scripts.

@DarshanSP19 closed this on Oct 2, 2023
@DarshanSP19 deleted the ar-backfill-scripts branch on Oct 2, 2023 at 11:22