
Data Seeding Scripts For Analysis Ready Dataset #53

Closed
DarshanSP19 wants to merge 3 commits

Conversation

DarshanSP19
Collaborator

This PR changes the current script so the Zarr store can be initialized from a specific date, and adds a script to seed data into the dataset later on.

Changes include

  • Added new arguments to the netcdf_to_zarr script: --init_date, --from_init_date and --only_initialize_store.
  • Removed --temp_location from the arguments, as the pipeline ignores it when running on DataflowRunner.
  • Added a script that seeds data into the zarr array directly, without involving the xarray layer and its chunking scheme (see the sketch after this list).
  • Moved some functions from netcdf_to_zarr.py to source_data.py so they can be reused in the data seeding script.
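
A minimal sketch of that direct-write idea (the store path, input file, and offset are hypothetical; the real logic lives in the new seeding script):

import xarray as xr
import zarr

store = zarr.open('gs://example-bucket/era5.zarr', mode='a')  # hypothetical store
ds = xr.open_dataset('example_day.nc')                        # hypothetical input

offset = 24  # hour index of this slice relative to the store's init_date
for name in ds.data_vars:
    # Write straight into the backing zarr arrays, bypassing xarray's
    # to_zarr() and its chunking scheme. Assumes time is the leading dimension.
    store[name][offset:offset + ds.sizes['time']] = ds[name].values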

The netcdf_to_zarr script can now be used in three different ways (example invocations follow the list):

  • Initialize stores from start_date to end_date and write chunks (the current flow).
  • Initialize stores from init_date to end_date, but write chunks only from start_date to end_date; values falling outside the range [start_date, end_date] remain NaN. (Requires --from_init_date; --init_date is optional.)
  • Only initialize the store without seeding data for now; seeding can be done later via a separate script, update_data.py. (Requires --only_initialize_store; --init_date is optional.)
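
For illustration, the three modes might be invoked like this (dates are hypothetical, and <other args> stands for the script's unchanged required flags):

  1) python netcdf_to_zarr.py <other args> --start_date 2020-01-01 --end_date 2020-12-31
  2) python netcdf_to_zarr.py <other args> --init_date 1979-01-01 --from_init_date --start_date 2020-01-01 --end_date 2020-12-31
  3) python netcdf_to_zarr.py <other args> --only_initialize_store --init_date 1900-01-01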

Some defaults

  • By default the script runs exactly as before.
  • The default init_date for initialization is 1900-01-01; it can be changed via the --init_date argument.
  • By default the script initializes the stores and then seeds the data; pass --only_initialize_store to only create the stores without writing any data. A sketch of these flag declarations follows.
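
A sketch of how the three new flags might be declared, assuming the script uses argparse (the real parser may differ):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--init_date', default='1900-01-01',
                    help='Date from which to initialize the zarr store.')
parser.add_argument('--from_init_date', action='store_true',
                    help='Initialize from init_date, but only write chunks '
                         'between start_date and end_date.')
parser.add_argument('--only_initialize_store', action='store_true',
                    help='Only create the stores; do not seed any data.')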

@DarshanSP19 self-assigned this on Sep 14, 2023
@alxmrs (Collaborator) left a comment:


Part one of the review. Will follow up with more later.

src/arco_era5/source_data.py (comment resolved)
def get_pressure_levels_arg(pressure_levels_group: str):
return PRESSURE_LEVELS_GROUPS[pressure_levels_group]

class LoadTemporalDataForDateDoFn(beam.DoFn):

Please add docstring.


Can we make this a PTransform instead?
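
A sketch of what that might look like (the constructor arguments are hypothetical and simply pass through to the existing DoFn), with a docstring as requested above:

import apache_beam as beam

class LoadTemporalDataForDate(beam.PTransform):
    """Loads temporal ERA5 data for each date in the input PCollection."""

    def __init__(self, *dofn_args, **dofn_kwargs):
        super().__init__()
        self._dofn = LoadTemporalDataForDateDoFn(*dofn_args, **dofn_kwargs)

    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        return pcoll | beam.ParDo(self._dofn)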

src/arco_era5/source_data.py (outdated comment, resolved)
src/arco_era5/source_data.py (outdated comment, resolved)
# Make sure we print the date as part of the error for easier debugging
# if something goes wrong. Note "from e" will also raise the details of the
# original exception.
raise Exception(f"Error loading {year}-{month}-{day}") from e

Why do we do this?
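
For context, `from e` chains the original exception into the traceback, so both errors are reported; a minimal sketch (the failing loader is hypothetical):

def load_chunk(year, month, day):
    raise ValueError('corrupt NetCDF input')  # hypothetical failure

try:
    load_chunk(2014, 6, 29)
except Exception as e:
    # The traceback shows the original ValueError first, then
    # "The above exception was the direct cause of the following exception:",
    # then this dated wrapper.
    raise Exception('Error loading 2014-6-29') from e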

src/arco_era5/update.py (outdated comment, resolved)
init_date: str

def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
return pcoll | beam.Map(update, target=self.target, init_date=self.init_date)

Do you want to make this MapTuple?


def update(offset_ds: Tuple[int, xr.Dataset, str], target: str, init_date: str):
"""Generate region slice and update zarr array directly"""
key, ds = offset_ds

If this is called with MapTuple, then you wouldn't have to unpack this.
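
A sketch of the MapTuple variant (the adjusted signature is hypothetical): beam.MapTuple unpacks each (key, ds) element into positional arguments, so the manual unpacking disappears:

import apache_beam as beam

def update(key, ds, target, init_date):
    """Generate region slice and update zarr array directly."""
    ...  # same body as before, minus `key, ds = offset_ds`

# In the PTransform's expand():
# return pcoll | beam.MapTuple(update, target=self.target, init_date=self.init_date)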

Comment on lines 17 to 18
date = datetime.datetime.strptime(init_date, '%Y-%m-%d') + datetime.timedelta(days=offset / HOURS_PER_DAY)
date_str = date.strftime('%Y-%m-%d')

Why do we parse a date string into a datetime, then write it back to a string?

@DarshanSP19 (Author) replied on Sep 15, 2023:


This is only for logs, so we can see log lines like the ones below.

Started 10m_u_component_of_wind for 2014-06-29
Done 10m_u_component_of_wind for 2014-06-29
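
A minimal worked example of that offset-to-date conversion for the logs, assuming init_date '1900-01-01' and a 24-hour offset:

import datetime

HOURS_PER_DAY = 24
init_date = '1900-01-01'
offset = 24  # hypothetical: 24 hours past init_date

date = datetime.datetime.strptime(init_date, '%Y-%m-%d') \
    + datetime.timedelta(days=offset / HOURS_PER_DAY)
print(date.strftime('%Y-%m-%d'))  # -> 1900-01-02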

src/arco_era5/update.py (outdated comment, resolved)
@DarshanSP19 (Author):

Closing this PR, as I've raised a separate PR (#58) that includes these changes along with the CO data backfill scripts.

@DarshanSP19 closed this on Oct 2, 2023
@DarshanSP19 deleted the ar-backfill-scripts branch on Oct 2, 2023 at 11:22