Feature request: add support for noon-centered times to copernicusmarine.open_dataset() #271

Open
Tracked by #878
veenstrajelmer opened this issue Jan 15, 2025 · 4 comments

Comments

@veenstrajelmer

veenstrajelmer commented Jan 15, 2025

Motivation
Since the introduction of ARCO, all time-averaged datasets have start-of-interval instead of center-of-interval time samples, as documented in https://help.marine.copernicus.eu/en/articles/8656000-differences-between-netcdf-and-arco-formats. It makes complete sense to harmonize this across datasets, and I realize the ARCO structure was necessary to enable the impressive performance we see in the copernicusmarine toolbox these days.

However, at our institute we use the data as model forcing and we also compare it to measurements. With the new time convention we see a clear offset: 12 hours in the case of daily-averaged data. This makes sense, but it is quite inconvenient. Of course, it can be manually shifted in our workflows, but I would avoid this if possible.

Furthermore, the PUM states that the daily-averaged products are centered at noon, not at midnight, so the current behaviour could be confusing for users:
image

I have been in touch with the helpdesk before; I think the original request was filed under MDSOP-179, if that helps.

Expected behavior
Support in copernicusmarine.open_dataset() for getting noon-centered times, for instance via an additional keyword.

Actual behavior
Start-of-interval (midnight) times, so the reproducible code below prints:
1993-01-01 00:00:00 2021-06-30 00:00:00

Steps to reproduce

import copernicusmarine
import pandas as pd
# import logging
# logging.getLogger("copernicusmarine").setLevel(logging.DEBUG)

dataset_id = "cmems_mod_glo_phy_my_0.083deg_P1D-m"
ds = copernicusmarine.open_dataset(
    dataset_id=dataset_id,
    service="arco-geo-series",
    chunk_size_limit=None,
)

# Print the first and last timestamps on the time axis
ds_tstart = pd.Timestamp(ds.time.isel(time=0).values)
ds_tstop = pd.Timestamp(ds.time.isel(time=-1).values)
print(ds_tstart, ds_tstop)

Alternatives
Alternatively, it would at least be helpful if averaging metadata were present in the returned dataset, for instance as attributes of the time variable: the fact that the data is daily averaged, and that the times are start-of-interval times. I can imagine this is not easy to implement across all averaged datasets, but it would make it easier to apply the correct time corrections on our side (Deltares/dfm_tools#878).
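
To illustrate how such metadata could be used on the user side, here is a minimal sketch on a synthetic xarray dataset. The attribute names `averaging_period_hours` and `time_reference`, and the helper `to_center_of_interval`, are hypothetical: the toolbox does not provide them today, they only show what a metadata-driven time correction might look like.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a daily-mean dataset with start-of-interval times.
# The two time attributes are HYPOTHETICAL; they are not set by the toolbox.
times = pd.date_range("1993-01-01", periods=5, freq="D")
time_attrs = {"averaging_period_hours": 24, "time_reference": "start-of-interval"}
ds = xr.Dataset(
    {"zos": ("time", np.arange(5.0))},
    coords={"time": ("time", times, time_attrs)},
)


def to_center_of_interval(ds):
    """Shift start-of-interval times by half the averaging period."""
    attrs = dict(ds["time"].attrs)
    if attrs.get("time_reference") != "start-of-interval":
        return ds  # already centered, or the convention is unknown
    half = pd.Timedelta(hours=attrs["averaging_period_hours"]) / 2
    new_attrs = {**attrs, "time_reference": "center-of-interval"}
    new_times = ds.indexes["time"] + half
    return ds.assign_coords(time=("time", new_times, new_attrs))


ds_mid = to_center_of_interval(ds)
print(ds_mid["time"].values[0], ds_mid["time"].attrs["time_reference"])
```

Without such attributes, the same 12-hour shift has to be hard-coded per dataset, which is the manual correction this request tries to avoid.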

Environment

  • Python 3.11.11
  • copernicusmarine 2.0.0
@renaudjester
Collaborator

We will discuss it internally because as you said:

Of course, it can be manually shifted in our workflows, but I would avoid this if possible.

That could also be the position of the toolbox. But we will get back to you!

@veenstrajelmer
Author

veenstrajelmer commented Jan 28, 2025

The easiest way to clarify the offset that I described in the issue description is with a sine wave with a period of several days. This is also the timescale on which processes affecting zos/temperature/salinity could take place. This is purely for illustration purposes; I realize this is not actual data, but it is much easier to produce:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.close("all")

# Synthetic "instantaneous" signal: a sine wave sampled every 10 minutes
x = pd.date_range("2020-01-01", "2020-02-01", freq="10min")
xrange = np.arange(len(x))
y = np.sin(5/len(x) * np.pi * xrange)
ser = pd.Series(y, index=x)
fig, ax = plt.subplots(figsize=(12,6))
ax.plot(ser.index, ser, label="original data")

# Daily means with start-of-interval timestamps (00:00)
daymean_start = ser.groupby(pd.PeriodIndex(ser.index, freq="D")).mean()
daymean_start.index = daymean_start.index.to_timestamp()
# Same values, shifted by 12 hours to center-of-interval timestamps (12:00)
daymean_mid = daymean_start.copy()
daymean_mid.index = daymean_mid.index + pd.Timedelta(hours=12)
ax.plot(daymean_start.index, daymean_start, label='mean start-of-interval')
ax.plot(daymean_mid.index, daymean_mid, label='mean center-of-interval')
ax.legend()

Gives:
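[Image: plot of the original data (blue), the daily means with start-of-interval timestamps (orange) and the daily means with center-of-interval timestamps (green)]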

Use case:
The model data available via copernicusmarine subset/open_dataset (for instance the dataset_id cmems_mod_glo_phy_my_0.083deg_P1D-m) is averaged per day with start-of-interval timestamps (orange line). Comparing this to instantaneous model data from local hydrodynamic models, or to instantaneous observations (both represented by the blue line), shows a clear offset. If the data were stored with center-of-interval timestamps (green line), there would be no such offset.
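
The offset can also be put into a number instead of being read from a plot. The sketch below reuses the same synthetic sine wave (not actual data) and compares the daily means against the instantaneous signal at the timestamp each convention assigns to them. The helper name `rmse_against_instantaneous` is made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic signal as above: a sine wave sampled every 10 minutes.
x = pd.date_range("2020-01-01", "2020-02-01", freq="10min")
ser = pd.Series(np.sin(5 / len(x) * np.pi * np.arange(len(x))), index=x)

# Daily means labelled at 00:00; drop the last day, which holds a single sample.
daymean = ser.resample("D").mean().iloc[:-1]


def rmse_against_instantaneous(label_offset):
    """RMSE between daily means and the instantaneous signal at the label time."""
    label_times = daymean.index + label_offset
    instantaneous = ser.loc[label_times].to_numpy()
    return float(np.sqrt(np.mean((daymean.to_numpy() - instantaneous) ** 2)))


rmse_start = rmse_against_instantaneous(pd.Timedelta(0))
rmse_mid = rmse_against_instantaneous(pd.Timedelta(hours=12))
print(f"start-of-interval RMSE:  {rmse_start:.3f}")
print(f"center-of-interval RMSE: {rmse_mid:.3f}")
```

For this signal the center-of-interval error is far smaller than the start-of-interval error, which is the same picture as the green versus orange line.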

@veenstrajelmer
Author

veenstrajelmer commented Jan 29, 2025

In the meantime I was working on a workaround for this on our side, and hoped to be able to use the dataset_id string, since it contains things like "P1D-m" for daily means. However, I realized this is not always consistent; see for instance the dataset_id med-cmcc-cur-rean-d here: https://data.marine.copernicus.eu/product/MEDSEA_MULTIYEAR_PHY_006_004/services
Would it be possible to harmonize the dataset_ids so that it is clear if and how the data is averaged? Metadata or center-of-interval (as requested in this issue) would be far more useful. But in the meantime it would be helpful to at least be able to get it from the dataset_id strings.
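
For reference, a rough sketch of the dataset_id-based workaround described above. It assumes that a token such as "P1D-m" (or "PT1H-m") at the end of the id marks a mean over that interval; that reading of the suffix, and the helper name `center_offset_from_dataset_id`, are assumptions for this illustration. Ids that do not follow the pattern return None, which is exactly the inconsistency mentioned above.

```python
import re

import pandas as pd

# Assumed convention: "..._P<n>D-m" or "..._PT<n>H-m" marks an n-day or n-hour mean.
_MEAN_TOKEN = re.compile(r"_P(?:(\d+)D|T(\d+)H)-m$")


def center_offset_from_dataset_id(dataset_id):
    """Return half the averaging interval, or None if the id gives no hint."""
    match = _MEAN_TOKEN.search(dataset_id)
    if match is None:
        return None
    days, hours = match.groups()
    if days is not None:
        interval = pd.Timedelta(days=int(days))
    else:
        interval = pd.Timedelta(hours=int(hours))
    return interval / 2


for dataset_id in ["cmems_mod_glo_phy_my_0.083deg_P1D-m", "med-cmcc-cur-rean-d"]:
    print(dataset_id, center_offset_from_dataset_id(dataset_id))
```

Monthly means are deliberately left out, since the length of a month varies and half an interval is not a fixed offset.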

@renaudjester
Collaborator

About this last comment, there is nothing we can do on the toolbox side 🤔
