Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time_base='union' method #171

Closed
wants to merge 11 commits into from
Closed

Conversation

smwoodman
Copy link

first pass at a time_base='union' method for binary_to_timeseries
also changed several np.NaN to np.nan, per error from numpy 2.0

Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still needs discussion. I think your version is pretty idiosyncratic, and if we are to go this route, I think it should be more general. Thanks!

@@ -67,10 +67,10 @@ def extract_timeseries_profiles(inname, outdir, deploymentyaml):
dss['v'] = dss.water_velocity_northward.mean()
dss['v'].attrs = profile_meta['v']
elif 'u' in profile_meta:
dss['u'] = profile_meta['u'].get('_FillValue', np.NaN)
dss['u'] = profile_meta['u'].get('_FillValue', np.nan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we revert all these - NaN is a perfectly acceptable alias for nan, and the is clutters up this PR.

@@ -807,9 +811,35 @@ def binary_to_timeseries(indir, cachedir, outdir, deploymentyaml, *,
outdir : string
Directory to put the merged timeseries files.

deploymentyaml : str
deploymentyaml : string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deploymentyaml : string
deploymentyaml : str

The python type is str

fnamesuffix : string
Suffix for the output timeseries file

time_base : string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time_base : string
time_base : str, default 'sci_water_temp'

pyglider/slocum.py Outdated Show resolved Hide resolved
Comment on lines 828 to 841
If this value is 'union', then the processing is handled differently,
to allow for 'unioning' the engineering and science timeseries. This
may be useful if for instance you want a full time series, and science
variables are only sampled on dives.

For a value of 'union', the dbdreader MultiDBD.get() method is used
rather than get_sync to read the parameters specified in
deploymentyaml. The argument return_nans (of MultiDBD.get()) is set to
True, so that there are two 'time bases' for the extracted data: one
for engineering variables (from m_present_time), and one for science
variables (from sci_m_present_time). These times are rounded to the
nearest second, and then merged. These values are the time index of
the output file. In this case, only the engineering variables (e.g.,
lat/lon, pitch, roll, m_depth) are interpolated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If this value is 'union', then the processing is handled differently,
to allow for 'unioning' the engineering and science timeseries. This
may be useful if for instance you want a full time series, and science
variables are only sampled on dives.
For a value of 'union', the dbdreader MultiDBD.get() method is used
rather than get_sync to read the parameters specified in
deploymentyaml. The argument return_nans (of MultiDBD.get()) is set to
True, so that there are two 'time bases' for the extracted data: one
for engineering variables (from m_present_time), and one for science
variables (from sci_m_present_time). These times are rounded to the
nearest second, and then merged. These values are the time index of
the output file. In this case, only the engineering variables (e.g.,
lat/lon, pitch, roll, m_depth) are interpolated.
If this value is 'union', then the processing is handled differently,
to allow for 'unioning' the engineering and science timeseries. This
may be useful if for instance you want a full time series, and science
variables are only sampled on dives.
For a value of 'union', the dbdreader MultiDBD.get() method is used
rather than get_sync to read the parameters specified in
deploymentyaml. The argument return_nans (of MultiDBD.get()) is set to
True, so that there are two 'time bases' for the extracted data: one
for engineering variables (from m_present_time), and one for science
variables (from sci_m_present_time). These times are rounded to the
nearest second, and then merged. These values are the time index of
the output file. In this case, only the engineering variables (e.g.,
lat/lon, pitch, roll, m_depth) are interpolated.

This isn't super clear, and not what I understood was going to happen here. If there are two science sensors on different time bases, I thought both times were going to be logged and NaN inserted for the sensors on a different time base.

Not clear on the "rounding to nearest second part". Surely there are plenty of sensors sampled faster than 1 Hz that this will be a bad assumption for. I'm not clear why you are doing this.

Why would the engineering sensors be treated differently than a mismatched science sensor? This seems a mishmash of the two approaches, and you could imagine someone wanting the raw engineering data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be updated now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Docs updated, and hopefully reasonably simplified

_log.debug(f'sensors: {[i for i in sensors]}')

time_base_union = time_base == 'union'
if time_base_union:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the "normal" loop first, and the time_base_union loop second. It makes the code easier to read if the default predeces the non-default.

@smwoodman
Copy link
Author

Thank you for the comments!

I added a commit reverting to NaN (although see comment above), putting the default method first, and starting doc updates. Other comments/questions follow

@smwoodman
Copy link
Author

smwoodman commented Jun 25, 2024

This isn't super clear, and not what I understood was going to happen here. If there are two science sensors on different time bases, I thought both times were going to be logged and NaN inserted for the sensors on a different time base.

I expected that too. In testing though, the dbdreader MultiDBD.get() behavior seemed to be to put all sci variables on sci_m_present_time. I.e., all science variables had the same number of values and returned the same timestamps. I think this is because of here?

Not clear on the "rounding to nearest second part". Surely there are plenty of sensors sampled faster than 1 Hz that this will be a bad assumption for. I'm not clear why you are doing this.

This I think is the key question for this method being general enough. Yep, there are plenty of sensors that sampled faster than 1Hz. Because index of the binary_to_timeseries() output is rounded to/returned as seconds) rather than eg ns, I felt that the merging should happen using the same time units. Otherwise, the merging would happen at the ns scale, but then the final index would be at the second scale and so there would be lots of timestamps with the same value.

Why would the engineering sensors be treated differently than a mismatched science sensor? This seems a mishmash of the two approaches, and you could imagine someone wanting the raw engineering data.

Very fair point, I'll update it. I did this based on our needs/interests, and agree it's inconsistent. Although, should the latitude and longitude values still be interpolated in your opinion?

@smwoodman
Copy link
Author

Hi @jklymak, I added a commit such that no engineering values (including lat/lon) are interpolated.

I also wanted to check in about your thoughts on the above?

@jklymak
Copy link
Member

jklymak commented Jul 17, 2024

OK< let's fix this step by step - I think times should definitely not be rounded, so that is a bug. I'll fix now.

@jklymak
Copy link
Member

jklymak commented Jul 17, 2024

See #177 for better handling of the time.

@smwoodman
Copy link
Author

Changes from #177 (and #173, #175, and #176) merged into this pr, and union of sci/eng times now happens at the int level

@smwoodman
Copy link
Author

Hey @jklymak Sorry for the extra ping, but when you have a chance are there any other adjustments that I can make to this pr at the moment? Or is it close to something you're comfortable with?

@jklymak
Copy link
Member

jklymak commented Aug 10, 2024

OK, my apologies - I guess I still think that this is too orthogonal to the main timeseries method. If you really want to go this way, can we just make a new method binary_to_raw_timeseries, name up to workshopping?

I understand the potential appeal of such a data set, but actually strongly feel you will end up in the end with something like what binary_to_timeseries gives you in the end. If I was really concerned about lining everything up with the CTD, I'd make separate netcdf files for each sensor rather than a single file full of NaNs, and binary_to_timeseries can already do this.

@smwoodman
Copy link
Author

Thanks for your explanation and time with this! Totally respected wrt your feelings on this being too specific of a want for pyglider.

If you really want to go this way, can we just make a new method binary_to_raw_timeseries, name up to workshopping?

I don't follow how binary_to_raw_timeseries would be structured - it would be a new method with basically the same code as binary_to_timeseries, except for using dbd.get(*dbd.parameterNames, return_nans=True).?

Also, a question regarding a wrapper that eg makes a few binary_to_timeseries calls and merges relevant sensor values and metadata. Since utils.get_profile_new requires 'pressure' and can't identify profiles using 'm_depth', how do you recommend getting profile info for eg an engineering-only netcdf file, or a science netcdf file where science is only sampled on dives? Obvs let me know if this should be a separate issue/discussion.

@jklymak
Copy link
Member

jklymak commented Aug 13, 2024

I don't follow how binary_to_raw_timeseries would be structured - it would be a new method with basically the same code as binary_to_timeseries, except for using dbd.get(*dbd.parameterNames, return_nans=True).?

Basically - if there is a lot of commonality, then we could think about refactoring. You have looked at it more carefully than I have though - if you really think it would be shorter to have the if/else loops as you've done here we should continue the discussion. I'm just looking at it from an end-user point of view that these are some different methods.

Since utils.get_profile_new requires 'pressure' and can't identify profiles using 'm_depth', how do you recommend getting profile info for eg an engineering-only netcdf file,

Yeah, we are discussing factoring out the profile creator. If needed we can quickly have a "skip profiles" flag that would work as well.

@smwoodman
Copy link
Author

I'm just looking at it from an end-user point of view that these are some different methods.

Ahh thank you - it just finally clicked for me what 'raw' and 'timeseries' mean in pyglider world, so I now appreciate the importance of specifying 'raw' for such a method.

Basically - if there is a lot of commonality, then we could think about refactoring. You have looked at it more carefully than I have though - if you really think it would be shorter to have the if/else loops as you've done here we should continue the discussion.

There would be some duplication . However, after really looking at it I don't think it would be unreasonable as long as this chunk was factored out, ie was its own function that could be called by both binary_to_timeseries and binary_to_raw_timeseries (and maybe raw_to_timeseries as well). Name proposal timeseries_meta.?

If this all feels better to you, I'd be happy to take a stab at it? And if so, would you prefer that I update this pr or make a new one?

@smwoodman
Copy link
Author

Yeah, we are discussing factoring out the profile creator. If needed we can quickly have a "skip profiles" flag that would work as well.

No need to change this quickly imo. To clarify, is part of the refactoring considering adding an argument to get_profile_new that would let the user specify which data variable to use to identify profiles, eg 'pressure' or 'm_depth'? And if not would you be open to me opening a new issue for this?

@smwoodman
Copy link
Author

Belatedly closing this pr, as we've moved forward with creating multiple (engineering and science) NetCDF files. Thanks for all of the thoughtful discussion.

@smwoodman smwoodman closed this Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants