Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support MAUDE telemetry for events #343

Open
wants to merge 30 commits into
base: master
Choose a base branch
from

Conversation

taldcroft
Copy link
Member

@taldcroft taldcroft commented Dec 5, 2024

Description

Support using MAUDE as the source of telemetry for kadi events. This has been a long-standing request from FOT engineering and will allow the events to be up to date with ground telemetry soon after MAUDE dump data ingest completes.

In order to realize this reduction in latency, we need to change a couple of things in ground processing:

  • Run kadi_update_events more frequently. Running this once per hour is realistic, but up for discussion.
  • rsync to GRETA at the same cadence but offset in phase (e.g. once per hour at 10 minutes after the hour).

Other changes

There are numerous unrelated changes in commands related to ruff updates. 😞

Interface impacts

Changes in events:

  • safe_suns: Obsid associated with safe sun event might change (depends on particular timing).
  • normal_suns: Normal suns now strictly match AOPCADMD == "NSUN" with no event fuzzing.
  • scs107s: SCS107 strictly based on RW momentum bias disabled. This is more accurate.
  • eclipses: Many spurious short eclipse events (< 100 s) removed.

Testing

Unit tests

  • No unit tests
  • Mac
  • Linux
  • Windows

Independent check of unit tests by [REVIEWER NAME]

  • [PLATFORM]:

Functional tests

Reprocess over two intervals

Reprocess events over two intervals and compare the results:

  • Safe modes in 2023
  • Eclipses in 2024

This was done following the script validate/events/functional_test.sh.

There are 4 versions of reprocessing that are compared:

  • flight uses flight data (no reprocessing, just using the existing flight database)
  • flight-cxc: Reprocess with installed ska3-flight 2024.11 using CXC data as input.
  • test-cxc: Reprocess using this PR code with CXC data as input.
  • test-maude: Reprocess using this PR code with MAUDE data as input.

There are some substantive differences, but they are all understood and acceptable. We did an in-person review of the diffs.

https://cxc.harvard.edu/mta/ASPECT/tests/kadi/pr343/

Install and run this on HEAD

Run this for at least a week on HEAD in exactly the way that is planned for the flight job (e.g. as an hourly cron job).

@taldcroft taldcroft changed the title Support maude telemetry for events Support MAUDE telemetry for events Dec 5, 2024
@jeanconn
Copy link
Contributor

jeanconn commented Dec 5, 2024

Few questions:

  1. We don't really rsync to GRETA anymore, we run cheta_sync there too. Does the whole cheta sync process run into issues if we make more updates and sync them more frequently?
  2. On a related note, it seems like we run into more opportunities for race conditions on the syncing - would we need any other status monitoring in update jobs to avoid cheta-syncing while an update was in progress? Or because of the way the files are moved into position does it just not matter?
  3. Could there be value in running the the update task conditional on something kicked off by MAUDE? (though if even a job to watch or some such would be a regular cron it would kinda be the same)

One comment:
It seems like functional testing would need to also include a period of real on-the side updates in addition to the faux historical ones.

@taldcroft
Copy link
Member Author

taldcroft commented Dec 6, 2024

Few questions:

  1. We don't really rsync to GRETA anymore, we run cheta_sync there too. Does the whole cheta sync process run into issues if we make more updates and sync them more frequently?

See the sync_ska_data_occ repo, which includes a call to rsync for the non-cheta data. The file in question for kadi events is /proj/sot/ska3/events3.db3, so just that one file would be synced more frequently.

  1. On a related note, it seems like we run into more opportunities for race conditions on the syncing - would we need any other status monitoring in update jobs to avoid cheta-syncing while an update was in progress? Or because of the way the files are moved into position does it just not matter?

I think there is a disconnect. The cheta data archive is specifically unrelated to this PR since the idea is to use MAUDE instead.

  1. Could there be value in running the the update task conditional on something kicked off by MAUDE? (though if even a job to watch or some such would be a regular cron it would kinda be the same)

I've been planning on something equivalent, which is to not do the kadi events update at all (even for non-telemetry events) if there is no new telemetry since the last update. In that way the script can kick off every hour, but only about 1 in 8 times does it actually end up updating events3.db3. Likewise an hourly rsync will mostly just do nothing.

My biggest concern is what happens to a client application that is accessing the database while the file gets updated. One option that would be a possibility with this PR is to actually run the update process on GRETA as well since it has access to MAUDE. My main concern there is reduced visibility into processing problems.

One comment: It seems like functional testing would need to also include a period of real on-the side updates in addition to the faux historical ones.

Agreed, I updated the description.

@jeanconn
Copy link
Contributor

jeanconn commented Dec 6, 2024

Thanks Tom! "I think there is a disconnect" yeah - I immediately went to "what if we were updating the cxc cheta archive based on nrt maude" which is explicitly not what you are doing with this kadi PR. Sorry! I suppose the better question is if the there are any differences between data sources such as timing that will cause us any confusion in the future with the events.

@taldcroft
Copy link
Member Author

My biggest concern is what happens to a client application that is accessing the database while the file gets updated.

I just did some experiments on chimchim with updating the events3.db3 file after doing a few kadi event queries. After changing the file with updates, nothing changed on my side. I then did a more extreme change and deleted events3.db3 entirely and even this did not disrupt the running application. Only after quitting IPython and then starting over did it notice that there was no events database file.

Which is to say that the linux kernel / NFS do interesting things and may protect us from problems related to rsync'ing while people are working.

@taldcroft
Copy link
Member Author

if the there are any differences between data sources such as timing that will cause us any confusion in the future with the events.

There will be differences in timing of up to 1/2 the sample period, so e.g. up to 1/2 second for maneuver event times. In practice this doesn't matter since we already have that disconnect between MAUDE/GRETA and CXC times all over the place. Going forward, using MAUDE times will probably be less confusing for FOT engineers.

@jeanconn
Copy link
Contributor

jeanconn commented Dec 6, 2024

Right, the "does it matters if the the file updates while the user is working" question is also important. My long running cron jobs would still crash if the kadi events changed while the job was running (such that if working on head I usually made a local events copy) but that seems inconsistent with your testing just now.

@jeanconn jeanconn requested a review from Copilot December 31, 2024 15:15

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 7 changed files in this pull request and generated no comments.

Files not reviewed (2)
  • kadi/events/init.py: Evaluated as low risk
  • kadi/events/orbit_funcs.py: Evaluated as low risk
Comments suppressed due to low confidence (8)

validate/events/write_events_ecsv.py:60

  • Rounding float columns to 6 decimal places might lead to loss of precision. Consider whether this rounding is necessary for all use cases.
evts[col.name][:] = evts[col.name].round(6)

validate/events/write_events_ecsv.py:1

  • [nitpick] Consider sorting the import statements for better readability and consistency.
import argparse

validate/events/compare_events_ecsv.py:78

  • The time difference calculation should use np.abs instead of abs for consistency with numpy operations.
time_diff_max = np.max(abs(time_diff))

kadi/events/models.py:878

  • [nitpick] The variable name event_val is now a list. It should be renamed to event_vals to indicate it can hold multiple values.
event_val = ["T", "MOVE"]

kadi/events/models.py:1044

  • [nitpick] The variable name event_val is now a list. It should be renamed to event_vals to indicate it can hold multiple values.
event_val = ["T", "MOVE"]

kadi/events/models.py:1189

  • [nitpick] The variable name event_min_dur should be renamed to event_min_duration for clarity.
event_min_dur = 4.0

kadi/events/models.py:1582

  • The use of strip() on change["val"] assumes it is a string. Ensure that change["val"] is always a string before calling strip().
if change["val"].strip() not in nom_vals[change["msid"]]:

kadi/events/models.py:1715

  • Ensure that the new behavior for handling aux_msids is covered by tests.
aux_msidset = fetch.Msidset(cls.aux_msids, start, stop)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants