Add support for checking missing sources with extra-data-validate #280

JamesWrigley · 2022-02-22T17:39:58Z

This adds support for checking that sources in runs are not missing from more than some percentage of trains. MID noticed that occasionally certain devices (particularly cameras and fastADCs) would be missing from runs, and this wasn't noticed until later when they tried to analyze the data.

The goal is for this to be run automatically by the RunValidator device, perhaps with a setting to change the missing data threshold. You can test it with this run, which is missing a bunch of data from the AGIPD and some cameras: /gpfs/exfel/exp/MID/202221/p003210/raw/r0008.

Couple of things I'm not too sure about:

Does performance matter? This is implemented quite naively, and on the above run it takes ~8s (vs ~2s without checking sources). Since every file is checked anyway I imagine it's possible to modify the FileValidator to report the data counts for each source/key it contains.
Is it necessary to report every single key that is missing data? I don't know exactly how the DAQ works, but if it's guaranteed to dump all data for a source per-train, then I think it'd be enough to show e.g. <source> is missing data for <n> trains instead of listing every key. But maybe pipeline data is handled differently, e.g. for the big detectors?
Should there be a way to limit which sources are checked? I think not, unless for the sake of performance (in which case I'd rather optimize this implementation as in the first point).
Should it be possible to set a threshold per-source? As this is not much more than a sanity check, I think it's not necessary.

(BTW it's probably easiest to review each commit individually, they're fairly atomic)

Also removed the `suffix` argument from progress_bar() and `nproblems` from RunValidator.progress().

This uses a user-defined threshold of trains with missing data to decide which sources to report.

tmichela · 2022-02-23T08:35:28Z

Thanks James. To answer some of your questions:

Does performance matter?

It is not critical, but it is important

Is it necessary to report every single key that is missing data?

Probably not. In general all keys come from the same data hash in pipeline data sources, so it's probably safe to assume if one is written, all keys from the same hash are too. The MHz detector are a special case, they have 4 sub-sections (image, header, trailer, detector) and each can be individually present or not. (having said that, for analysis, only image is relevant).

now, I'm not sure this is the right place for this check. Currently extra-data-validate purpose is to check that a file is not broken, i.e. it respects the format definition. Missing data is not a file problem and is actually something "normal" as probably almost all runs will have missing data for one or multiple data sources. I also know that people tend to let some instrument selected in the DAQ even if they aren't used, so you get no data for them all the time.
I think it will generate a lot of noise in the output of validate if you check the data ratio.

Also, for your interest, Robert and I are working on extending how we inform people about data issues. We will have a DnD to brainstorm on that idea when Robert is back. And thinking of a way to actively notifying beam staff about these problems is a great use case (is that a reasonable timeframe for you?)

There are also a couple of way already available to "solve" this issue:

using lsxfel

e.g. something like:

$ lsxfel ./r0001 --detail "*:*" | grep "data for" -B2 | sed -n 1~2p

  - HED_EXP_IC1/ADC/1:channel_0.output
      data for 0 trains (0.00%), up to 0 entries per train
  - HED_EXP_IC1/ADC/1:channel_1.output
      data for 0 trains (0.00%), up to 0 entries per train
  - HED_EXP_IC1/ADC/1:channel_2.output
      data for 2734 trains (100.00%), up to 1 entries per train
  ...

the RunValidator device already shows the data ratio per source
The DAQ recently added a metrics table, did MID try to use that?

tmichela · 2022-02-23T08:36:49Z

maybe @takluyver has a different opinion?

JamesWrigley · 2022-02-23T09:04:04Z

now, I'm not sure this is the right place for this check. Currently extra-data-validate purpose is to check that a file is not broken, i.e. it respects the format definition. Missing data is not a file problem and is actually something "normal" as probably almost all runs will have missing data for one or multiple data sources. I also know that people tend to let some instrument selected in the DAQ even if they aren't used, so you get no data for them all the time. I think it will generate a lot of noise in the output of validate if you check the data ratio.

Ouch, yeah if that's the case then this will be waaaay too noisy 😬 I could add an option to only check certain sources, but if that source list comes from the DAQ it doesn't solve anything...

Also, for your interest, Robert and I are working on extending how we inform people about data issues. We will have a DnD to brainstorm on that idea when Robert is back. And thinking of a way to actively notifying beam staff about these problems is a great use case (is that a reasonable timeframe for you?)

Ah hah, that sounds cool. I don't know when Robert is back, but I'll check with MID if the things you listed below would help in the meantime.

2. the RunValidator device already shows the data ratio per source

I think they wanted something more warning-y, but perhaps they don't see those logs in their main scene. I'll check that.

3. The DAQ recently added a metrics table, did MID try to use that?

Oooo that looks very useful 👀 No I'm not sure, I'll ask.

philsmt · 2022-02-23T13:52:41Z

I think the option of quickly checking for this is very useful, but as Thomas suggested it is not universally an error condition for data to be missing on a given train. On the contrary, there are devices either intentionally or by their nature emitting data at less than 10 Hz. But that's not a problem with the feature itself, but giving it a finite default value. It should be disabled by default and easily turned on, even permanently if the instrument chooses so for the run validator.

On the more technical side, I think there are plenty of assumptions you can include to reduce the number of walked keys:

CONTROL data is basically guaranteed to exist for every train the DAQ itself registers, as it copies its value from the previous train if there has been no property update. I'm not entirely sure on this point, but from the top of my head I cannot recall an instance this has not been true. Naturally a DAQ bug may cause this, so one could allow a flag to restrict missing-data-check to INSTRUMENT data only if performance matters.
Going further to INSTRUMENT data, the DAQ indexes fast data by its root key (e.g. data or image). All keys below this key must exist for every input, or the DAQ will drop the entire train on this source.

takluyver · 2022-02-23T14:34:58Z

I think there's a practical problem with this: as I understand it, one of the easiest ways for data to go missing is when the source is not configured in the DAQ. In that case, 100% of the data is missing, but the source name isn't there in the files, so this check wouldn't see a problem.

My inclination is to agree with @tmichela, that we should keep the scope of extra-data-validate limited to 'are these files correctly structured?', and missing data is not an issue with the file structure. But from a user's perspective, I know the difference between a DAQ bug and a bug or a misconfiguration somewhere else in Karabo is not important, and the 'validation' that a lot of people want is 'is my data being saved?' So I don't want to get too hung up on the principle of this.

I wonder if the subjective question of 'is enough of the data that I care about present in these files?' is better dealt answered by a separate tool than the objective one extra-data-validate tries to answer. This is roughly what I was going for with the percentages in lsxfel, but that's still distinctly clumsy.

JamesWrigley · 2022-02-23T17:19:19Z

I think the option of quickly checking for this is very useful, but as Thomas suggested it is not universally an error condition for data to be missing on a given train. On the contrary, there are devices either intentionally or by their nature emitting data at less than 10 Hz. But that's not a problem with the feature itself, but giving it a finite default value. It should be disabled by default and easily turned on, even permanently if the instrument chooses so for the run validator.

That makes sense, and indeed I didn't think of cases where a device doesn't send data at 10Hz (one example I can think of would be the Shimadzu camera at SPB).

I think there's a practical problem with this: as I understand it, one of the easiest ways for data to go missing is when the source is not configured in the DAQ. In that case, 100% of the data is missing, but the source name isn't there in the files, so this check wouldn't see a problem.

Hmm, that's not really the problem I'm trying to solve. If a source is missing from the DAQ then that's definitely an operator error. This validation is only looking for cases where the DAQ is configured correctly and data should have been written, but it wasn't. That's a more insidious problem to discover because technically the bad sources are present in the files, so lsxfel and RunDirectory etc will report them as being present, it's only when someone checks how many trains there are that they'd notice. If a source is missing from the DAQ then at least it'll be obvious as soon as someone tries to use it from the files in any way.

I wonder if the subjective question of 'is enough of the data that I care about present in these files?' is better dealt answered by a separate tool than the objective one extra-data-validate tries to answer.

I don't have a strong opinion on that. Personally I would lean towards keeping it in extra-data-validate for the reasons you mentioned before WRT users not caring about the distinction between formatting and 'is my data being saved'.

So my takeaways are:

This should be optional, and disabled by default since the risk of false positives is high.
It should be possible to specify which sources to include or exclude from the checks.
Performance can be improved by only looking at a single key for each INSTRUMENT source, and could add an option to disable checking control data for performance.

Does that sound good? I think it's simpler if the feature stays in extra-data-validate, but I could also put it in a new command (extra-data-check-missing?).

tmichela · 2022-02-24T07:41:18Z

So my takeaways are:

Yes, that sounds reasonable.

Although, since we are starting to work on a project that is meant to handle any kind of data validity issues (and data loss definitely falls in this scope) I would prefer not to add that right now, and wait until we have a clearer plan on how this project will handle and inform users with data issues.

What about this?

We update the runValidator to make this kind of issues more visible and configurable, since that is the original problem you're trying to solve.
we discuss the direction this validation project should go (and this specific check as part of this project), when Robert is back (in 2 weeks).

JamesWrigley · 2022-02-24T08:50:56Z

Yep yep, that sounds good to me 👍

JamesWrigley added 4 commits February 22, 2022 18:14

Refactor the progress bar to show a percentage instead of file count

1f52c4b

Also removed the `suffix` argument from progress_bar() and `nproblems` from RunValidator.progress().

Remove old argument from docstring

ff7b8d6

Allow make_fxe_run to control how many trains are missing data

45f8293

Add support for checking whether any sources have missing data

ce4d970

This uses a user-defined threshold of trains with missing data to decide which sources to report.

JamesWrigley requested review from takluyver and tmichela February 22, 2022 17:39

JamesWrigley self-assigned this Feb 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for checking missing sources with extra-data-validate #280

Add support for checking missing sources with extra-data-validate #280

JamesWrigley commented Feb 22, 2022 •

edited

Loading

tmichela commented Feb 23, 2022

tmichela commented Feb 23, 2022

JamesWrigley commented Feb 23, 2022

philsmt commented Feb 23, 2022 •

edited

Loading

takluyver commented Feb 23, 2022

JamesWrigley commented Feb 23, 2022

tmichela commented Feb 24, 2022

JamesWrigley commented Feb 24, 2022

Add support for checking missing sources with extra-data-validate #280

Are you sure you want to change the base?

Add support for checking missing sources with extra-data-validate #280

Conversation

JamesWrigley commented Feb 22, 2022 • edited Loading

tmichela commented Feb 23, 2022

tmichela commented Feb 23, 2022

JamesWrigley commented Feb 23, 2022

philsmt commented Feb 23, 2022 • edited Loading

takluyver commented Feb 23, 2022

JamesWrigley commented Feb 23, 2022

tmichela commented Feb 24, 2022

JamesWrigley commented Feb 24, 2022

JamesWrigley commented Feb 22, 2022 •

edited

Loading

philsmt commented Feb 23, 2022 •

edited

Loading