File format definition overhaul #357

philsmt · 2022-11-30T14:58:52Z

As part of the PRWG discussions and unrelated discussions in the CAL team, we realized that our "documentation" of the EuXFEL file structure is both the only one that exists and out of date. In addition, following from the PRWG discussions, it would be useful to describe the file format in a more generic way, not bound only to files written by the DAQ from Karabo.

This is a first draft. I expect a lot of discussion and further work on it. The intention was to describe the structure as abstractly as possible with respect to how and what data actually ends up there, while still commenting on what the "typical use case" of recorded DAQ runs looks like.

I left some of my own open questions in the document, but here is a brief summary:

- Do we keep the specification to the last version (at the moment 1.2) and only track changes?
- Do we make the distinction explicit between slow and fast data? Unfortunately there is a difference in the file structure, but likely only actual Karabo data cares.
- We should add references all over the place for EuXFEL-specific terminology, starting with things like train.
- Grey areas of quirky DAQ behaviour, e.g. you get an empty RUN top-level group even if there's no CONTROL group.
- How DAQ-specific are most of the METADATA datasets? What about custom datasets, like for pycalibration?
- What about first/last/status datasets in INDEX/<source>? Never saw those myself, move to appendix or ignore entirely?

You can find a built version of this branch here.

philsmt · 2022-11-30T14:59:30Z

It's nothing official, but I experimented with using the term EXDF, which was recently referenced in XDAC talks.

takluyver · 2022-11-30T17:29:35Z

Thanks for tackling this. 👍

> Do we keep the specification to the last version (at the moment 1.2) and only track changes?

I think we'll ultimately want a record of what changed between versions, but most of the contents of the page can be focused on the latest version. Hopefully it won't change very often.

The changes section should also make it clear that this is only describing the file structure, not the contents. E.g. if the axis order for a detector changes, that's a problem and people are going to be annoyed about it, but it's beyond the scope of this document.

> Do we make the distinction explicit between slow and fast data? Unfortunately there is a difference in the file structure, but likely only actual Karabo data cares.

I think they're different enough in the files that we have to - e.g. control data always has value & timestamp datasets.
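
To make that difference concrete, here is a minimal sketch of a check that every control key carries both datasets. It works on plain path strings rather than an open file, it assumes a `CONTROL/<source>/<key>/{value,timestamp}` layout, and `check_control_keys` as well as the source and key names in the usage example are hypothetical, not part of any existing tool.

```python
REQUIRED_LEAVES = {"value", "timestamp"}


def check_control_keys(dataset_paths):
    """Return a dict mapping each CONTROL key to the datasets it lacks.

    Assumption: every dataset path under CONTROL ends in a leaf named
    'value' or 'timestamp', and everything between 'CONTROL' and that
    leaf identifies the source and key.
    """
    leaves = {}
    for path in dataset_paths:
        parts = path.strip("/").split("/")
        if parts[0] != "CONTROL" or len(parts) < 3:
            continue  # not control data, e.g. INSTRUMENT or INDEX
        key = "/".join(parts[1:-1])
        leaves.setdefault(key, set()).add(parts[-1])

    # Keep only keys where at least one required dataset is absent
    return {
        key: sorted(REQUIRED_LEAVES - found)
        for key, found in leaves.items()
        if not REQUIRED_LEAVES <= found
    }


# Hypothetical paths for illustration only
paths = [
    "CONTROL/SRC/DEV/beamPosition/value",
    "CONTROL/SRC/DEV/beamPosition/timestamp",
    "CONTROL/SRC/DEV/flux/value",
    "INSTRUMENT/DET/data/image",
]
print(check_control_keys(paths))
```

Instrument data is skipped on purpose, since nothing in this discussion says it follows the same value/timestamp pairing.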

> Grey areas of quirky DAQ behaviour, e.g. you get an empty RUN top-level group even if there's no CONTROL group.

We'll have to do that case-by-case, really.

For the specific case of an empty RUN group, I don't think we need to document it. It's unlikely anything relies on this group existing, and we don't want to be bound to always create it.

> How DAQ-specific are most of the METADATA datasets? What about custom datasets, like for pycalibration?

daqLibrary and karaboFramework are clearly DAQ specific. Maybe we should have a more generic 'writer' text field with details of the software that wrote the file.

I think the rest all have some sort of general meaning (creationDate, dataFormatVersion, proposalNumber, runNumber, sequenceNumber, updateDate). But proposal & run numbers assume data is always collected as part of a run - maybe we should write down what's expected if you e.g. write simulated data, or dump data from Influx for a custom time period, or write out combined data from multiple runs.
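
As a rough illustration of that split, the sketch below sorts a METADATA-like mapping into the generally meaningful fields and the DAQ-specific ones, and reports which general fields are absent. The field names are the ones listed above. Treating all six general fields as expected is an assumption, and `split_metadata` and the sample values are hypothetical.

```python
# Field names as listed in the discussion above
GENERAL_FIELDS = (
    "creationDate",
    "dataFormatVersion",
    "proposalNumber",
    "runNumber",
    "sequenceNumber",
    "updateDate",
)
DAQ_FIELDS = ("daqLibrary", "karaboFramework")


def split_metadata(metadata):
    """Split a METADATA-like dict into (general, daq, missing).

    'missing' lists general fields that are absent. Whether each of
    them is truly mandatory is an assumption of this sketch.
    """
    general = {k: metadata[k] for k in GENERAL_FIELDS if k in metadata}
    daq = {k: metadata[k] for k in DAQ_FIELDS if k in metadata}
    missing = [k for k in GENERAL_FIELDS if k not in metadata]
    return general, daq, missing


# Hypothetical values, e.g. for a file not written by the DAQ
sample = {
    "creationDate": "20221130T145852Z",
    "dataFormatVersion": "1.2",
    "sequenceNumber": 0,
}
general, daq, missing = split_metadata(sample)
print(missing)
```

A file holding simulated data would show up here with `proposalNumber` and `runNumber` missing, which is exactly the case where the expected behaviour still needs to be written down.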

> What about first/last/status datasets in INDEX/<source>? Never saw those myself, move to appendix or ignore entirely?

It looks like data from the first round of experiments in 2017 was saved in that format, but by 2018 it had been replaced with the first/count format. This was before we had version numbers for the format, so it just happened. It's probably a historical curiosity at this point.
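
If a reader ever does need to handle those old files, the conversion to first/count could look roughly like the sketch below. It assumes that `last` is an inclusive index and that `status` is 0 for trains without data. Neither assumption is confirmed in this thread, and `counts_from_first_last` is a hypothetical helper.

```python
def counts_from_first_last(first, last, status):
    """Derive per-train counts from the old first/last/status index.

    Assumptions: 'last' is inclusive, and a status of 0 marks a train
    with no data for this source.
    """
    if not (len(first) == len(last) == len(status)):
        raise ValueError("first, last and status must have equal length")
    return [
        (l - f + 1) if s else 0
        for f, l, s in zip(first, last, status)
    ]


# Four trains; the third one has no data (status 0)
first = [0, 3, 0, 5]
last = [2, 4, 0, 5]
status = [1, 1, 0, 1]
print(counts_from_first_last(first, last, status))  # [3, 2, 0, 1]
```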

First draft of file format definition overhaul

fb6cdc8

philsmt changed the title ~~First draft of file format definition overhaul~~ File format definition overhaul Nov 30, 2022

philsmt added 3 commits December 1, 2022 17:02

Add proper references

3ec7de9

Add description of file format version changes

efccb8b

(fixup) Minor clean-up

324b428

philsmt force-pushed the docs/update-file-format branch from de7601d to a276150 Compare December 2, 2022 14:22

(fixup) Clarify file format versions

26573fd

philsmt force-pushed the docs/update-file-format branch from a276150 to 26573fd Compare December 5, 2022 08:38

philsmt added 7 commits April 9, 2023 22:05

(fixup) Add data format version 1.3

fa02237

(fixup) Minor fixes

fef4129

(fixup) Extend INSTRUMENT section

eb3951a

(fixup) Sort METADATA into obligatory and DAQ-related fields

c80fd1e

(fixup) Minor clarifications around terms and versions

aee1143

(fixup) Formatting fixes and remove comments

c2d7534

(fixup) Some wording fixes

8beda3a
