
File format definition overhaul #357

Draft: wants to merge 12 commits into base: master


philsmt
Contributor

@philsmt philsmt commented Nov 30, 2022

As part of the PRWG discussions and unrelated discussions in the CAL team, we realized that our "documentation" of the EuXFEL file structure is both the only one in existence and out of date. In addition, as part of the former, it would be useful to describe the file format in a more generic way, not bound only to files written by the DAQ from Karabo.

This is a first draft. I expect a lot of discussion and further work on it. The intention was to describe the structure as abstractly as possible with respect to how and what data actually ends up there, while still commenting on what the "typical use case" of recorded DAQ runs looks like.

I left some of my own open questions in the document, but a brief summary:

  • Do we keep the specification to the last version (at the moment 1.2) and only track changes?
  • Do we make the distinction explicit between slow and fast data? Unfortunately there is a difference in the file structure, but likely only actual Karabo data cares
  • We should add references all over the place for EuXFEL-specific terminology, starting with things like "train"
  • Grey areas of quirky DAQ behaviour, e.g. you get an empty RUN top-level group even if there's no CONTROL group
  • How DAQ-specific are most of the METADATA datasets? What about custom datasets, like for pycalibration?
  • What about first/last/status datasets in INDEX/<source>? Never saw those myself, move to appendix or ignore entirely?

You can find a built version of this branch here.

@philsmt changed the title from "First draft of file format definition overhaul" to "File format definition overhaul" on Nov 30, 2022
@philsmt
Contributor Author

philsmt commented Nov 30, 2022

It's nothing official, but I played with using the term EXDF recently referenced in XDAC talks.

@takluyver
Member

Thanks for tackling this. 👍

> Do we keep the specification to the last version (at the moment 1.2) and only track changes?

I think we'll ultimately want a record of what changed between versions, but most of the contents of the page can be focused on the latest version. Hopefully it won't change very often.

The changes section should also make it clear that this is only describing the file structure, not the contents. E.g. if the axis order for a detector changes, that's a problem and people are going to be annoyed about it, but it's beyond the scope of this document.

> Do we make the distinction explicit between slow and fast data? Unfortunately there is a difference in the file structure, but likely only actual Karabo data cares

I think they're different enough in the files that we have to - e.g. control data always has value & timestamp datasets.
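As a rough sketch of what that rule could look like as a check: the helper below walks a CONTROL-like tree and reports every key group that holds datasets but lacks `value` or `timestamp`. The function name `incomplete_control_keys`, the `SRC/DEVICE` source name and the nested layout are assumptions for illustration, not taken from the specification. It accepts any nested mapping, so it runs on plain dicts; an open `h5py` group should behave the same way, since h5py groups follow the mapping protocol.

```python
from collections.abc import Mapping

# Datasets every control key is expected to carry, per the comment above.
REQUIRED = ("value", "timestamp")


def incomplete_control_keys(group, path="CONTROL"):
    """Return paths of key groups that lack 'value' or 'timestamp'.

    `group` is any nested mapping; anything that is not a mapping is
    treated as a dataset (a leaf).
    """
    missing = []
    children = dict(group.items())
    has_dataset = any(not isinstance(c, Mapping) for c in children.values())
    # Only groups that directly hold datasets are treated as control keys.
    if has_dataset and not all(name in children for name in REQUIRED):
        missing.append(path)
    for name, child in children.items():
        if isinstance(child, Mapping):
            missing.extend(incomplete_control_keys(child, f"{path}/{name}"))
    return missing


# Hypothetical source with one complete and one incomplete key.
example = {
    "SRC": {
        "DEVICE": {
            "good": {"value": [1.0], "timestamp": [0]},
            "broken": {"value": [1.0]},
        }
    }
}
print(incomplete_control_keys(example))
```

Instrument (fast) data has no such per-key requirement, so a check like this would only apply to the CONTROL side.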

> Grey areas of quirky DAQ behaviour, e.g. you get an empty RUN top-level group even if there's no CONTROL group

We'll have to do that case-by-case, really.

For the specific case of an empty RUN group, I don't think we need to document it. It's unlikely anything relies on this group existing, and we don't want to be bound to always create it.

> How DAQ-specific are most of the METADATA datasets? What about custom datasets, like for pycalibration?

daqLibrary and karaboFramework are clearly DAQ-specific. Maybe we should have a more generic 'writer' text field with details of the software that wrote the file.

I think the rest all have some sort of general meaning (creationDate, dataFormatVersion, proposalNumber, runNumber, sequenceNumber, updateDate). But proposal & run numbers assume data is always collected as part of a run - maybe we should write down what's expected if you e.g. write simulated data, or dump data from Influx for a custom time period, or write out combined data from multiple runs.
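To make that split concrete, here is a minimal sketch that sorts the names found in a METADATA group into the two categories above, plus anything custom. The grouping mirrors this comment only; the function name `classify_metadata` and the `customField` entry are hypothetical. It only looks at names, so a plain dict or an open `h5py` group should both work as input.

```python
# Field names as listed in the comment above.
GENERAL_FIELDS = {
    "creationDate", "dataFormatVersion", "proposalNumber",
    "runNumber", "sequenceNumber", "updateDate",
}
WRITER_SPECIFIC_FIELDS = {"daqLibrary", "karaboFramework"}


def classify_metadata(metadata):
    """Sort the dataset names in a METADATA-like mapping into categories."""
    names = set(metadata)  # iterating a mapping yields its keys
    return {
        "general": sorted(names & GENERAL_FIELDS),
        "writer_specific": sorted(names & WRITER_SPECIFIC_FIELDS),
        "missing_general": sorted(GENERAL_FIELDS - names),
        "other": sorted(names - GENERAL_FIELDS - WRITER_SPECIFIC_FIELDS),
    }


# Hypothetical file written outside of a DAQ run, with one custom dataset.
sample = {
    "creationDate": "20221130T120000Z",
    "dataFormatVersion": "1.2",
    "daqLibrary": "unknown",
    "customField": 1,
}
report = classify_metadata(sample)
for category, fields in report.items():
    print(category, fields)
```

The `missing_general` entry is where the open question shows up: simulated or combined data would need a documented convention for `proposalNumber` and `runNumber` rather than simply omitting them.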

> What about first/last/status datasets in INDEX/<source>? Never saw those myself, move to appendix or ignore entirely?

It looks like the first round of experiments in 2017 were saved in that format, but then by 2018 it had been replaced with the first/count format. This was before we'd got version numbers for the format, so it just happened. It's probably a historical curiosity at this point.
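If the old layout does end up in an appendix, the relationship between the two could be sketched as below. This assumes that `last` is an inclusive index and that `status` is a 0/1 flag marking whether a train has data; neither is stated in this thread, so treat both as assumptions. The function name is hypothetical.

```python
def counts_from_legacy_index(first, last, status):
    """Derive per-train counts from the 2017-style first/last/status index.

    Assumes `last` is inclusive and `status` is 0 for trains without data.
    """
    return [(l - f + 1) * s for f, l, s in zip(first, last, status)]


# Three trains: 3 entries, no data, then 3 entries.
first = [0, 3, 3]
last = [2, 3, 5]
status = [1, 0, 1]
counts = counts_from_legacy_index(first, last, status)
print(counts)
```

With the first/count format the same information is stored directly, which avoids the ambiguity around trains without data.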
