HDF5 Ph Data format 0.2 Draft
This document describes version "0.2-Draft" of the HDF5-Ph-Data file format.
For a brief introduction to what the HDF5 format is and why it matters for single-molecule data, see Why an HDF5-based smFRET file format.
NOTE: This specification is still a draft. Comments and suggestions (including typo fixes) are encouraged.
This document contains the specifications for the HDF5-Ph-Data format. This format allows saving data from single-molecule spectroscopy experiments that record at least one stream of photon timestamps. It is intended as a standard container format for a broad range of experiments involving confocal microscopy. Notable examples are confocal smFRET experiments (with or without laser alternation), with either a single or multiple excitation spots. It can also store ns-ALEX or FCS measurements.
- Assuring long term persistence of the data
- A space and speed efficient format for daily use
- Easing dataset sharing and interoperability between analysis programs
- Open, standard, and widely used format with open-source implementations (HDF5)
- Efficient: the HDF5 format is a binary format that allows compression and is fast to read and write
- Flexible: data arrays can be stored in "groups" (hierarchical format). Metadata can be attached to each data entry (attributes). No limit in data size. Support for a variety of numeric and non-numeric data types.
The main design principles we follow are
- Simplicity
- Flexibility
- Compatibility
We aim to define a format that is "small", easy to implement, efficient and expandable while maintaining compatibility.
To achieve "simplicity" we only require the general file layout and the presence of a few basic attributes and parameters. The remaining (small set of) fields defined here are present only when a particular measurement needs them.
We retain flexibility by allowing the user to save arbitrary data outside the specs of this document. To ensure that a future version of this format will not clash with user-defined fields, we require that all user-defined fields be contained in groups named user.
An overview of the data format is shown in the following figure
TODO: Show a TOC view of a typical file, identifying root data, photon group data and subgroups. Mention Metadata.
There are mandatory and optional parameters (scalars or arrays of scalars) in the root group, as discussed in the next two sections. The root group also contains one or more photon_data groups, which contain their own data, as discussed in sections 2.2 & 2.3. Finally, there are optional groups providing information on the sample and setup, as well as user-specific information, discussed in the last sections.
- timestamps_unit: (float) time in seconds of a 1-unit increment in timestamps. Normally, timestamps are integers and the unit increment is determined by the acquisition electronics. However, timestamps can also be floats that express the time in seconds. In this case timestamps_unit is set to 1.
- num_spots: (integer) the number of excitation or detection spots. It is 1 for single-spot measurements.
- alex: (boolean) if True (i.e. = 1), the measurement uses alternated excitation.
- lifetime: (boolean) if True (i.e. = 1), the data contains nanotime information for each photon (provided by TAC/TDC, i.e. TCSPC, hardware) in addition to the standard macrotime information (typically provided by a digital clock with a 10-100 ns period).
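As an illustration of how timestamps_unit is meant to be used, the following minimal sketch converts raw timestamps to seconds. The helper name timestamps_to_seconds, the 12.5 ns clock period and the sample values are assumptions made for this example and are not part of the format. Plain Python lists stand in for HDF5 arrays.

```python
def timestamps_to_seconds(timestamps, timestamps_unit):
    """Scale raw timestamps by timestamps_unit to obtain times in seconds."""
    return [t * timestamps_unit for t in timestamps]


# Hypothetical values: a 12.5 ns acquisition clock and three raw timestamps.
timestamps_unit = 12.5e-9
raw_timestamps = [0, 80, 160000]

seconds = timestamps_to_seconds(raw_timestamps, timestamps_unit)
```

When timestamps are already floats expressed in seconds, timestamps_unit is 1 and the conversion leaves the values unchanged.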
OPEN QUESTION: How to handle the case of 2 laser excitation and only 1 laser alternation?
ANSWER: By defining a range for the acceptor excitation period (alex_period_acceptor) that selects all the photons, i.e. (0, alex_period).
Currently, the conditional parameters are those needed to interpret data acquired with alternating laser excitation (ALEX), whether μs-ALEX or ns-ALEX (also known as PIE, pulsed-interleaved excitation). Some of these parameters are mandatory for ALEX data; the others are optional.
Mandatory for ALEX data (alex == True):
- alex_period: (integer or float) the duration of one excitation alternation period. For μs-ALEX data, it is expressed in timestamp units, such that alex_period * timestamps_unit is the alternation period in seconds. For ns-ALEX data (lifetime == True), alex_period is expressed in TCSPC bin units, such that the alternation period in seconds is alex_period * tcspc_bin.
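The two unit conventions for alex_period can be sketched as follows. The function name alex_period_seconds and the numeric values are hypothetical. The sketch assumes that lifetime == True is what distinguishes ns-ALEX from μs-ALEX, as the text above suggests.

```python
def alex_period_seconds(alex_period, lifetime, timestamps_unit=None, tcspc_bin=None):
    """Return the alternation period in seconds.

    ns-ALEX (lifetime is True): alex_period is in TCSPC bin units.
    us-ALEX (lifetime is False): alex_period is in timestamp units.
    """
    if lifetime:
        if tcspc_bin is None:
            raise ValueError('tcspc_bin is required for ns-ALEX data')
        return alex_period * tcspc_bin
    if timestamps_unit is None:
        raise ValueError('timestamps_unit is required for us-ALEX data')
    return alex_period * timestamps_unit


# Hypothetical us-ALEX case: 4000 clock ticks of 12.5 ns each.
us_alex = alex_period_seconds(4000, lifetime=False, timestamps_unit=12.5e-9)

# Hypothetical ns-ALEX case: 4096 TCSPC bins of 16 ps each.
ns_alex = alex_period_seconds(4096, lifetime=True, tcspc_bin=16e-12)
```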
OPEN QUESTION: Why integer OR float and not just float?
Optional for ALEX data:
- alex_period_donor: (array of ints with an even number of elements) the start and stop values identifying the donor emission period.
- alex_period_acceptor: (array of ints with an even number of elements) the start and stop values identifying the acceptor emission period.
**QUESTION: What's the difference between integer and ints?
NOTE: For μs-ALEX, alex_period_donor and alex_period_acceptor are both 2-element arrays. For ns-ALEX, they are arrays with an even number of elements, comprising as many start-stop pairs as there are excitation periods in the TAC/TDC range. In both cases, they have the same unit as alex_period.
The fields alex_period_donor and alex_period_acceptor allow identifying photons detected during donor or acceptor excitation. As an example, let's define the array A = "timestamps MODULO alex_period", i.e. the array of timestamps modulo the ALEX alternation period. Photons emitted during the donor period (respectively acceptor period) are obtained by applying one of these two conditions:
- (A > start) and (A < stop) when start < stop (internal range)
- (A > start) or (A < stop) when start > stop (external range).
SUGGESTION: this requires a schematic to explain what is meant by internal and external range.
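The internal and external range conditions can be sketched in a few lines. The function name in_excitation_period and all numeric values are hypothetical. An external range is one that wraps around the end of the alternation period. The case start == stop is not defined by the text above; in this sketch it falls into the external-range branch.

```python
def in_excitation_period(timestamps, alex_period, start, stop):
    """Return one boolean per photon: True if it falls in the (start, stop) range."""
    phases = [t % alex_period for t in timestamps]  # the array A of the text
    if start < stop:
        # Internal range: the window lies inside one alternation period.
        return [(a > start) and (a < stop) for a in phases]
    # External range: the window wraps around the end of the period.
    return [(a > start) or (a < stop) for a in phases]


# Hypothetical us-ALEX data: period of 4000 timestamp units.
timestamps = [10, 1500, 2500, 3900, 4100]
alex_period = 4000

donor_mask = in_excitation_period(timestamps, alex_period, 100, 1900)
acceptor_mask = in_excitation_period(timestamps, alex_period, 2100, 50)
```

Here the donor window (100, 1900) is an internal range, while the acceptor window (2100, 50) is an external range that also selects photons at the very beginning of each period.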
A file can contain one or more photon data groups. For instance, a typical single-spot experiment will contain a single photon data group, while a multispot experiment will in general contain as many photon data groups as there are spots. The multispot case is detailed in Section 2.3. The typical content of a photon data group is briefly described in the next section. The following sections then describe each component in more detail.
** An overview of the photon data group format is shown in the following figure
SUGGESTION: Show a TOC view of a photon data group, identifying arrays and specs subgroups
Each photon data group is named /photon_data_n, where n is an integer designating the spot number (or simply /photon_data when there is a single spot).
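The naming rule can be sketched as a small helper. The function name photon_data_group_names is hypothetical, and the sketch assumes that spot numbers start from 0 and that one group per spot is wanted (the multispot layout described later in this document).

```python
def photon_data_group_names(num_spots):
    """Return the photon data group paths for a file with num_spots spots."""
    if num_spots < 1:
        raise ValueError('num_spots must be at least 1')
    if num_spots == 1:
        # Single spot: the group name carries no numeric suffix.
        return ['/photon_data']
    # Several spots: one group per spot, numbered from 0 (assumption).
    return ['/photon_data_%d' % n for n in range(num_spots)]


single_spot = photon_data_group_names(1)
three_spots = photon_data_group_names(3)
```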
A fixed number of pieces of information is associated with each photon; this number depends on the experiment. The supported types of information are described below. For example, timestamp ("timestamps") and detector ID number ("detectors") would be the minimum number of pieces of information for each photon. Each type of information is stored in an array with size equal to the number of photons in the group.
In addition, parameters (specifications) common to all photons in the group (scalars or arrays of scalars) are stored within separate subgroups. Each subgroup's name ends with the suffix "_specs" (for instance "detectors_specs").
Finally, flexibility for customization is provided by custom "user" subgroups, which can reside at all levels of the hierarchy (for instance '/photon_data/user/'). Those can be a location to save additional photon or specification information not anticipated by the format.
- timestamps: (array of integer or float) contains all timestamps.
** QUESTION: Why float? Because of simulations? That does not fly well with strictly typed languages. I would suggest to stick with integer (specify the type).
- detectors: (array of integer or float) contains the detector ID number corresponding to each photon. This array is optional if there is a single detector. Each physical detector (for example donor and acceptor channels) needs to have a unique label (a non-negative integer). For example, measurements of smFRET with polarization anisotropy using a single donor-acceptor pair require 4 detectors, and therefore need 4 different labels (e.g. 0 - 3). The mapping between labels and detectors is given by the information provided in the detectors_specs subgroup (see below).
** QUESTION: Why float? That does not fly well with strictly typed languages. I would suggest to stick with integer.
** Mention what happens if there is a detector group in the root folder and in each of the photon_data_n folders.
- detectors: (array of integers) contains the detector ID number for each timestamp in timestamps.
NOTE: the detectors array is optional if and only if there is a single detector or only one detector per spot.
- nanotimes: (array of integers) contains the TCSPC nanotimes. This array is only required if lifetime is True.
- particles: particle label (or ID number) for each timestamp. This optional array is used when the data comes from a simulation providing particle ID information.
Arrays in the photon_data group can have additional associated information that is not "photon specific" and therefore does not justify the use of an array with one value per photon. This data is instead stored in a subgroup with a _specs suffix.
To provide information about whether a photon has been detected in the donor or acceptor channel, and/or in the parallel or perpendicular polarization channel, the following arrays are defined inside the detectors_specs group:
- donor: (array of integers) list of detectors for the donor channel. A standard smFRET measurement will have only one value. An smFRET measurement with polarization (4 detectors) will have 2 values. For a multispot measurement, it will contain the list of donor-channel detectors. The order matters (what does this mean?).
- acceptor: (array of integers) list of detectors for the acceptor channel. A standard smFRET measurement will have only one value. An smFRET measurement with polarization (4 detectors) will have 2 values. For a multispot measurement, it will contain the list of acceptor-channel detectors. The order matters.
- polariz_paral: (array of ints) list of detectors for the parallel polarization.
- polariz_perp: (array of ints) list of detectors for the perpendicular polarization.
** QUESTION: is this the most flexible way of defining Channel+Polarization combination?
Additional detector specifications can be saved in a dedicated subgroup: detectors_specs/user/.
NOTE 1: If a single spectral channel is acquired, the detector(s) can be put in either detectors_donor (detectors_specs/donor?) or detectors_acceptor (detectors_specs/acceptor?) but not in both. These arrays may be omitted when not relevant. (please, explain)
NOTE 2: If no polarization selection is performed in the detection path, the polarization fields should (can?) be omitted. If only one polarization is acquired, the detector ID number should go in either polariz_paral or polariz_perp, but not both.
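One possible reading of the two notes above is that a detector ID should never appear in both arrays of a pair (donor and acceptor, or polariz_paral and polariz_perp). The following sketch checks that reading. The function name find_detector_conflicts and the example dictionaries are hypothetical; a plain dictionary stands in for the detectors_specs group.

```python
def find_detector_conflicts(detectors_specs):
    """Return detector IDs listed in both arrays of a mutually exclusive pair."""
    exclusive_pairs = [('donor', 'acceptor'), ('polariz_paral', 'polariz_perp')]
    conflicts = {}
    for first, second in exclusive_pairs:
        # A missing array is treated as empty, since the arrays may be omitted.
        shared = set(detectors_specs.get(first, [])) & set(detectors_specs.get(second, []))
        if shared:
            conflicts[(first, second)] = sorted(shared)
    return conflicts


# Hypothetical 4-detector smFRET measurement with polarization.
valid_specs = {'donor': [0, 1], 'acceptor': [2, 3],
               'polariz_paral': [0, 2], 'polariz_perp': [1, 3]}
# Hypothetical invalid case: detector 0 listed as both donor and acceptor.
invalid_specs = {'donor': [0], 'acceptor': [0]}

valid_conflicts = find_detector_conflicts(valid_specs)
invalid_conflicts = find_detector_conflicts(invalid_specs)
```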
a. If a nanotimes array is present, the following specifications need to be provided:
- tcspc_bin: (float) TAC/TDC bin size (in seconds). The same as /nanotimes_unit. (so why is a different specification needed?)
- tcspc_nbins: (int) TAC/TDC number of bins.
- tcspc_range: (float) full-scale range of the TAC/TDC hardware in seconds.
NOTE: In principle, tcspc_range is equal to tcspc_bin * tcspc_nbins. The redundant parameters simplify reading the data.
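Because the three TCSPC parameters are redundant, a reader may want to verify that they agree. A minimal sketch of such a check follows. The function name tcspc_specs_consistent, the tolerance and the numeric values are assumptions made for this example.

```python
import math


def tcspc_specs_consistent(tcspc_bin, tcspc_nbins, tcspc_range, rel_tol=1e-9):
    """Check that tcspc_range equals tcspc_bin * tcspc_nbins within a tolerance."""
    return math.isclose(tcspc_bin * tcspc_nbins, tcspc_range, rel_tol=rel_tol)


# Hypothetical hardware: 4096 bins of 16 ps each.
tcspc_bin = 16e-12
tcspc_nbins = 4096

consistent = tcspc_specs_consistent(tcspc_bin, tcspc_nbins, tcspc_bin * tcspc_nbins)
inconsistent = tcspc_specs_consistent(tcspc_bin, tcspc_nbins, 50e-9)
```

A relative tolerance is used instead of an exact comparison because the stored floats may differ by rounding.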
b. Optionally, if the data comes from simulations, the nanotime specification subgroup can contain the following specifications:
- tau_accept_only: (float) intrinsic acceptor lifetime (seconds).
- tau_donor_only: (float) intrinsic donor lifetime (seconds).
- tau_fret_donor: (float) donor lifetime in presence of acceptor (seconds).
- tau_fret_trans: (float) FRET energy transfer lifetime (seconds). Inverse of the rate of D*A -> DA*.
**QUESTION: These must be calculated values? What is the difference between the penultimate and last ones?
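One way to relate these quantities, assuming the usual kinetic picture in which energy transfer adds a decay channel to the excited donor (an assumption of this sketch, not a statement of the specification), is 1/tau_fret_donor = 1/tau_donor_only + 1/tau_fret_trans. Under the same assumption the FRET efficiency is 1 - tau_fret_donor/tau_donor_only. The function names and numeric values below are hypothetical.

```python
def donor_lifetime_with_acceptor(tau_donor_only, tau_fret_trans):
    """Donor lifetime when the transfer rate adds to the intrinsic decay rate."""
    return 1.0 / (1.0 / tau_donor_only + 1.0 / tau_fret_trans)


def fret_efficiency(tau_donor_only, tau_fret_donor):
    """FRET efficiency computed from the donor lifetimes (assumed model)."""
    return 1.0 - tau_fret_donor / tau_donor_only


# Hypothetical simulation parameters, in seconds.
tau_donor_only = 4e-9
tau_fret_trans = 4e-9

tau_fret_donor = donor_lifetime_with_acceptor(tau_donor_only, tau_fret_trans)
efficiency = fret_efficiency(tau_donor_only, tau_fret_donor)
```

In this picture tau_fret_trans describes the transfer step alone, while tau_fret_donor is the observable donor lifetime that results from both decay channels.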
c. Additional specs can be saved in nanotimes_specs/user/.
Multi-spot measurements can be saved using the basic layout described in previous sections. In this case, the timestamps array contains all timestamps from all channels and the detectors array allows identifying detectors. In the case of smFRET measurements, the donor and acceptor arrays in detectors_specs each contain an ordered list of detector numbers, whose length is the number of spots.
** That's where the order matters, so that you can build pairs of donor/acceptors, I suppose?
This structure is convenient to use when CREATING a data file, as it uses only two arrays (one for timestamps, one for detectors) and does not require dispatching each photon to a spot-specific photon_data group.
However, it is not a very efficient data structure for repeatedly reading multispot data because, in order to extract photon data for a single channel, all timestamps and detectors must first be read and then sorted out.
A more efficient way of storing multispot data, once it has been sorted out, is provided by a layout variant called "multispot layout".
The "multispot layout" is identical to the basic layout for single-spot data. The only difference is that, instead of having a single group /photon_data, there are now N photon data groups: /photon_data_0 .. /photon_data_(N-1), one for each spot. Each group has a suffix indicating the spot number (starting from 0).
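Converting from the basic layout to the multispot layout amounts to dispatching each photon to its spot. The sketch below assumes that donor[i] and acceptor[i] are the two detectors of spot i, which is one reading of "the order matters" and is not confirmed by the specification. The function name split_by_spot and the sample data are hypothetical; plain lists and dictionaries stand in for HDF5 arrays and groups.

```python
def split_by_spot(timestamps, detectors, donor, acceptor):
    """Dispatch photons from single arrays into one group per spot."""
    # Map each detector ID to its spot number (assumed pairing by position).
    spot_of_detector = {}
    for spot, (d, a) in enumerate(zip(donor, acceptor)):
        spot_of_detector[d] = spot
        spot_of_detector[a] = spot

    groups = {spot: {'timestamps': [], 'detectors': []}
              for spot in range(len(donor))}
    for t, det in zip(timestamps, detectors):
        spot = spot_of_detector[det]
        groups[spot]['timestamps'].append(t)
        groups[spot]['detectors'].append(det)

    # Name each group after its spot number, starting from 0.
    return {'/photon_data_%d' % spot: arrays for spot, arrays in groups.items()}


# Hypothetical 2-spot measurement: spot 0 uses detectors 0 and 2, spot 1 uses 1 and 3.
timestamps = [5, 7, 9, 12, 15]
detectors = [0, 3, 2, 1, 0]
layout = split_by_spot(timestamps, detectors, donor=[0, 1], acceptor=[2, 3])
```

Since the input timestamps are read in order, the timestamps within each spot keep their original ordering.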
** explain that there can now be a detector_specs subgroup for each photon_data group. Can there be both a root level detector_specs group and several spot-specific ones? If so, which one supersedes the other one?
The HDF5-Ph-Data format defines an optional "sample" group where information about the measured sample can be stored. This data is stored in the group /sample_specs.
Within /sample_specs, the following fields are defined:
- num_dyes: (integer) number of different dyes present in the sample. For a standard single-pair FRET measurement the value is 2. For donor-only or acceptor-only measurements the value is 1.
- dye_names: (array of string) list of dye names (for example: ['ATTO550', 'ATTO647N']).
- buffer_name: (string) free-form description of the sample buffer. For example 'TE50 + 1mM TROLOX'.
- sample_name: (string) free-form description of the sample. For example '40-bp dsDNA, D-A distance: 7-bp'.
The optional group /setup_specs contains fields describing the measurement setup:
- excitation_wavelengths: (array of float) array of excitation wavelengths in S.I. units (meters).
- excitation_powers: (array of float) array of excitation powers (in the same order as excitation_wavelengths). The powers are expressed in S.I. units (Watts).
- polarizations: (array of float) polarization angle (radians).
** For polarization, we need to find a way to distinguish between linear and elliptically/circularly polarized excitation.
An unlimited number of user-defined fields are allowed. To make sure that future versions of this format will not collide with any user-defined field names, custom data should be contained in a group named user. A user group can be placed anywhere in the HDF5 hierarchy and should be placed wherever it is most logical for the kind of data stored. As an example, user data can be stored in '/user', '/photon_data/user', '/photon_data/nanotimes_specs/user', '/setup_specs/user', etc.
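A reader or validator could use a rule like the following to decide whether a node belongs to user-defined data. This is a minimal sketch: the function name is_user_path is hypothetical, and it assumes that a node counts as user data when any component of its path is exactly user.

```python
def is_user_path(path):
    """Return True if any component of the HDF5 path is a group named 'user'."""
    return 'user' in path.strip('/').split('/')


# Paths taken from the examples above, plus two hypothetical ones.
checks = {
    '/user': is_user_path('/user'),
    '/photon_data/nanotimes_specs/user': is_user_path('/photon_data/nanotimes_specs/user'),
    '/photon_data/user/my_field': is_user_path('/photon_data/user/my_field'),
    '/photon_data/timestamps': is_user_path('/photon_data/timestamps'),
    '/photon_data/username': is_user_path('/photon_data/username'),
}
```

Matching whole path components avoids treating a field such as username as user data.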
The root node needs to include the following attributes:
format_name = 'HDF5-Ph-Data'
format_title = 'HDF5-based format for time-series of photon data.'
format_version = '0.2'
Each group or array needs to have a description attribute named TITLE (following the same convention as pytables).
The description attributes for each field are listed below using python dictionary syntax:
fields_meta = dict(
    # Global data
    timestamps_unit = 'Time in seconds of 1-unit increment in timestamps.',
    num_spots = 'Number of excitation or detection spots',
    alex = 'If True the file contains ALternated EXcitation data.',
    lifetime = 'If True the data contains nanotimes from TCSPC hardware',
    alex_period = ('The duration of the excitation alternation using '
                   'the same units as the timestamps.'),
    alex_period_donor = ('Start and stop values identifying the donor '
                         'emission period of us-ALEX measurements'),
    alex_period_acceptor = ('Start and stop values identifying the acceptor '
                            'emission period of us-ALEX measurements'),
    # Photon-data
    photon_data = ('Group containing arrays of photon-data (one element per '
                   'photon)'),
    timestamps = 'Array of photon timestamps',
    detectors = 'Array of detector numbers for each timestamp',
    nanotimes = 'TCSPC photon arrival time (nanotimes)',
    particles = 'Particle label (integer) for each timestamp.',
    detectors_specs = 'Group for detector-specific data.',
    donor = 'Detectors for the donor spectral range',
    acceptor = 'Detectors for the acceptor spectral range',
    polariz_paral = 'Detectors for polarization parallel to excitation',
    polariz_perp = 'Detectors for polarization perpendicular to excitation',
    nanotimes_specs = 'Group for nanotime-specific data.',
    tcspc_bin = 'TCSPC time bin duration in seconds (nanotimes unit).',
    tcspc_nbins = 'Number of TCSPC bins.',
    tcspc_range = 'TCSPC full-scale range in seconds.',
    tau_accept_only = 'Intrinsic Acceptor lifetime (seconds).',
    tau_donor_only = 'Intrinsic Donor lifetime (seconds).',
    tau_fret_donor = 'Donor lifetime in presence of Acceptor (seconds).',
    tau_fret_trans = ('FRET energy transfer lifetime (seconds). Inverse of '
                      'the rate of D*A -> DA*.'),
)
Additional attributes are allowed in any node but they should not overlap with standard pytables attributes.