This package implements Pytorch-style Iterable Datasets for D3D data. It contains It contains helper function to fetch data from MDS and store it locally in HDF5 files. It also contains custom samplers to easily load batch random sequences from the data. These can be used in conjunction with data loaders.
Import and instantiate the loader like this:
>>> from d3d_loaders.d3d_loaders import D3D_dataset
>>> from d3d_loaders.samplers import RandomBatchSequenceSampler, collate_fn_random_batch_seq
>>> shotnr = 172337
>>> t_params = {'tstart' : 200.0,
'tend' : 1000.0,
'tsample': 1.0
>>> }
>>> shift_targets = {'ae_prob': 10.0}
>>> device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
>>> my_ds = D3D_dataset(shotnr, t_params,
>>> predictors=["pinj", "tinj", "iptipp", "dstdenp", "doutu", "dssdenest", "ae_prob"],
>>> targets=["ae_prob"],
>>> shift_targets=shift_targets,
>>> datapath="/projects/EKOLEMEN/d3dloader/test",
>>> device=device)
Here shotnr
gives the shot from which the signals are used. Together tstart
, tend
, and
tsample
define a common time base for all signals. The signals are resampled in a manner that
conserves causality, that is, no future information is used to generate a single sample.
The dataset defines two groups of signals. Predictors
are to be used as input to a model and
targets
are to be used as ground-truth outputs of the model. The dictionary shift_targets
defines an offset by which targets are shifted into the future. The name of the signals
refer to either MDS
nodes or PTDATA
point names. The names of the signals and their
location in MDS
or PTDATA
is defined in the file downloading.py
. A custom signal,
ae_prob
refers to output of Aza's Alfven Eigenmode probability model.
Predictor and target signals are stored in the dictionaries my_ds.predictors
and my_ds.target
respectively.
The datapath
argument defines the directory where the D3D data will be cached
in HDF5
files. There is one HDF5
file per shot, which stores the individual signals within
a group. If a signal is not found in that group, it will be fetched automatically and added to
the file. The layout of the files are
julia> fid = h5open("/projects/EKOLEMEN/d3dloader/test/172337.h5", "r")
ποΈ HDF5.File: (read-only) /projects/EKOLEMEN/d3dloader/test/172337.h5
ββ π ali
β ββ π·οΈ origin
β ββ π’ xdata
β β ββ π·οΈ xunits
β ββ π’ zdata
β ββ π·οΈ zunits
ββ π doutu
β ββ π·οΈ origin
β ββ π’ xdata
β β ββ π·οΈ xunits
β ββ π’ zdata
β ββ π·οΈ zunits
ββ π dssdenest
β ββ π·οΈ origin
β ββ π’ xdata
β ββ π’ zdata
ββ π dstdenp
β ββ π·οΈ origin
β ββ π’ xdata
β ββ π’ zdata
...
Each group is named after the shortname signal and contains xdata
, zdata
nodes returned from gadata.py
.
Additionally, the origin of the data is stored in the origin
attribute, and the units of the signals
are stored in the xunits
and zunits
attributes, when available.
Additionally, a device
can be passed. If set to be the GPU, the predictor
and
target
tensors will be stored on the GPU. This avoids having to load batches into GPU memory in the
training data loop.
To fetch data sequences from a single shot, we can instantiate a DataLoader
. This package
implements a RandomBatchSequenceSampler
that fetches linear sequences from both, predictor
and target
tensors of the dataset. RandomBatchSequenceSampler
fetches a batch of sequences,
all of the same length and starting at the same random point. The sample below shows how
D3D_dataset
and RandomBatchSequenceSampler
work together to allow iteration over the sequences:
>>> len(ds)
800
>>> batch_size = 32
>>> seq_length = 512
>>> sampler = BatchedSampler(len(ds), seq_length=seq_length, batch_size=batch_size, shuffle=True)
>>> loader_train = torch.utils.data.DataLoader(my_ds, num_workers=0,
batch_sampler=sampler,
collate_fn = collate_fn_batched)
>>> for x_b, y_b in loader_train:
print(f"x_b.shape={x_b.shape}, y_b.shape={y_b.shape}")
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
x_b.shape=torch.Size([32, 513, 11]), y_b.shape=torch.Size([32, 513, 5])
In each iteration, loader_train
returns a tuple of tensors. The first tensor contains
32=batch_size
sequences of length seq_length+1
. Each sequence has 11
features, which
is the sum of the feature dimension of each predictor
signal. The second tensor has the
same batch size and sequence length, but the feature dimension is 5
, corresponding to
the probability of each Alfven Eigenmode.
The length of the sequences is seq_length+1=513
. Note also, that the predictor
signals are
shifted 10ms
into the future. That is the samples at y_b[:, -1, :]
are 11ms
ahead
of the samples in x_b[:, -2, :]
.
In the example above, collate_fn_batched
reshapes the returned data
from d3d_dataset.__getidx__
into a tuple of 2 tensors. The call semantics are explained in the
code example here.
Note that RandomBatchSequenceSampler
takes care of batching. Automatic batching in the pytorch
DataLoader
needs to be disabled.
The BatchedSampler
can either sample sequences linearly, starting at 0, or shuffle
the starting points of the sequences.
For shuffle=False
, BatchedSampler
generates sequences where the starting index shifts by 1 for
each sequence in a batch. Using a BatchedSampler
in the example above would have
the first sequence start at index 0 and extending to 512. The second sequence
would start at index 1 and extend to 513. And so on.
This is useful for inference, where we may want to predict a target sequentially over the
entire shot. We input samples of all predictors at time index 0...511 into the model and get a
prediction for the target at time index 512. Then we input the predictors at time index
1..512 to predict the target at time index 513. And so in. Thus, by iterating over a
DataLoader which uses a BatchedSampler
, we can easily get a reconstruction of the
predicted target for the entire shot.
Setting shuffle=True
, BatchedSampler
generates sequences that start at a random point
in the dataset. Looking at the code above, the first of the 32 sequences in the batch
may start at index 18 (and extend to index 18+513=631) of the entire shot.
The second sequence may start at index 788 and extend to index 788+513.
So every starting index is at random. This is useful for training.
There are also the versions of the sampler BatchedSampler_dist
. These should
be used for distributed training. They effectively distribute all available samples
in the dataset across ranks.
THIS IS A BIT OUTDATED. THE INTERFACE TO THE MULTISHOT DATASET HAS CHANGED. SEE https://github.com/PPPLDeepLearning/frnn_examples FOR PROPER USAGE.
The class Multishot_dataset
defines a dataset that spans multiple shots. It can be instantiated in
a very similar manner:
>>> from d3d_loaders.d3d_loaders import Multishot_dataset
>>> from d3d_loaders.samplers import BatchedSampler_multi, collate_fn_batched
>>> shot_list_train = [172337, 172339]
>>> tstart = 110.0 # Time of first sample for upper triangularity is 100.0
>>> tend = 2000.0
>>> t_params = {"tstart": tstart, "tend": tend, "tsample": 1.0}
>>> t_shift = 10.0
>>> device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
>>> seq_length = 512
>>> batch_size = 4
>>> pred_list = ["pinj", # Injected power
>>> "tinj", # Injected torque
>>> "iptipp", # Target current
>>> "dstdenp", # Target density
>>> "doutu", # Top triangularity
>>> "dssdenest", # Line-averaged density
>>> "ae_prob"] # AE mode probability
>>> targ_list = ["ae_prob"]
>>> ds_train = Multishot_dataset(shot_list_train, t_params, pred_list, targ_list,
>>> {"ae_prob": t_shift},
>>> datapath="/projects/EKOLEMEN/d3dloader/test", device)
To iterate over a multi-shot database, we can again instantiate a DataLoader using the custom
RandomBatchSequenceSampler_multids
sampler as well as a custom collate_fn
:
>>> for xb, yb in loader_train_b:
>>> print(xb.shape, yb.shape)
>>> break
torch.Size([513, 4, 11]) torch.Size([513, 4, 5])
Again, the tensor tuple xb
and yb
contain a batch of random sequences. But the sequences are picked
at random from the list of shots used to construct the Multishot_dataset
. All sequences are constructed
from data of only a single shot.
MDS and PTDATA can be downloaded with the script downloading.py
. This script fetches data for
a given shot from D3D's MDSplus server and stores in in a HDF5 file.
The current predictors are:
Predictor | Key |
---|---|
AE Mode probabilities | ae_prob |
Pinj | pinj |
Neutron Rate | neutronsrate |
Injected Power | ip |
ECH | ech |
q95 | q95 |
kappa | kappa |
Density Profile | dens |
Pressure Profile | pres |
Temperature Profile | temp |
q Profile | q |
Lower Triangularity | doutl |
Upper Triangularity | doutl |
Raw ECE Channels | raw_ece |
Raw CO2 dp Channels | raw_co2_dp |
Raw CO2 pl Channels | raw_co2_pl |
Raw MPI Channels | raw_mpi |
Raw BES Channels | raw_bes |
Bill's AE Labels (+-250ms window) | uci_label |
NOTE: the shape signal of kappa and shape profiles of upper and lower triangularity don't have data even though
the keys exist in the hdf5 files. You should get a helpful error telling you this, however you may also get
an error telling you of an array splicing error. Also when these signals do exist, the first signal is very
late, so make sure tstart
is sufficiently large.
At a given index, the output is a list of 2 tensors.
In [119]: my_ds[0]
Out[119]:
(tensor([-0.5374, -0.4721, -0.5583, -0.5301, -0.4714, -1.1228, -1.7159]),
tensor([-0.0207, -0.0328, -0.0478, -0.0331, -0.0119]))
The first tensor has a number of elements fixed by the number of signals being used.
The order will match the order of the signals in predictors
.
Each of these signals is normalized to their separate mean and standard deviation.
Currently, the only target is ae_prob_delta
, calculated using AE probabilities from Aza's RCN model.
The output tensor (2nd tensor) is the change in the Alfven Eigenmode probabilities over the interval given in
shift_targets
with the ae_prob_delta
key,
i.e. AE mode probability at t0 + 10ms as calculated using ECE data. This output is also
normalized to zero mean and unit standard deviation.
Iteration over the dataset is done in pytorch-style:
for i in torch.utils.data.DataLoader(my_ds):
# use i
An example of how to use this dataloader for predictive modelling is in 'runme.py'
By setting tsample=-1
, the full resolution will return with the tstart
and tend
found closest to the true measurement values (forwards or backwards in time). This will
not load as a dataset since you cannot temporally align the full resolution signals.
An example for loading neutron rate into a numpy array is below
from d3d_loaders.signal0d import signal_neut
shotnr = 169113
t_params = {'tstart' : tstart,
'tend' : tend,
'tsample': -1
}
sig_neut = signal_neut(shotnr, t_params)
data = sig_neut.data.numpy()
Training examples can be found in this repository. The dataset these examples operate on can be found here.
The folder examples
contains several notebooks that illustrate how to train predictive
models using the infrastructure provided in this package. In particular, the notebook
AE_pred_LSTMtest_1723xx.ipynb
shows how to train a LSTM-based model for predictive
tasks. And the nodebook AE_pred_transformer_154xxx.ipynb
shows how to train a transformer-based
model for predictive tasks. Both notebooks rely on the dataset and dataloader infrastructure
implemented in this package.
The datasets used by the notebooks are located on Princeton's gpfs, accessible on
stellar
and traverse
: stellar:/projects/EKOLEMEN/AE_datasets
. They have been compiled
using the d3d_signals package. Definitions of
the datasets in yaml
format is here.
There is an additional README
with notes on issues that arose when compiling the dataset.