Reading and (pre)processing speed

MaxvandenBoom edited this page Nov 19, 2023 · 1 revision

The time it takes to read data accounts for a significant share of the total analysis time. Several factors influence reading performance. Besides straightforward ones such as the speed of the physical drives (i.e. hard drives and solid-state drives) and the transfer speed of the hardware (e.g. the type of connection), there are several situation-dependent factors.

  • Fragmented vs continuous: One situational factor is which data need to be read: multiple fragmented pieces of data windowed around trial onsets, a single channel, or the entire dataset. Even when only pieces of data are required, more data might need to be read for preprocessing, for example when an entire channel needs to be high-pass filtered before epoching.

  • Reloading: When memory is limited and processing is set to optimize for memory, channel data might need to be read multiple times so that data can be unloaded from memory between preprocessing steps.

  • Virtual mapping: Another factor that adds to the complexity of reading data from a physical drive is the mapping of data into virtual memory. The advantage of virtual memory is that once data are mapped, retrieval is relatively fast. However, virtual memory is managed by the operating system, and mapping often happens automatically (e.g. when a file is accessed frequently). While the app’s reading routines can be told to explicitly leverage virtual memory, the effect on performance still depends on whether (segments of) the data are loaded more than once and on how well virtual-memory management is implemented by the OS.
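To make the fragmented-versus-continuous and virtual-mapping factors concrete, here is a minimal sketch using NumPy’s `np.memmap` on a hypothetical raw float32 recording in a multiplexed (sample-major) layout; the file name, shape, and layout are assumptions for illustration only, not the app’s actual I/O code:

```python
import os
import tempfile

import numpy as np

# Hypothetical raw recording: 10000 samples x 4 channels, float32,
# multiplexed (sample-major) layout -- an assumption for illustration.
n_samples, n_channels = 10000, 4
data = np.arange(n_samples * n_channels, dtype=np.float32).reshape(n_samples, n_channels)

path = os.path.join(tempfile.mkdtemp(), "recording.dat")
data.tofile(path)

# Map the file into virtual memory; no samples are read from disk yet.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n_samples, n_channels))

# Fragmented access: window 100 samples around each trial onset; only the
# touched pages are pulled in.
onsets = [500, 2500, 7000]
epochs = np.stack([np.asarray(mm[o:o + 100, :]) for o in onsets])
print(epochs.shape)   # (3, 100, 4)

# Continuous access: an entire channel is a strided read across the whole
# file, touching every sample-major frame.
chan0 = np.asarray(mm[:, 0])
print(chan0.shape)    # (10000,)
```

Whether the memory-mapped variant wins in practice depends on exactly the factors listed above: how much of the file is touched, whether it is touched more than once, and the OS’s page-cache behavior.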

To determine the optimal settings for reading and (pre)processing, we benchmarked the low-level reading routines. Fig xx shows the resulting speeds for the EDF and BrainVision formats. Because it is unlikely that one specific dataset is analyzed repeatedly, we assume that the app will be confronted with a different dataset on every run, and we therefore focus on the uncached (non-virtual-memory-mapped) results.

FileIO functions

This section ...

BrainVision (Multiplexed)

  • TODO: add file details

Fig X. Read performance on BrainVision (Multiplexed) file format (TODO: add windows and mac system specifications)

BrainVision (Vectorized)

  • TODO: add file details

Fig X. Read performance on BrainVision (Vectorized) file format (TODO: add windows and mac system specifications)

European Data Format

  • TODO: add file details

Fig X. Read performance on the European Data Format (TODO: add windows and mac system specifications)

Conclusions

  • BrainVision multiplexed and EDF have to make "jumps" through the file to read the data of an entire channel and reorder it into a consecutive array. The methods that read per (condition) trial are therefore faster in all conditions:

    • both non-preloaded and preloaded
    • both epoch and epoch & average
    • both uncached and cached
  • For BrainVision vectorized:

    • epoch & average: by (condition) trial is faster in every condition
    • epoch only:
      • if not preloaded: by channel is faster (not exactly sure why; perhaps check on Mac)
      • if preloaded: by trial, as it does not make a significant difference
  • For MEF3:

    • since the format is aimed at long stretches of data, there might be something to say in favor of not going by channel
    • epoch & average: by (condition) trial seems faster on Windows 11 (only not on Windows 7)
    • epoch only: some benefit to going by channel (only when not preloaded and uncached), but the argument of longitudinal recordings gives preference to by (condition) trial
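The "jumps" behind the first conclusion can be sketched by writing the same recording twice, once multiplexed (sample-major, as in BrainVision multiplexed and EDF data records) and once vectorized (channel-major); the file names and sizes are illustrative only:

```python
import os
import tempfile

import numpy as np

# Illustrative sketch, not the app's actual I/O code: the same 8-channel
# recording stored in a multiplexed versus a vectorized layout.
n_channels, n_samples = 8, 5000
rec = np.random.rand(n_channels, n_samples).astype(np.float32)

tmp = tempfile.mkdtemp()
mux_path = os.path.join(tmp, "mux.dat")
vec_path = os.path.join(tmp, "vec.dat")
rec.T.tofile(mux_path)   # multiplexed: ch0..ch7 of sample 0, ch0..ch7 of sample 1, ...
rec.tofile(vec_path)     # vectorized: all of ch0, then all of ch1, ...

mux = np.memmap(mux_path, dtype=np.float32, mode="r", shape=(n_samples, n_channels))
vec = np.memmap(vec_path, dtype=np.float32, mode="r", shape=(n_channels, n_samples))

# Same values either way, but reading channel 3 from the multiplexed file is
# a strided gather (a "jump" every frame), while from the vectorized file it
# is one contiguous block.
assert np.array_equal(np.asarray(mux[:, 3]), np.asarray(vec[3]))
print(mux[:, 3].strides[0], vec[3].strides[0])  # 32 4 (bytes per step)
```

A per-trial window in the multiplexed file, by contrast, is a contiguous run of frames, which is consistent with per-(condition)-trial reads being faster for these formats.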

Epoch and average functions

This section ...

Epoch only

Fig X. Windows 11 <more specs, 64Gb> - Python 3.11 - few samples

Epoch and average

Fig X. Windows 11 <more specs, 64Gb> - Python 3.11 - 10 samples
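For reference, a generic sketch of what an epoch-and-average routine computes (assumed behavior for a single channel; the app’s actual implementation, parameter names, and windowing may differ):

```python
import numpy as np

def epoch_and_average(channel, onsets, pre, post):
    """Cut windows [onset - pre, onset + post) around each trial onset and
    average them -- a generic sketch, not the app's benchmarked routine."""
    epochs = np.stack([channel[o - pre:o + post] for o in onsets])
    return epochs.mean(axis=0)

rng = np.random.default_rng(0)
channel = rng.standard_normal(10_000).astype(np.float32)
avg = epoch_and_average(channel, onsets=[1000, 3000, 5000], pre=100, post=400)
print(avg.shape)   # (500,)
```

The epoch-only variant would simply return the stacked `epochs` array without the final `mean`, which is why the two are benchmarked separately above.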

Conclusions

...