DataSet::read over 2x slower than DataSet::read_raw for Eigen::Matrix #1051
-
I discovered that for `Eigen::Matrix`, `DataSet::read` is over 2x slower than `DataSet::read_raw`. I created a minimal example that reads a file with 3 rows and 3e8 columns, and another with 3e8 rows and 3 columns; `create_input.py` writes the input files. On my laptop, `DataSet::read` takes more than twice as long as `DataSet::read_raw`.
Both MATLAB and HDF5.jl reverse the dimensions. Would this be a better default behavior for HighFive?
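For reference, a minimal sketch of the reading side of such a benchmark, assuming HighFive 3.x with its Eigen support header (file and dataset names are illustrative):

```cpp
#include <highfive/highfive.hpp>
#include <highfive/eigen.hpp>
#include <Eigen/Dense>

int main() {
    HighFive::File file("wide.h5", HighFive::File::ReadOnly);
    HighFive::DataSet dset = file.getDataSet("data");

    // Variant 1: DataSet::read deduces the dataset shape and fills the
    // (column-major) Eigen matrix, transposing from HDF5's row-major
    // on-disk layout.
    Eigen::MatrixXd a;
    dset.read(a);

    // Variant 2: DataSet::read_raw copies the bytes verbatim into the
    // caller's buffer; no transpose happens, so element (i, j) of the
    // file lands at (j, i) of a column-major matrix unless the caller
    // reverses the dimensions or uses a row-major matrix.
    auto dims = dset.getDimensions();
    Eigen::MatrixXd b(dims[0], dims[1]);
    dset.read_raw(b.data());
}
```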
Replies: 3 comments
-
Thank you for the suggestion. There will be several answers (which can be discussed separately).

This answer explains the advantages of the current choice. When writing the array to disk, the information that it used to be an `Eigen::Matrix` is lost. For this to work, it needs to be implicitly understood which element corresponds to a particular row and column of the array. The only convention I'm familiar with is that element `(i, j)` of the array is stored as element `(i, j)` of the HDF5 dataset. The provided example demonstrates the issue quite nicely: when writing an array with 3 rows and 3e8 columns, the dataset on disk should have shape `[3, 3e8]`, not the reversed `[3e8, 3]`.

What's nice about the current convention is that it's independent of the container used to store the array: it works for `Eigen::Matrix`, `boost::multi_array`, nested `std::vector`s, and so on. Same point once more: the format on disk is independent of how the matrix happens to be arranged in RAM. (Personally, I find this property very valuable.)
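A minimal sketch of that property, assuming HighFive 3.x (file and dataset names are illustrative): writing the same logical matrix through a column-major and a row-major Eigen type should produce identical `[3, 5]` datasets.

```cpp
#include <highfive/highfive.hpp>
#include <highfive/eigen.hpp>
#include <Eigen/Dense>

int main() {
    using ColMat = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
    using RowMat = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

    ColMat c = ColMat::Random(3, 5);
    RowMat r = c;  // same logical values, different layout in RAM

    HighFive::File file("convention.h5", HighFive::File::Truncate);
    file.createDataSet("col_major", c);
    file.createDataSet("row_major", r);

    // Under the row/column convention, both datasets have shape [3, 5]
    // and identical contents on disk, even though the two matrices are
    // laid out differently in memory.
}
```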
-
This answer is related to API stability.

The proposed change can easily go unnoticed, e.g. in square matrices, but also in cases where one doesn't know the shape via an independent source of truth. By simply looking at the file, it's impossible to know if an `[n, m]` array is genuinely `[n, m]` or actually `[m, n]` stored column-major.

Therefore, this change is quite error-prone. It's also extremely hard to recover from, because one can't know if a particular file was written with `highfive<=3.0.0-rc1` or not. Hence one can't write code to mitigate the change, making it impossible to read old files with new HighFive correctly in a transparent way.

Hence, given the state of HighFive, personally, I'm against making this type of breaking change.
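A sketch of the ambiguity (dataset name and shape are illustrative): a file carries only the extents, so nothing in it distinguishes a genuine `[2, 3]` array from a `[3, 2]` array written by a dimension-reversing writer.

```cpp
#include <highfive/highfive.hpp>
#include <highfive/eigen.hpp>
#include <Eigen/Dense>
#include <iostream>

int main() {
    {
        HighFive::File file("ambiguous.h5", HighFive::File::Truncate);
        Eigen::MatrixXd m = Eigen::MatrixXd::Random(2, 3);
        file.createDataSet("data", m);
    }

    // All a reader can observe are the extents, e.g. {2, 3}. The file
    // does not record whether the writer stored (rows, cols) or
    // (cols, rows), so after a convention change the same bytes would
    // silently decode to the transposed matrix.
    HighFive::File file("ambiguous.h5", HighFive::File::ReadOnly);
    for (auto d : file.getDataSet("data").getDimensions())
        std::cout << d << "\n";
}
```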
-
Third answer, about the performance impact: it's (somewhat) surprising that the impact is this big. "Surprising" because RAM/NVMe speeds would suggest that the impact should be smaller. "Somewhat" because we're very naive about how we transpose arrays. Hence, something we can do is look at optimizing the transpose (for special cases), as in the sketch below.
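As one example of such a special-case optimization (a sketch, not HighFive's actual internals; the function name and tile size are illustrative), a cache-blocked transpose from the row-major buffer HDF5 returns into a column-major, Eigen-style destination:

```cpp
#include <algorithm>
#include <cstddef>

// Transpose a row-major [n_rows, n_cols] source buffer into a
// column-major destination of the same logical shape, processing
// B x B tiles so both buffers are touched with good cache locality.
void blocked_transpose(const double* src, double* dst,
                       std::size_t n_rows, std::size_t n_cols) {
    constexpr std::size_t B = 64;  // tile size; worth tuning
    for (std::size_t i0 = 0; i0 < n_rows; i0 += B)
        for (std::size_t j0 = 0; j0 < n_cols; j0 += B)
            for (std::size_t i = i0; i < std::min(i0 + B, n_rows); ++i)
                for (std::size_t j = j0; j < std::min(j0 + B, n_cols); ++j)
                    dst[j * n_rows + i] = src[i * n_cols + j];
}
```

For the very skinny shapes in the example (3 x 3e8 and 3e8 x 3), dedicated loops over the short dimension would likely help even more than generic tiling.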