Parquet regression tests #1091
Conversation
Interestingly this pandas-dev/pandas#34777 (comment) seems to work (and round trip) perfectly well for integer columns. |
I actually found the mentioned GitHub issue at the time, which is what made me update to pandas 1.2. But their example is still using string column names. I think that, somewhat confusingly, was the error raised when saving a multi index column. |
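For reference, a minimal sketch of the behaviour being discussed, assuming pandas >= 1.2 with pyarrow installed (file names are illustrative, and the exact error text may vary with the pandas version):

```python
import pandas as pd

# Integer column labels round trip fine here, as reported above.
df = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})
df.to_parquet("ints.parquet")
print(pd.read_parquet("ints.parquet").columns)  # Int64Index([0, 1], ...)

# MultiIndex columns, however, are rejected by the parquet writer.
mdf = pd.DataFrame(
    [[1.0, 2.0]], columns=pd.MultiIndex.from_tuples([("a", 0), ("a", 1)])
)
try:
    mdf.to_parquet("multi.parquet")
except ValueError as err:
    print(err)  # the "string column names" error mentioned above
```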
It should pass now... |
Oh yeah, it won't pass until we merge that report engine PR, but this passes locally for me |
I guess this is the reason the bot is failing as well... https://github.com/NNPDF/nnpdf/pull/1084/checks?check_run_id=1910677239#step:12:279 ? |
Yes that error message is very likely from here, but what does it have to do with 1084? |
Blocked by NNPDF/reportengine#40 |
@RosalynLP @siranipour I have been looking a bit into this. I think my original objection to hdf5, namely that it is a cumbersome old heavyweight dependency, is somewhat weakened by the fact that parquet (with apache arrow) is also incredibly heavy and complicated (for that reason at some point reportengine was depending on boost!). So that cannot be a point in favour of parquet. In any case, I'd prefer it if these were all optional dependencies of reportengine. It is also true that hdf5 is documented as round tripping with pandas, which is clearly an advantage. What I am not so sure about is the speed and filesize. E.g. the numbers here https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d are rather more favourable for parquet, and we want this to be used for fktables eventually. See e.g. #404 (comment). Could we get some numbers on filesizes and speeds for, say, the biggest fktable? |
@Zaharid I also saw this article, and I thought that although it's not ideal it's at least better than csv... although I forgot about the FK tables |
Note that there is this https://flit.readthedocs.io/en/latest/cmdline.html#cmdoption-flit-install-extras |
Sure, I can take care of benchmarking the two, if that's okay with you @RosalynLP? |
@siranipour thank you so much :) Lmk if you want help |
Ok, I've done this benchmark for writing and reading the biggest fktable. It looks like hdf is the quickest both in terms of read and write, while parquet is the most efficient in terms of storage (which would be nice for theory downloading), with hdf showing no real improvement. Note that for parquet we need to convert the columns to strings first. I also attach the Jupyter notebook. |
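A benchmark along these lines can be sketched as below. This is not the attached notebook: it uses a synthetic table, and assumes pytables is installed for hdf and pyarrow for parquet; file names are illustrative.

```python
import os
import time

import numpy as np
import pandas as pd

# A large float table with string column labels, so both writers accept it.
df = pd.DataFrame(
    np.random.rand(1_000_000, 10), columns=[str(i) for i in range(10)]
)

for path, write, read in [
    ("bench.h5", lambda p: df.to_hdf(p, key="table", mode="w"), pd.read_hdf),
    ("bench.parquet", lambda p: df.to_parquet(p), pd.read_parquet),
]:
    start = time.perf_counter()
    write(path)
    written = time.perf_counter()
    read(path)
    done = time.perf_counter()
    print(
        f"{path}: write {written - start:.2f}s, "
        f"read {done - written:.2f}s, "
        f"size {os.path.getsize(path) / 1e6:.1f} MB"
    )
```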
Thanks for this @siranipour! So following this I would still be in favour of hdf, when you combine this with how easily it roundtrips. I was a little confused by what you meant by "no real improvement" though - could you clarify? What is the comparison you mean? Edit: Sorry of course you mean compared to the current file size, ignore me! |
Thanks @siranipour! I think write times do not matter at all, read times matter a lot and small files are rather important as well. So we have 30% improvement in read time and some 40% in file size in exchange for the annoyance of dealing with non-round tripping. |
Yeah, I found while benchmarking that the annoying round trip thing is mostly a one line fix, i.e. just converting the column labels. Also I noticed pandas themselves use parquet for their large data tutorials: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#load-less-data. I'm happy with either at this point |
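The one-line fix being referred to is presumably the column stringification discussed above; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0]], columns=[0, 1])

# The one-line conversion: stringify the column labels, so the
# parquet writer accepts the dataframe.
df.columns = df.columns.astype(str)
df.to_parquet("table.parquet")
```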
Nice, in that case we might as well go with parquet, right? |
Broke arclengths covmatreg and rebuilddata
This makes sense, it's nice to simplify these things using parquet... we just have to be careful where we've made the columns strings... I guess it's not that important here because we're only testing, but elsewhere we have to check that all dataframes will have string column labels so that the objects can interact properly.
Co-authored-by: Rosalyn Pearson <[email protected]>
The trouble at the moment is if you have a runcard like

```yaml
# runcard.yaml
dataset_inputs:
  - {dataset: NMCPD, frac: 0.75}
  - {dataset: NMC, frac: 0.75}
  - {dataset: SLACP, frac: 0.75}
  - {dataset: SLACD, frac: 0.75}
theoryid: 53
use_cuts: nocuts
template_text: |
  {@groups_covmat@}
actions_:
  - report(main=True)
```

and run it. But as you say, it's fine for testing currently; should we fix this? |
I think maybe we can fix it in a separate PR, but we should make that next PR a matter of priority. |
What shall we do with this PR? |
Work on it probably. |
As in shall we keep the parquet format just for the regression tests? |
I would like to export tables in parquet format. |
Working on the "destroyingc++" tag, I would actually want to read the current fktables with parquet as well, because the difference in read time is significant. Which I guess would mean exporting them in parquet format first from the current csv? Is that a possibility? |
Yes, I would very much like that. Reading from parquet is faster than cpp. One would need to add an easy step to apfelcomb and also do it for all the theories. An option would be that validphys does it locally... Incidentally, as discussed elsewhere, I think that PineAPPL grids should also be stored in some well known format with lots of compatibility. |
I guess it is better to add the option. A small script is fine I guess.
Maybe @cschwan can comment. For the time being I will be using whatever parser is already provided. |
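For illustration, the small conversion script mentioned above might look roughly like this. The file names and the `read_csv` options are placeholders; the real fktable layout would dictate the parsing.

```python
import pandas as pd

# Hypothetical one-off conversion of a csv fktable to parquet;
# "fktable.csv" and index_col are placeholders for the real layout.
df = pd.read_csv("fktable.csv", index_col=[0, 1, 2])

# Parquet needs string column labels (see the discussion above).
df.columns = df.columns.astype(str)
df.to_parquet("fktable.parquet")
```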
PineAPPL grids don't have a dedicated file format; instead serde is used to (de)serialize the Grid struct. |
I think serde and afaict bincode are in the same category as pickle: way too tied to the (current version of the) implementation and difficult to interface with from external codes, so good for network exchange but not so much for storing things permanently. Ideally a person would be able to read and write PineAPPL grids without the rust library, and the format would contain as much information as possible. |
I'm not sure what you mean by 'way too tied', but you can easily replace bincode with JSON, for instance, by changing the code here; in particular, you have to change the two lines containing the bincode calls. The fact you can't write a PineAPPL grid without its (Python/Rust/C/...) library is a feature in my opinion, because generating the contents is non-trivial. I would advise against it, especially since we now have a Python library that doesn't depend on anything in Rust. Maybe @alecandido can convince you how nice it is 😃! |
There are all the use cases that you didn't think of that would benefit from being able to produce or edit these things in various ways. For example, if I had needed to modify the LHAPDF C++ code to add some function to store more metadata (and recompile it and distribute it and so on), then I am pretty sure the alphas fit would never have happened. |
That's a good point. To support this I can add an option for it. |
Once more we are talking of multidimensional data (pineappl indeed has a bunch of dimensions shared with eko), so I strongly advise against the use of Parquet: it is only meaningful for storing tables and nothing else; otherwise you're just going to implement a parser on top of Parquet that is in no way better than binary. If you (@Zaharid) are thinking about metadata (like in the case of lhapdf), that can already be stored inside a PineAPPL grid, and it is most likely the only meaningful thing that you can really edit by hand. If a standard format is needed for multidimensional data the only good one is NetCDF, which might be more appropriate for a PineAPPL grid than for eko; but still, the structure of a Grid is more involved than just a 6-dimensional array, so I wonder if it's worth it at all (Subgrids can have a very different structure; if you convert all of them to a dense LagrangeSubgrid then it might be fine). |
Re parquet: this would be just a workaround for the slow current fktables. To discuss the "new" fktables maybe you want to open a new issue. I just mentioned it here because if it is used for the regression tests it might as well be used for the (again, current) fktables. |
Yes, we should open a new issue indeed. We could look into other things, but I believe the issue of multidimensional data is a bit of a red herring: if you want to store more than two dimensions, pick two for the table and use a multiindex for the rest. Indeed, as far as I remember, the current fktables do something like that already. |
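A minimal sketch of that multiindex trick, with made-up dimension names and shapes:

```python
import numpy as np
import pandas as pd

# Store a 4-dimensional array as a 2D table: keep one axis as columns
# and fold the remaining three into a row MultiIndex (names made up).
arr = np.random.rand(3, 4, 5, 2)
index = pd.MultiIndex.from_product(
    [range(3), range(4), range(5)], names=["data", "x1", "x2"]
)
df = pd.DataFrame(arr.reshape(-1, 2), index=index, columns=["a", "b"])

df.to_parquet("multidim.parquet")
loaded = pd.read_parquet("multidim.parquet")

# The row MultiIndex survives the round trip, so the original array
# can be recovered from the flat table.
assert np.allclose(loaded.to_numpy().reshape(3, 4, 5, 2), arr)
```

Note that only the column labels need to be strings; a row MultiIndex is preserved by the parquet writer.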
Since we have moved away from the C++ reading (and new fktables are now pineappl) I think we should close this. In addition, the regression tests have changed a few times in the last two years. (Also, I believe having the regression tests in a human readable / github diff-able format is a feature.) |
Relies on: NNPDF/reportengine#40
This doesn't work, but I do need help. So it turns out parquet can't save columns that are not `dtype=str` (which is problematic e.g. when saving covmats). This is not an issue: I've just made it change to `str` before writing to disk on the reportengine side. The issue comes on the vp side of testing. When it makes the comparison between dataframes, it loads the parquet file, which has columns of type `Int64Index`, while the loaded df has columns of type `Index`. The assertion then fails because it compares columns. Any ideas?
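For reference, a minimal reproduction of the column-type mismatch described above, assuming pandas 1.x (where `Int64Index` still exists; it was removed in pandas 2.0) and pyarrow; the file name is illustrative:

```python
import pandas as pd

# The freshly computed dataframe has integer column labels...
computed = pd.DataFrame([[1.0, 2.0]], columns=[0, 1])
print(type(computed.columns))  # Int64Index on pandas 1.x

# ...but it is written with stringified columns (the reportengine-side fix)...
to_write = computed.copy()
to_write.columns = to_write.columns.astype(str)
to_write.to_parquet("regression.parquet")

# ...so the file comes back with a plain object-dtype Index, and a naive
# frame comparison fails on the columns.
loaded = pd.read_parquet("regression.parquet")
print(type(loaded.columns))  # Index
try:
    pd.testing.assert_frame_equal(computed, loaded)
except AssertionError as err:
    print(err)  # "DataFrame.columns are different" or similar
```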