Parquet regression tests #1091

Closed · wants to merge 10 commits

Conversation

@siranipour (Contributor)

Relies on: NNPDF/reportengine#40

This doesn't work yet, and I do need help. It turns out parquet can't save columns that are not dtype=str (which is problematic, e.g. when saving covmats). That by itself is not an issue: I've just made it convert the columns to str before writing to disk on the reportengine side.

The issue comes on the vp side of the testing. When the comparison between the dataframes is made, one of them ends up with columns of type Int64Index while the other has columns of type Index.

The assertion then fails because it compares the columns. Any ideas?
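
For concreteness, a minimal sketch of the kind of mismatch being described (not the actual regression-test code; the file name and the toy frame are made up, and pyarrow is assumed to be available):

```python
import pandas as pd

# A toy frame with integer column labels, standing in for a covmat.
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=[0, 1])

# Parquet refuses non-string column names, so cast them to str before
# writing (this is what the reportengine side now does).
df.set_axis(df.columns.astype(str), axis=1).to_parquet("table.parquet")

loaded = pd.read_parquet("table.parquet")

# The regenerated frame still has integer columns, the loaded one has
# string columns, so a strict frame comparison fails on the column index.
try:
    pd.testing.assert_frame_equal(df, loaded)
except AssertionError as err:
    print(err)

# One possible way out for the test: normalise both sides to string
# column labels before comparing.
pd.testing.assert_frame_equal(
    df.set_axis(df.columns.astype(str), axis=1), loaded
)
```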

@siranipour marked this pull request as draft February 10, 2021 19:37
@Zaharid (Contributor) commented Feb 11, 2021

Interestingly this

pandas-dev/pandas#34777 (comment)

seems to work (and round trip) perfectly well for integer columns.

@siranipour (Contributor, Author)

I actually found the mentioned GitHub issue at the time which is what made me update to pandas 1.2.

But their example still uses string column names. I think that was the error that was raised, somewhat confusingly, when saving a multi-index column.

@siranipour (Contributor, Author)

It should pass now...

@siranipour (Contributor, Author)

Oh yeah, it won't pass until we merge that reportengine PR, but it passes locally for me.

@scarlehoff (Member) commented Feb 16, 2021

I guess this is also the reason the bot is failing... https://github.com/NNPDF/nnpdf/pull/1084/checks?check_run_id=1910677239#step:12:279 ?

@siranipour (Contributor, Author)

Yes, that error message is very likely from here, but what does it have to do with #1084?

@siranipour (Contributor, Author)

Blocked by NNPDF/reportengine#40

@Zaharid (Contributor) commented Mar 18, 2021

@RosalynLP @siranipour I have been looking a bit into this. I think my original objection about hdf5, namely that it is a cumbersome, old, heavyweight dependency, is somewhat weakened by the fact that parquet (with Apache Arrow) is also incredibly heavy and complicated (for that reason at some point reportengine was depending on boost!). So that cannot be a point in favour of parquet. In any case, I'd prefer it if these were all optional dependencies of reportengine.

It is also true that hdf5 is documented as round tripping with pandas, which is clearly an advantage.

What I am not so sure about is the speed and filesize. E.g. the numbers here https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d are rather more favourable for parquet, and we want this to be used for fktables eventually. See e.g. #404 (comment). Could we get some numbers on filesizes and speeds for say the biggest fktable?

@RosalynLP (Contributor)

@Zaharid I also saw this article, and I thought that although it's not ideal it's at least better than csv... although I forgot about the FK tables.

@siranipour (Contributor, Author)

> What I am not so sure about is the speed and filesize. E.g. the numbers here https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d are rather more favourable for parquet, and we want this to be used for fktables eventually. See e.g. #404 (comment). Could we get some numbers on filesizes and speeds for say the biggest fktable?

Sure, I can take care of benchmarking the two, if that's okay with you @RosalynLP?

@RosalynLP (Contributor)

> What I am not so sure about is the speed and filesize. E.g. the numbers here https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d are rather more favourable for parquet, and we want this to be used for fktables eventually. See e.g. #404 (comment). Could we get some numbers on filesizes and speeds for say the biggest fktable?

> Sure, I can take care of benchmarking the two, if that's okay with you @RosalynLP?

@siranipour thank you so much :) Lmk if you want help

@siranipour (Contributor, Author)

OK, I've done this benchmark for writing and reading the CMS_1JET_8TEV FKTable, which is the biggest one we have (331M).

format_benchmark.pdf

It looks like hdf is the quickest in terms of both read and write, while parquet is the most efficient in terms of storage (which would be nice for theory downloading), with hdf showing no real improvement there. Note that for parquet we need to call df = reportengine.table.str_columns(df) before saving, although this is not the main bottleneck.

I also attach the Jupyter notebook.

format_benchmark.pdf
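
For reference, a rough sketch of the kind of timing loop such a benchmark boils down to (this is not the attached notebook; the frame is synthetic rather than the real FK table, and pytables and pyarrow are assumed to be installed):

```python
import os
import time

import numpy as np
import pandas as pd

# A synthetic frame standing in for the FK table (the real benchmark used
# the CMS_1JET_8TEV table loaded through validphys).
df = pd.DataFrame(np.random.rand(200_000, 50))

def timed(func):
    """Run func once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

results = {}
for fmt, write, read in [
    ("csv", lambda p: df.to_csv(p), lambda p: pd.read_csv(p)),
    ("hdf", lambda p: df.to_hdf(p, key="table"), lambda p: pd.read_hdf(p, key="table")),
    # parquet needs string column labels, hence the astype(str)
    ("parquet",
     lambda p: df.set_axis(df.columns.astype(str), axis=1).to_parquet(p),
     lambda p: pd.read_parquet(p)),
]:
    path = f"benchmark.{fmt}"
    t_write = timed(lambda: write(path))
    t_read = timed(lambda: read(path))
    results[fmt] = (t_write, t_read, os.path.getsize(path) / 1e6)

for fmt, (t_w, t_r, size) in results.items():
    print(f"{fmt}: write {t_w:.2f}s  read {t_r:.2f}s  size {size:.1f} MB")
```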

@RosalynLP (Contributor) commented Mar 22, 2021

Thanks for this @siranipour! So following this, I would still be in favour of hdf when you combine it with how easily it round trips. I was a little confused by what you meant by "no real improvement" though - could you clarify? What is the comparison you mean?

Edit: Sorry of course you mean compared to the current file size, ignore me!

@Zaharid (Contributor) commented Mar 22, 2021

Thanks @siranipour!

I think write times do not matter at all, read times matter a lot, and small files are rather important as well. So we have a 30% improvement in read time and some 40% in file size, in exchange for the annoyance of dealing with the lack of round tripping.

@siranipour (Contributor, Author)

Yeah, I found while benchmarking that the annoying round-trip thing is mostly a one-line fix, i.e. just put df = str_columns(df) before df.to_parquet. I do quite fancy the idea of fast downloading of theories though.

Also, I noticed pandas themselves use parquet in their large-data tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#load-less-data

I'm happy with either at this point
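
For reference, a guess at what such a str_columns helper might look like (I don't know the actual reportengine implementation; this assumes a flat column index, with MultiIndex columns handled separately):

```python
import pandas as pd

def str_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels are strings,
    so that DataFrame.to_parquet accepts it."""
    return df.set_axis(df.columns.astype(str), axis=1)

# usage, just before writing:
# df = str_columns(df)
# df.to_parquet(path)
```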

@RosalynLP (Contributor)

Nice, in that case we might as well go with parquet, right?

@RosalynLP (Contributor) commented Mar 23, 2021

Also, is it just me, but every time I see the word parquet I can't help the image
[image]
although apparently I should be thinking
[image]

@RosalynLP (Contributor) left a review comment

This makes sense, it's nice to simplify these things using parquet... we just have to be careful where we've made the columns strings... I guess it's not that important here because we're only testing, but elsewhere we have to check that all dataframes will have string column labels so that the objects can interact properly.

conda-recipe/meta.yaml (review thread, outdated, resolved)
Co-authored-by: Rosalyn Pearson <[email protected]>
@siranipour (Contributor, Author) commented Apr 4, 2021

The trouble at the moment is that if you have a @table decorator on one of your providers that returns non-string columns, then it crashes:

```yaml
# runcard.yaml
dataset_inputs:
- {dataset: NMCPD, frac: 0.75}
- {dataset: NMC, frac: 0.75}
- {dataset: SLACP, frac: 0.75}
- {dataset: SLACD, frac: 0.75}

theoryid: 53

use_cuts: nocuts

template_text: |
  {@groups_covmat@}

actions_:
  - report(main=True)
```

and run `validphys runcard.yaml --table-format parquet`

But as you say, it's fine for the tests currently. Should we fix this?
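
One possible shape of a fix, sketched here only as an idea and not as anything decided: coerce every column label, including the levels of a MultiIndex, to strings inside the table writer before calling to_parquet. Whether a MultiIndex column index then round trips losslessly through parquet is exactly what would need checking. The function name below is made up.

```python
import pandas as pd

def stringify_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce column labels to strings so that to_parquet accepts them.
    Sketch only: lossless round tripping of a MultiIndex through parquet
    would still need to be verified."""
    cols = df.columns
    if isinstance(cols, pd.MultiIndex):
        cols = pd.MultiIndex.from_tuples(
            [tuple(str(level) for level in key) for key in cols],
            names=cols.names,
        )
    else:
        cols = cols.astype(str)
    return df.set_axis(cols, axis=1)
```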

@RosalynLP (Contributor) commented Apr 5, 2021

I think maybe we can fix it in a separate PR, but we should make that next PR a matter of priority.

@siranipour (Contributor, Author)

What shall we do with this PR?

@Zaharid (Contributor) commented May 28, 2021

Work on it probably.

@siranipour (Contributor, Author)

As in shall we keep the parquet format just for the regression tests?

@Zaharid (Contributor) commented May 28, 2021

I would like to export tables in parquet format.

@scarlehoff (Member) commented Oct 14, 2021

Working on the "destroyingc++" tag, I would actually want to read the current fktables with parquet as well, because the difference between read_csv and the C++ parser is quite notable...

Which I guess would mean exporting them in parquet format first from the current csv?

Is that a possibility?

@Zaharid (Contributor) commented Oct 14, 2021

Yes, I would very much like that. Reading from parquet is faster than cpp. One would need to add an easy step to apfelcomb and also do it for all the theories. An option would be that validphys does it locally...

Incidentally, as discussed elsewhere, I think that pineappl grids should also be stored in some well-known format with lots of compatibility.

@scarlehoff (Member)

> Yes, I would very much like that. Reading from parquet is faster than cpp. One would need to add an easy step to apfelcomb and also do it for all the theories. An option would be that validphys does it locally...

I guess it is better to add the option. A small script is fine I guess.

> Incidentally, as discussed elsewhere, I think that pineappl grids should also be stored in some well-known format with lots of compatibility.

Maybe @cschwan can comment. For the time being I will be using whatever parser is provided by pineappl so if it is changed internally it should not affect us.

@cschwan (Contributor) commented Oct 15, 2021

> Incidentally, as discussed elsewhere, I think that pineappl grids should also be stored in some well-known format with lots of compatibility.

> Maybe @cschwan can comment. For the time being I will be using whatever parser is provided by pineappl so if it is changed internally it should not affect us.

PineAPPL grids don't have a dedicated file format; instead they use serde to (de)serialize the Grid struct. The actual file format can easily be changed, and there are quite a number of formats that can be used. I'm using bincode to write and read files, which turned out to be the fastest when I tested it; however, neglecting backwards compatibility, we can change it to anything. I'm not quite sure if Parquet is supported, but I found Avro in that list. Is that it or something different?

@Zaharid (Contributor) commented Oct 15, 2021

I think serde and afaict bincode are in the same category as pickle, so way too tied to the (current version of the) implementation and difficult to interface with external codes, so good for network exchange but not so much for storing things permanently. Ideally a person would be able to write and read pineappl grids without the Rust library, and the format would contain as much information as possible.

@cschwan (Contributor) commented Oct 15, 2021

> I think serde and afaict bincode are in the same category as pickle, so way too tied to the (current version of the) implementation and difficult to interface with external codes, so good for network exchange but not so much for storing things permanently. Ideally a person would be able to write and read pineappl grids without the Rust library, and the format would contain as much information as possible.

I'm not sure what you mean by 'way too tied', but you can easily replace bincode with JSON, for instance, by changing the code here:

https://github.com/N3PDF/pineappl/blob/4c931effc276ccdad99600678b0538d9a74a6bfc/pineappl/src/grid.rs#L655-L678

In particular, you have to replace the two lines containing bincode with the corresponding code for JSON. I'd say that shows PineAPPL isn't too tied to bincode.

The fact you can't write a PineAPPL grid without its (Python/Rust/C/...) library is a feature in my opinion, because generating the contents is non-trivial. I would advise against it, especially since we now have a Python library that doesn't depend on anything in Rust. Maybe @alecandido can convince you how nice it is 😃 !

@Zaharid (Contributor) commented Oct 15, 2021

> The fact you can't write a PineAPPL grid without its (Python/Rust/C/...) library is a feature in my opinion, because generating the contents is non-trivial. I would advise against it, especially since we now have a Python library that doesn't depend on anything in Rust. Maybe @alecandido can convince you how nice it is 😃 !

There are all the use cases that you didn't think of that would benefit from being able to produce or edit these things in various ways. For example, if I had needed to modify the LHAPDF C++ code to add some function to store more metadata (and recompile it, distribute it, and so on), then I am pretty sure the alphas fit would never have happened.

@cschwan (Contributor) commented Oct 15, 2021

> There are all the use cases that you didn't think of that would benefit from being able to produce or edit these things in various ways. For example, if I had needed to modify the LHAPDF C++ code to add some function to store more metadata (and recompile it, distribute it, and so on), then I am pretty sure the alphas fit would never have happened.

That's a good point. To support this I can add an export subcommand, which converts a grid into a known format that you can edit, and then also add an import subcommand to produce a 'proper' grid again.

@alecandido (Member)

Once more we are talking about multidimensional data (pineappl indeed has a bunch of dimensions shared with eko), so I strongly advise against the use of Parquet: it is only meaningful for storing tables and nothing else, otherwise you're just going to implement a parser on top of Parquet that is in no way better than binary.

If you (@Zaharid) are thinking about metadata (like in the case of LHAPDF), it can already be stored inside a PineAPPL grid, and that is most likely the only thing that you can meaningfully export to and import from YAML, or even JSON or whatever you prefer.

If a standard format is needed for multidimensional data, the only good one is NetCDF. That might be more appropriate for a PineAPPL grid than for eko, but still, the structure of a Grid is more involved than just a 6-dimensional array, so I wonder if it's worth it at all (Subgrids can have a very different structure; if you convert all of them to a dense LagrangeSubgrid then it might be fine).

@scarlehoff (Member)

Re parquet: this would just be a workaround for the slow current fktables. To discuss the "new" fktables maybe you want to open a new issue.

I just mentioned it here because if it is used for the regression tests it might as well be used for the (again, current) fktables.

@Zaharid (Contributor) commented Oct 15, 2021

Yes, we should open a new issue indeed. We could look into other things, but I believe the issue of multidimensional data is a bit of a red herring: if you want to store more than two dimensions, pick two for the table and use a multiindex for the rest. Indeed, as far as I remember, the current sigma dataframe (which is a multiindexed object) trivially round trips with parquet (which is how I ran the tests showing it was faster).
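
A minimal sketch of the "pick two dimensions for the table and multiindex the rest" idea (the axis and column names are made up and do not correspond to the actual FK table layout; pyarrow is assumed, and my understanding is that a recent pandas/pyarrow restores the row MultiIndex from the parquet metadata):

```python
import numpy as np
import pandas as pd

# A fake 4-dimensional object flattened into a 2-d table:
# three of the dimensions go into a row MultiIndex, one stays as columns.
index = pd.MultiIndex.from_product(
    [range(3), range(4), range(5)], names=["data_point", "x1", "x2"]
)
df = pd.DataFrame(
    np.random.rand(len(index), 2),
    index=index,
    columns=["flavour_0", "flavour_1"],  # string labels, so parquet is happy
)

df.to_parquet("grid.parquet")
roundtrip = pd.read_parquet("grid.parquet")

# The row MultiIndex (including names and level dtypes) comes back.
pd.testing.assert_frame_equal(df, roundtrip)
```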

@scarlehoff marked this pull request as draft November 22, 2021 12:38
@scarlehoff (Member) commented Mar 14, 2023

Since we have moved away from the C++ reading (and new fktables are now pineappl), I think we should close this. In addition, the regression tests have changed a few times in the last two years.

(Also, I believe having the regression tests in human-readable / GitHub-diffable csv is actually very useful.)

@scarlehoff closed this Mar 14, 2023
@scarlehoff deleted the parquet branch March 14, 2023 15:13