Parquet regression tests #1091

Closed · wants to merge 10 commits

Conversation

@siranipour (Contributor)

Relies on: NNPDF/reportengine#40

This doesn't work yet, and I do need help. It turns out parquet can't save columns that are not dtype=str (which is problematic, e.g. when saving covmats). That by itself is not an issue: I've just made it convert the columns to str before writing to disk on the reportengine side.

The issue comes on the vp side of the testing. When the comparison between the dataframes is made, one of them ends up with columns of type Int64Index while the other has columns of type Index.

The assertion then fails because it compares the columns. Any ideas?
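
For concreteness, a minimal sketch of the kind of mismatch being described (not the actual regression-test code; the file name and the toy frame are made up, and pyarrow is assumed to be available):

```python
import pandas as pd

# A toy frame with integer column labels, standing in for a covmat.
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=[0, 1])

# Parquet refuses non-string column names, so cast them to str before
# writing (this is what the reportengine side now does).
df.set_axis(df.columns.astype(str), axis=1).to_parquet("table.parquet")

loaded = pd.read_parquet("table.parquet")

# The regenerated frame still has integer columns, the loaded one has
# string columns, so a strict frame comparison fails on the column index.
try:
    pd.testing.assert_frame_equal(df, loaded)
except AssertionError as err:
    print(err)

# One possible way out for the test: normalise both sides to string
# column labels before comparing.
pd.testing.assert_frame_equal(
    df.set_axis(df.columns.astype(str), axis=1), loaded
)
```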

@siranipour marked this pull request as draft February 10, 2021 19:37
@Zaharid (Contributor) commented Feb 11, 2021

Interestingly this

pandas-dev/pandas#34777 (comment)

seems to work (and round trip) perfectly well for integer columns.

@siranipour (Contributor, Author)

I actually found the mentioned GitHub issue at the time which is what made me update to pandas 1.2.

But their example still uses string column names. I think that was the error that was raised, somewhat confusingly, when saving a multi-index column.

@siranipour (Contributor, Author)

It should pass now...

@siranipour (Contributor, Author)

Oh yeah, it won't pass until we merge that reportengine PR, but it passes locally for me.

@scarlehoff (Member) commented Feb 16, 2021

I guess this is also the reason the bot is failing... https://github.com/NNPDF/nnpdf/pull/1084/checks?check_run_id=1910677239#step:12:279 ?

@siranipour (Contributor, Author)

Yes, that error message is very likely from here, but what does it have to do with #1084?

@siranipour (Contributor, Author)

Blocked by NNPDF/reportengine#40

@Zaharid (Contributor) commented Mar 18, 2021

@RosalynLP @siranipour I have been looking a bit into this. I think my original objection about hdf5, namely that it is a cumbersome, old, heavyweight dependency, is somewhat weakened by the fact that parquet (with Apache Arrow) is also incredibly heavy and complicated (for that reason at some point reportengine was depending on boost!). So that cannot be a point in favour of parquet. In any case, I'd prefer it if these were all optional dependencies of reportengine.

It is also true that hdf5 is documented as round tripping with pandas, which is clearly an advantage.

What I am not so sure about is the speed and filesize. E.g. the numbers here https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d are rather more favourable for parquet, and we want this to be used for fktables eventually. See e.g. #404 (comment). Could we get some numbers on filesizes and speeds for say the biggest fktable?

@RosalynLP (Contributor)

@Zaharid I also saw this article, and I thought that although it's not ideal it's at least better than csv... although I forgot about the FK tables.

@siranipour (Contributor, Author)

> What I am not so sure about is the speed and filesize. E.g. the numbers here https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d are rather more favourable for parquet, and we want this to be used for fktables eventually. See e.g. #404 (comment). Could we get some numbers on filesizes and speeds for say the biggest fktable?

Sure, I can take care of benchmarking the two, if that's okay with you @RosalynLP?

@RosalynLP (Contributor)

> What I am not so sure about is the speed and filesize. E.g. the numbers here https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d are rather more favourable for parquet, and we want this to be used for fktables eventually. See e.g. #404 (comment). Could we get some numbers on filesizes and speeds for say the biggest fktable?

> Sure, I can take care of benchmarking the two, if that's okay with you @RosalynLP?

@siranipour thank you so much :) Lmk if you want help

@siranipour (Contributor, Author)

OK, I've done this benchmark for writing and reading the CMS_1JET_8TEV FKTable, which is the biggest one we have (331M).

format_benchmark.pdf

It looks like hdf is the quickest in terms of both read and write, while parquet is the most efficient in terms of storage (which would be nice for theory downloading), with hdf showing no real improvement there. Note that for parquet we need to call df = reportengine.table.str_columns(df) before saving, although this is not the main bottleneck.

I also attach the Jupyter notebook.

format_benchmark.pdf
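
For reference, a rough sketch of the kind of timing loop such a benchmark boils down to (this is not the attached notebook; the frame is synthetic rather than the real FK table, and pytables and pyarrow are assumed to be installed):

```python
import os
import time

import numpy as np
import pandas as pd

# A synthetic frame standing in for the FK table (the real benchmark used
# the CMS_1JET_8TEV table loaded through validphys).
df = pd.DataFrame(np.random.rand(200_000, 50))

def timed(func):
    """Run func once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

results = {}
for fmt, write, read in [
    ("csv", lambda p: df.to_csv(p), lambda p: pd.read_csv(p)),
    ("hdf", lambda p: df.to_hdf(p, key="table"), lambda p: pd.read_hdf(p, key="table")),
    # parquet needs string column labels, hence the astype(str)
    ("parquet",
     lambda p: df.set_axis(df.columns.astype(str), axis=1).to_parquet(p),
     lambda p: pd.read_parquet(p)),
]:
    path = f"benchmark.{fmt}"
    t_write = timed(lambda: write(path))
    t_read = timed(lambda: read(path))
    results[fmt] = (t_write, t_read, os.path.getsize(path) / 1e6)

for fmt, (t_w, t_r, size) in results.items():
    print(f"{fmt}: write {t_w:.2f}s  read {t_r:.2f}s  size {size:.1f} MB")
```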

@RosalynLP (Contributor) commented Mar 22, 2021

Thanks for this @siranipour! So following this, I would still be in favour of hdf when you combine it with how easily it round trips. I was a little confused by what you meant by "no real improvement" though - could you clarify? What is the comparison you mean?

Edit: Sorry of course you mean compared to the current file size, ignore me!

@Zaharid (Contributor) commented Mar 22, 2021

Thanks @siranipour!

I think write times do not matter at all, read times matter a lot, and small files are rather important as well. So we have a 30% improvement in read time and some 40% in file size, in exchange for the annoyance of dealing with the lack of round tripping.

@siranipour (Contributor, Author)

Yeah, I found while benchmarking that the annoying round-trip thing is mostly a one-line fix, i.e. just put df = str_columns(df) before df.to_parquet. I do quite fancy the idea of fast downloading of theories though.

Also, I noticed pandas themselves use parquet in their large-data tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#load-less-data

I'm happy with either at this point
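
For reference, a guess at what such a str_columns helper might look like (I don't know the actual reportengine implementation; this assumes a flat column index, with MultiIndex columns handled separately):

```python
import pandas as pd

def str_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels are strings,
    so that DataFrame.to_parquet accepts it."""
    return df.set_axis(df.columns.astype(str), axis=1)

# usage, just before writing:
# df = str_columns(df)
# df.to_parquet(path)
```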

@RosalynLP (Contributor)

Nice, in that case we might as well go with parquet, right?

@RosalynLP (Contributor) commented Mar 23, 2021

Also, is it just me, but every time I see the word parquet I can't help the image
[image]
although apparently I should be thinking
[image]

@RosalynLP (Contributor) left a review comment

This makes sense, it's nice to simplify these things using parquet... we just have to be careful where we've made the columns strings... I guess it's not that important here because we're only testing, but elsewhere we have to check that all dataframes will have string column labels so that the objects can interact properly.

conda-recipe/meta.yaml (review thread, outdated, resolved)
Co-authored-by: Rosalyn Pearson <[email protected]>
@siranipour (Contributor, Author) commented Apr 4, 2021

The trouble at the moment is that if you have a @table decorator on one of your providers that returns non-string columns, then it crashes:

```yaml
# runcard.yaml
dataset_inputs:
- {dataset: NMCPD, frac: 0.75}
- {dataset: NMC, frac: 0.75}
- {dataset: SLACP, frac: 0.75}
- {dataset: SLACD, frac: 0.75}

theoryid: 53

use_cuts: nocuts

template_text: |
  {@groups_covmat@}

actions_:
  - report(main=True)
```

and run `validphys runcard.yaml --table-format parquet`

But as you say, it's fine for the tests currently. Should we fix this?
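
One possible shape of a fix, sketched here only as an idea and not as anything decided: coerce every column label, including the levels of a MultiIndex, to strings inside the table writer before calling to_parquet. Whether a MultiIndex column index then round trips losslessly through parquet is exactly what would need checking. The function name below is made up.

```python
import pandas as pd

def stringify_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce column labels to strings so that to_parquet accepts them.
    Sketch only: lossless round tripping of a MultiIndex through parquet
    would still need to be verified."""
    cols = df.columns
    if isinstance(cols, pd.MultiIndex):
        cols = pd.MultiIndex.from_tuples(
            [tuple(str(level) for level in key) for key in cols],
            names=cols.names,
        )
    else:
        cols = cols.astype(str)
    return df.set_axis(cols, axis=1)
```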

@RosalynLP (Contributor) commented Apr 5, 2021

I think maybe we can fix it in a separate PR, but we should make that next PR a matter of priority.

@siranipour (Contributor, Author)

What shall we do with this PR?

@Zaharid (Contributor) commented May 28, 2021

Work on it probably.

@siranipour (Contributor, Author)

As in shall we keep the parquet format just for the regression tests?

@Zaharid (Contributor) commented May 28, 2021

I would like to export tables in parquet format.

@scarlehoff (Member) commented Oct 14, 2021

Working on the "destroyingc++" tag, I would actually want to read the current fktables with parquet as well, because the difference between read_csv and the C++ parser is quite notable...

Which I guess would mean exporting them in parquet format first from the current csv?

Is that a possibility?

@Zaharid (Contributor) commented Oct 14, 2021

Yes, I would very much like that. Reading from parquet is faster than cpp. One would need to add an easy step to apfelcomb and also do it for all the theories. An option would be that validphys does it locally...

Incidentally, as discussed elsewhere, I think that pineappl grids should also be stored in some well-known format with lots of compatibility.

@scarlehoff (Member)

> Yes, I would very much like that. Reading from parquet is faster than cpp. One would need to add an easy step to apfelcomb and also do it for all the theories. An option would be that validphys does it locally...

I guess it is better to add the option. A small script is fine I guess.

> Incidentally, as discussed elsewhere, I think that pineappl grids should also be stored in some well-known format with lots of compatibility.

Maybe @cschwan can comment. For the time being I will be using whatever parser is provided by pineappl so if it is changed internally it should not affect us.

@cschwan (Contributor) commented Oct 15, 2021

> Incidentally, as discussed elsewhere, I think that pineappl grids should also be stored in some well-known format with lots of compatibility.

> Maybe @cschwan can comment. For the time being I will be using whatever parser is provided by pineappl so if it is changed internally it should not affect us.

PineAPPL grids don't have a dedicated file format; instead they use serde to (de)serialize the Grid struct. The actual file format can easily be changed, and there are quite a number of formats that can be used. I'm using bincode to write and read files, which turned out to be the fastest when I tested it; however, neglecting backwards compatibility, we can change it to anything. I'm not quite sure if Parquet is supported, but I found Avro in that list. Is that it or something different?

@Zaharid (Contributor) commented Oct 15, 2021

I think serde and afaict bincode are in the same category as pickle, so way too tied to the (current version of the) implementation and difficult to interface with external codes, so good for network exchange but not so much for storing things permanently. Ideally a person would be able to write and read pineappl grids without the Rust library, and the format would contain as much information as possible.

@cschwan (Contributor) commented Oct 15, 2021

> I think serde and afaict bincode are in the same category as pickle, so way too tied to the (current version of the) implementation and difficult to interface with external codes, so good for network exchange but not so much for storing things permanently. Ideally a person would be able to write and read pineappl grids without the Rust library, and the format would contain as much information as possible.

I'm not sure what you mean by 'way too tied', but you can easily replace bincode with JSON, for instance, by changing the code here:

https://github.com/N3PDF/pineappl/blob/4c931effc276ccdad99600678b0538d9a74a6bfc/pineappl/src/grid.rs#L655-L678

In particular, you have to replace the two lines containing bincode with the corresponding code for JSON. I'd say that shows PineAPPL isn't too tied to bincode.

The fact you can't write a PineAPPL grid without its (Python/Rust/C/...) library is a feature in my opinion, because generating the contents is non-trivial. I would advise against it, especially since we now have a Python library that doesn't depend on anything in Rust. Maybe @alecandido can convince you how nice it is 😃 !

@Zaharid (Contributor) commented Oct 15, 2021

> The fact you can't write a PineAPPL grid without its (Python/Rust/C/...) library is a feature in my opinion, because generating the contents is non-trivial. I would advise against it, especially since we now have a Python library that doesn't depend on anything in Rust. Maybe @alecandido can convince you how nice it is 😃 !

There are all the use cases that you didn't think of that would benefit from being able to produce or edit these things in various ways. For example, if I had needed to modify the LHAPDF C++ code to add some function to store more metadata (and recompile it, distribute it, and so on), then I am pretty sure the alphas fit would never have happened.

@cschwan (Contributor) commented Oct 15, 2021

> There are all the use cases that you didn't think of that would benefit from being able to produce or edit these things in various ways. For example, if I had needed to modify the LHAPDF C++ code to add some function to store more metadata (and recompile it, distribute it, and so on), then I am pretty sure the alphas fit would never have happened.

That's a good point. To support this I can add an export subcommand, which converts a grid into a known format that you can edit, and then also add an import subcommand to produce a 'proper' grid again.

@alecandido (Member)

Once more we are talking about multidimensional data (pineappl indeed has a bunch of dimensions shared with eko), so I strongly advise against the use of Parquet: it is only meaningful for storing tables and nothing else, otherwise you're just going to implement a parser on top of Parquet that is in no way better than binary.

If you (@Zaharid) are thinking about metadata (like in the case of LHAPDF), it can already be stored inside a PineAPPL grid, and that is most likely the only thing that you can meaningfully export to and import from YAML, or even JSON or whatever you prefer.

If a standard format is needed for multidimensional data, the only good one is NetCDF. That might be more appropriate for a PineAPPL grid than for eko, but still, the structure of a Grid is more involved than just a 6-dimensional array, so I wonder if it's worth it at all (Subgrids can have a very different structure; if you convert all of them to a dense LagrangeSubgrid then it might be fine).

@scarlehoff (Member)

Re parquet: this would just be a workaround for the slow current fktables. To discuss the "new" fktables maybe you want to open a new issue.

I just mentioned it here because if it is used for the regression tests it might as well be used for the (again, current) fktables.

@Zaharid (Contributor) commented Oct 15, 2021

Yes, we should open a new issue indeed. We could look into other things, but I believe the issue of multidimensional data is a bit of a red herring: if you want to store more than two dimensions, pick two for the table and use a multiindex for the rest. Indeed, as far as I remember, the current sigma dataframe (which is a multiindexed object) trivially round trips with parquet (which is how I ran the tests showing it was faster).
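
A minimal sketch of the "pick two dimensions for the table and multiindex the rest" idea (the axis and column names are made up and do not correspond to the actual FK table layout; pyarrow is assumed, and my understanding is that a recent pandas/pyarrow restores the row MultiIndex from the parquet metadata):

```python
import numpy as np
import pandas as pd

# A fake 4-dimensional object flattened into a 2-d table:
# three of the dimensions go into a row MultiIndex, one stays as columns.
index = pd.MultiIndex.from_product(
    [range(3), range(4), range(5)], names=["data_point", "x1", "x2"]
)
df = pd.DataFrame(
    np.random.rand(len(index), 2),
    index=index,
    columns=["flavour_0", "flavour_1"],  # string labels, so parquet is happy
)

df.to_parquet("grid.parquet")
roundtrip = pd.read_parquet("grid.parquet")

# The row MultiIndex (including names and level dtypes) comes back.
pd.testing.assert_frame_equal(df, roundtrip)
```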

@scarlehoff marked this pull request as draft November 22, 2021 12:38
@scarlehoff (Member) commented Mar 14, 2023

Since we have moved away from the C++ reading (and new fktables are now pineappl), I think we should close this. In addition, the regression tests have changed a few times in the last two years.

(Also, I believe having the regression tests in human-readable / GitHub-diffable csv is actually very useful.)

@scarlehoff closed this Mar 14, 2023
@scarlehoff deleted the parquet branch March 14, 2023 15:13