

groot/bench-opendata: fix task-7 #5

Merged: 1 commit merged into go-hep:master from issue-4 on Apr 17, 2020

Conversation
Conversation

@sbinet (Member) commented on Apr 16, 2020

  • consider all leptons as a whole
  • correct isolation criteria

Fixes #4.

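The two bullets can be sketched in Go; a hypothetical illustration (the `Lepton` type, `countIsolated`, and the 0.10 cut are inventions for this sketch, not the benchmark's actual types or criteria) of merging all lepton flavours into one collection before applying a single isolation cut:

```go
package main

import "fmt"

// Lepton is a hypothetical, simplified lepton record (electron or muon).
type Lepton struct {
	Flavour string  // "e" or "mu"
	Pt      float64 // transverse momentum (GeV)
	Iso     float64 // isolation variable (e.g. summed pT of nearby tracks)
}

// countIsolated applies a relative-isolation cut (Iso/Pt < max) to a
// single merged slice of leptons, regardless of flavour.
func countIsolated(leptons []Lepton, max float64) int {
	n := 0
	for _, l := range leptons {
		if l.Iso/l.Pt < max {
			n++
		}
	}
	return n
}

func main() {
	// One event's electrons and muons, merged before the cut is applied,
	// rather than each flavour collection being treated on its own.
	elecs := []Lepton{{"e", 25, 1.0}}
	muons := []Lepton{{"mu", 40, 2.0}, {"mu", 15, 6.0}}
	leptons := append(append([]Lepton{}, elecs...), muons...)

	fmt.Println(countIsolated(leptons, 0.10)) // 2 leptons pass the cut
}
```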
@sbinet mentioned this pull request on Apr 16, 2020
@ingomueller-net (Contributor) commented:
Thanks for the fast reply and fix! The code looks good to me (but I learned Go just for the purpose of running this benchmark, so don't trust me). I did confirm that the resulting histogram is exactly the same as three other implementations: (1) mine in Go, (2) the Coffea implementation, (3) one I am writing in SQL.

@sbinet (Member, Author) commented on Apr 17, 2020

cool!

(out of curiosity, how was learning Go? would you recommend it to other HEP/astro/... scientists?)

@sbinet merged commit 34f4405 into go-hep:master on Apr 17, 2020
@sbinet deleted the issue-4 branch on Apr 17, 2020 at 10:43
@ingomueller-net (Contributor) commented:

I found it at least fairly readable; most of my "learning" consisted of guessing what your code does. It also seems pretty concise. I don't know about performance, which is probably one aspect scientists care about. Also, as a researcher in database systems, I think that SQL might be a good (better?) choice ;)

@sbinet (Member, Author) commented on Apr 17, 2020

thanks.

> Also, as a researcher in database systems, I think that SQL might be a good (better?) choice ;)

that last statement may be a bit biased :P
at least it isn't the same bias as mine.
even if I did dabble in those waters:

(so once you have rewritten all these examples into a set of fancy SQL queries, you could do the same with groot/rsql :P)

@ingomueller-net (Contributor) commented:

Yes, I suspect so :) Thanks a lot for the pointers, I'll take a look! May I ask why you abandoned that direction? Why is SQL not well suited?

@sbinet (Member, Author) commented on Apr 17, 2020

SQL is fine but a bit verbose for "Joe the physicist".
@jpivarski has been working on creating an SQL-like language more (better?) tailored to the kind of work HEP people do:

@jpivarski commented:

PartiQL was an experiment with the idea of an SQL model (collection of tables with different lengths, related to one another through foreign keys) applied to each event of a physics dataset. When an SQL model is applied to whole datasets (i.e. a "table of events") then it's possible to express HEP queries, but extremely cumbersome because every query is a "join on eventId" as we're never interested in relating quantities from one event to another event, and the handling of combinatorics of particles within an event gets intricate. This question I asked 4 years ago convinced me that the SQL model is inappropriate for HEP datasets.
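The join-on-eventId point can be sketched in Go (the `Muon` type and data are hypothetical, and the SQL in the comment is illustrative, not a tested query): in a whole-dataset table model, pairing particles within an event is a self-join on `eventId`, while per-event iteration expresses the same combinatorics as two nested loops.

```go
package main

import "fmt"

// Muon is a hypothetical flat record, as it would appear in a
// whole-dataset "table of muons" keyed by eventId.
type Muon struct {
	EventID int
	Charge  int
	Pt      float64
}

// pairsPerEvent counts opposite-charge muon pairs within each event.
// On the flat table this would be an SQL self-join, roughly:
//
//	SELECT ... FROM muons a JOIN muons b
//	  ON a.eventId = b.eventId AND a.charge != b.charge ...
//
// whereas grouping by event reduces it to nested loops per event.
func pairsPerEvent(muons []Muon) map[int]int {
	byEvent := map[int][]Muon{}
	for _, m := range muons {
		byEvent[m.EventID] = append(byEvent[m.EventID], m)
	}
	counts := map[int]int{}
	for id, ms := range byEvent {
		for i := 0; i < len(ms); i++ {
			for j := i + 1; j < len(ms); j++ {
				if ms[i].Charge*ms[j].Charge < 0 {
					counts[id]++
				}
			}
		}
	}
	return counts
}

func main() {
	muons := []Muon{
		{0, +1, 30}, {0, -1, 25}, // event 0: one opposite-charge pair
		{1, +1, 40}, {1, +1, 20}, // event 1: same charges, no pair
	}
	fmt.Println(pairsPerEvent(muons)) // map[0:1]
}
```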

However, applying an SQL model to each event of a HEP dataset does make some sense: each particle collection (e.g. jets, muons, ...) has a different number of rows and possibly different columns for attributes, so each particle collection in one event is an SQL table. We have connections between these tables using foreign keys, usually row number for a particular sorting of the tables—this is a surrogate key.
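A minimal sketch of this event-as-database picture, assuming hypothetical `Jet`/`Muon`/`Event` types (this is not func_adl or PartiQL code): each particle collection is a tiny table, and a row number acts as the surrogate key linking one table to another.

```go
package main

import "fmt"

// Jet and Muon are rows of two small per-event "tables"; the fields
// are hypothetical.
type Jet struct {
	Pt      float64
	MuonRow int // surrogate key: row number in the event's muon table, -1 if none
}

type Muon struct {
	Pt float64
}

// Event is one "database": a collection of small tables.
type Event struct {
	Jets  []Jet
	Muons []Muon
}

// matchedMuonPt resolves the surrogate-key "join": for each jet with a
// matched muon, look up that muon's pT by row number.
func matchedMuonPt(ev Event) []float64 {
	out := make([]float64, 0, len(ev.Jets))
	for _, j := range ev.Jets {
		if j.MuonRow >= 0 && j.MuonRow < len(ev.Muons) {
			out = append(out, ev.Muons[j.MuonRow].Pt)
		}
	}
	return out
}

func main() {
	ev := Event{
		Jets:  []Jet{{Pt: 50, MuonRow: 1}, {Pt: 35, MuonRow: -1}},
		Muons: []Muon{{Pt: 10}, {Pt: 22}},
	}
	fmt.Println(matchedMuonPt(ev)) // [22]
}
```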

This event-as-database was Gordon Watts's idea, which he is developing as func_adl. PartiQL was my experiment with the idea: it was originally going to be SQL syntax, but real SQL was still too cumbersome (we always want to join on these surrogate keys, never any natural keys), so only a few keywords like JOIN are derived from SQL. It's also being more seriously developed by Lindsey Gray as AwkwardQL.

Finally, if we do go the event-as-database route, it would be an entirely different database implementation from what's out there. Most databases are optimized for big tables; this would have to be optimized for many little tables, as the number of particles per event is 0‒2 for some particle types. Instead of a table with a billion rows, we have a billion tables with maybe 1 row. Time complexity for each operation (i.e. O(n) vs O(log(n)) where n is the number of elements in a table; the number that participate in a join) becomes much less important than setup and tear-down times when iterating from one event (collection of tables) to the next. It's an unusual problem that would require unusual solutions.

Surely I'm digressing from the focus of this PR, though...

This pull request closes: Possible bugs in Task 7 (#4)