Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How should equality between frames behave? #2141

Open
evansd opened this issue Oct 7, 2024 · 3 comments
Open

How should equality between frames behave? #2141

evansd opened this issue Oct 7, 2024 · 3 comments

Comments

@evansd
Copy link
Contributor

evansd commented Oct 7, 2024

This is fairly niche, but came up when some experienced Python programmers were first learning ehrQL. They wondered whether two frame operations produced the same thing and so tried writing frame_a == frame_b.

At present, this just uses the default Python object identity behaviour. There are two reasonable alternatives to this:

  1. Compare the related query model objects (i.e. the _qm_node attributes). This is the semantically correct thing to do: query model nodes are value objects and will compare equal iff they are structurally identical.

  2. Blow up with an error: ehrQL series objects have special equality behaviour where the result is a new series. Comparing a series with a frame already triggers a (hopefully) explanatory error. Comparing two frames is more likely to be a typo/thinko by an novice ehrQList than it is to be the investigation of a curious Pythonista, so raising an error is the more helpful thing to do.

I think I'm leaning in favour of option 2. Adding guardrails for novices is particularly important in our context. And experienced Pythonistas will be better placed to cope with the unexpected error.

I'd also add that, in the specific context in which this came up, the users were confused by the unexpected False result. But that result was actually correct in the sense that the two expression they assumed were semantically equivalent were not actually equivalent. So it's not clear that doing the semantically correct thing would actually have been helpful in any case.

Slack thread:
https://bennettoxford.slack.com/archives/C069YDR4NCA/p1727966431279409?thread_ts=1727963482.698439&cid=C069YDR4NCA

@madwort
Copy link
Contributor

madwort commented Oct 7, 2024

+1 for option 2, my guess is that you're right & that in production ehrQL this usage would be a thinko.

Heading off on a tangent about option 1:
How & when does ehrQL simplify the graph of a query? I haven't checked the details, but in the specific situation that this came up, the two frame definitions were possibly equivalent rather than equal ("where_not equals" vs "where not_equals") - they would ideally need to be in a standard form to do the best job of option 1?

Another tangent - this ticket is about what should we do for the syntax frame == frame. If we had an aim to compare the results of frames, do we currently have a way to do that (spitballing: frame.get_results() == frame.get_results())? Not for "production ehrQL usage" but for researchers to learn & explore in the sandbox.

(discussion of tangents could move to Slack rather than this ticket!)

@evansd
Copy link
Contributor Author

evansd commented Oct 7, 2024

the two frame definitions were possibly equivalent rather than equal

My point was that, unless I misunderstood the example, they aren't equivalent. The expression:

frame.except_where(condition)

Is equivalent to:

frame.where(~condition | condition.is_null())

Which it has to be if it's going to act as the inverse of where(). So it is semantically distinct from:

frame.where(~condition)

More generally though you're right that will be semantically equivalent queries (e.g. ~(a & b) and ~a | ~b) which are compare non-equal as they're structurally different.


On comparing the results of frames in the sandbox, I wonder if we could override the __iter__ method in the same way we do the __str__ method. So you could directly compare the contents of frames with:

list(frame_a) == list(frame_b)

@madwort
Copy link
Contributor

madwort commented Oct 9, 2024

they aren't equivalent

ahhhhhhhh I hadn't spotted that. In which case the idea of comparing the results of frames in the sandbox could be more misleading than it is helpful. Scratch all that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants