
Migrate most shrinker functions to the ir #3962

Merged: 40 commits from shrinker-ir into HypothesisWorks:master, May 29, 2024

Conversation

@tybug (Member) commented May 1, 2024

part of #3921.

This is a big one, apologies! Hard to avoid; the remaining functions had cascading dependencies on each other.

I've benchmarked these changes by hooking minimal with derandomize=False and 5 trials.

Benchmark plots (images): latest at 866256e; earlier runs at a9f50c6, d9a077d, f7d0d29, d81d50e, 901d636.

I think there are a few places we are still not optimal:

  • doubling up on Integer.shrink from the negative and positive direction (see review comment)
  • to know whether an ir tree A is better than B ahead of time, we have to convert both to a buffer and sort_key the buffers, which costs a call. This will be improved when we move our ordering function to the ir tree; see the sketch below.
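
For reference, the current comparison looks roughly like this (sort_key mirrors the shrinker's shortlex ordering on buffers; ir_to_buffer is a hypothetical stand-in for the conversion step, which is what costs the call):

def sort_key(buffer):
    # shortlex ordering: shorter buffers first, ties broken lexicographically
    return (len(buffer), buffer)

def ir_tree_is_better(a, b):
    # requires materializing both trees as buffers before comparing
    return sort_key(ir_to_buffer(a)) < sort_key(ir_to_buffer(b))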

Conversations of note:

@tybug requested a review from Zac-HD as a code owner May 1, 2024 22:49
@@ -2292,12 +2327,16 @@ def _pop_ir_tree_node(self, ir_type: IRTypeName, kwargs: IRKWargsType) -> IRNode
        # (in fact, it is possible that giving up early here results in more time
        # for useful shrinks to run).
        if node.ir_type != ir_type:
            # needed for try_shrinking_nodes to see what node was *attempted*
            # to be drawn.
            self.invalid_at = (ir_type, kwargs)
            self.mark_invalid(f"(internal) want a {ir_type} but have a {node.ir_type}")
@tybug (Member, Author) commented May 1, 2024

Now that we've moved to a more strongly typed representation in the ir, we do pretty badly on a class of problems: crossing typed boundaries. In particular, shrinking one_of(A, B) where A has a different type/kwargs than B. Before, we would shrink the integer for one_of, causing it to choose A instead of B. A interprets the bits differently than B, but we would get lucky most of the time and find a counterexample; and once in A, we do more robust shrinking. In the ir we're much more strict and mark_invalid as soon as we have a misalignment, so the initial coincidental slip from B to A never happens.

One thing that works here is drawing randomly instead of giving up when we hit a misalignment, which I would say is basically equivalent to master behavior.

Concretely:

# condition avoids trivial counterexamples
assert minimal(st.integers(101, 200) | st.integers(0, 100), lambda n: n not in [101, 0]) == 102

fails on this branch and passes on master (I'll add this as a test case as well).
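
For illustration, the fallback would change the mismatch branch of _pop_ir_tree_node along these lines (a sketch only; allow_random_slips and draw_random_value are hypothetical names, not code from this PR):

def _pop_ir_tree_node(self, ir_type, kwargs):
    node = self.ir_tree_nodes[self.node_index]  # bookkeeping simplified
    if node.ir_type != ir_type:
        if self.allow_random_slips:  # hypothetical opt-in flag
            # keep going with a fresh draw of the requested type, so a
            # coincidental slip from B's structure into A's can still happen
            return self.draw_random_value(ir_type, kwargs)  # hypothetical
        self.invalid_at = (ir_type, kwargs)
        self.mark_invalid(f"(internal) want a {ir_type} but have a {node.ir_type}")
    return node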

@tybug (Member, Author) commented:

idle thoughts: I wonder if a short "randomly generate ir trees smaller than our current value" phase when the shrinker would otherwise exit as stuck would get us out of many of these coincidental slips. Possibly not, since if we didn't have to exploit some fixed structure while shrinking, we likely would have found this counterexample during the generation phase.

Contributor commented:

I haven’t touched Hypothesis in a few years, so take this with a grain of salt, but I vaguely recall that having the shrinker introduce randomness into the data stream has tended to be a bad idea.

Normally, the shrinker tends to make its input both shorter and simpler over time, despite the fact that (999 random bytes) is technically “smaller” than (1000 zero bytes).

But if the shrinker starts reintroducing randomness, this falls apart, and the shrinker will get obsessed with some high-entropy input because it happens to be shorter than the current low-entropy input.

Member commented:

Seconded on this. Specifically the problem that tends to result is that the performance becomes terrible, because you end up repeatedly reintroducing lots of expensive-to-shrink regions in order to make minor size gains.

My guess is that you could possibly fix this with some sort of low-entropy generation mode (e.g. always generate one of the two smallest values for a given type) but I'm not sure how robust that would be and it potentially still has the same problem.
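
For integers, that might look something like this sketch (a hypothetical helper assuming finite bounds, not an existing API):

import random

def low_entropy_integer(min_value, max_value, shrink_towards=0):
    # always return one of the two simplest permitted values, so any region
    # the shrinker reintroduces stays cheap to shrink away again
    candidates = [shrink_towards, shrink_towards + 1, shrink_towards - 1]
    permitted = [v for v in candidates if min_value <= v <= max_value]
    if not permitted:
        # the range excludes shrink_towards entirely; clamp to the nearer bound
        return min_value if shrink_towards < min_value else max_value
    permitted.sort(key=lambda v: (abs(v - shrink_towards), v < 0))
    return random.choice(permitted[:2])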

@tybug (Member, Author) commented:

That intuition makes sense to me. I think performance may still be ok if we only introduce randomness as a last-resort phase in the shrinker when we are otherwise at a fixpoint. iirc the shrinker used to have some dynamic enabling logic similar to this. Of course we're now complicating the shrink pass logic and have to justify the tradeoff in complexity as worthwhile. Alternatively low-entropy generation also seems reasonable.

It just feels quite bad to not be able to shrink st.integers() | st.floats() because we happened to hit an error in the floats region first, and we are seeing a few regressions in the shrink quality test suite due to this (though how much that maps to reality is unclear to me).

Comment on lines 1345 to 1354
# try shrinking from both sides towards shrink_towards
shrink_towards = kwargs["shrink_towards"]
Integer.shrink(
    abs(shrink_towards - value),
    lambda n: self.try_shrinking_nodes(nodes, shrink_towards + n),
)
Integer.shrink(
    abs(shrink_towards - value),
    lambda n: self.try_shrinking_nodes(nodes, shrink_towards - n),
)
@tybug (Member, Author) commented May 1, 2024

This is pretty brute force and does ~double the shrinks necessary in the worst case.

I think master behavior for integer shrinking is something like "shrink towards 0 on the current sign (only positive or only negative) and occasionally try to flip the sign bit". I think master gets a bit luckier/smarter than this in practice, because a straight-line translation of that here shrinks minimal(st.integers(), lambda n: not (-1 < n < 2)) to 2 instead of -1; to find -1 it needs to pursue the worse shrink of 2 to -2 and then shrink -2 to -1.
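
For concreteness, that straight-line translation would be roughly the following, against the snippet under review (a sketch, not this PR's code):

# shrink the magnitude only on value's current side of shrink_towards...
sign = 1 if value >= shrink_towards else -1
Integer.shrink(
    abs(shrink_towards - value),
    lambda n: self.try_shrinking_nodes(nodes, shrink_towards + sign * n),
)
# ...and separately probe the mirrored value instead of a full second shrink
self.try_shrinking_nodes(nodes, shrink_towards - sign * abs(shrink_towards - value))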

A complicating factor here is that shrinking behavior differs for bounded and unbounded integers:

assert minimal(st.integers(), lambda n: not (-1 < n < 2)) == -1
assert minimal(st.integers(-5, 5), lambda n: not (-1 < n < 2)) == 2
assert minimal(st.integers(max_value=5), lambda n: not (-1 < n < 2)) == -1
assert minimal(st.integers(min_value=-5), lambda n: not (-1 < n < 2)) == -1

which we can't change until we move the ordering to the ir.

Member commented:

> to find -1 it needs to pursue the worse shrink of 2 to -2 and then shrink -2 to -1

I suspect it's getting lucky with some sort of punning here, as the integers strategy is a mixture under the hood. I don't think it can reliably do this strategy - there's no backtracking in the shrinker - but it might be doing something that causes previous bytes for a larger integer width to be reinterpreted as a negative integer.

@tybug (Member, Author) commented:

yeah, good point. master fails on a trivial extension of this problem:

assert minimal(st.integers(101, 200) | st.integers(0, 100), lambda n: n not in [101, 102, 0]) == 103

so it's just getting lucky.

@Zac-HD (Member) commented May 3, 2024

Can we get a secondary (log) y-axis on the plot showing proportional change, in addition to the absolute number? +800 calls is quite different on a base of 50 than on a base of 2,000!

@tybug (Member, Author) commented May 3, 2024

I've updated the graph to include an n✕ change axis. I'm looking into the worst discrepancies there next.

@Zac-HD (Member) commented May 3, 2024

> I've updated the graph to include an n✕ change axis. I'm looking into the worst discrepancies there next.

Fwiw this looks fine to me - the large-negative multipliers are relative to a small baseline, and the tails with large absolute changes aren't that large in relative terms. Might as well look into it, but don't stress too much + consider doing so as a follow-up later.

@tybug (Member, Author) commented May 4, 2024

agree it doesn't look hugely out of line, but I won't feel confident in this pull until I look into it, for my own sake more than anything 😄

I updated the benchmark for 50b2b13 + d81d50e. It's a bit misleading because we're basically getting a free constant shrink boost by disabling the dfa, so I would mentally downweight the improvements a bit. That said, it's looking quite a bit better. More investigation of where we do badly to come.

@Zac-HD (Member) commented May 4, 2024

> I updated the benchmark for 50b2b13 + d81d50e. It's a bit misleading because we're basically getting a free constant shrink boost by disabling the dfa, so I would mentally downweight the improvements a bit. That said, it's looking quite a bit better. More investigation of where we do badly to come.

Nice! You could get a buffer-without-dfas baseline if you wanted to, but I don't think this is really necessary - you're already measuring the change as we're going to ship it, and it's looking good. I guess in principle the downside might be some loss of shrink quality, but I don't expect anyone to notice that.

@tybug (Member, Author) commented May 25, 2024

@Zac-HD how important would you say the test_removes_needless_steps and test_prints_equal_values_with_correct_variable_name stateful failures (eg) are as blockers? It appears stateful shrinking with bundles may have gotten worse. It's going to take me a while to track down why though, because the ir trace for bundle draws isn't exactly easy to follow.

@Zac-HD (Member) commented May 25, 2024

> How important are the test_removes_needless_steps and test_prints_equal_values_with_correct_variable_name stateful failures (eg) as blockers? It appears stateful shrinking with bundles may have gotten worse.

I'm OK with taking a regression on these, since I'm confident that we can fix it later and we're seeing a substantial win overall. Just make sure to track them in an issue, along with the other shrinker issues noted in comments / review below.

@Zac-HD (Member) left a review

I read through the whole diff, and it looks good to me - I've tagged everywhere that I thought we might want to revisit later, but overall I'll be happy with minimal changes to get the tests passing.

And this is a huge contribution. Thanks, Liam 😍

Resolved (outdated) review threads on: hypothesis-python/RELEASE.rst, hypothesis-python/tests/conjecture/test_engine.py, hypothesis-python/tests/conjecture/test_minimizer.py
@tybug (Member, Author) commented May 26, 2024

ah good, now we get to the fun step of debugging newly flaky tests 🙂

@tybug (Member, Author) commented May 26, 2024

looks like an actual failure: https://github.com/HypothesisWorks/hypothesis/actions/runs/9240572106/job/25421012169?pr=3962. We're shrinking to 102 instead of 101. I won't be getting to this tonight.

@tybug (Member, Author) commented May 27, 2024

I pushed two fixes:

  • 41b15e3, which fixes a case where we tried out-of-bounds integers while shrinking, leading to assertion errors. We should move our assertions about forced values in draw_* to assert ir_value_permitted(forced, ir_type, kwargs) (I have a personal todo for this and will follow up).
  • 866256e, which fixes the above failure (explanation below).

Updated graphs in description.


Here's the cause of the above failure (test_reports_target_results):

  • we generate two counterexamples: x=101 and x=n > 101, in that order. x=101's buffer is very large because of e.g. failed probes or differing size buckets.
  • we try to shrink x=n, and eventually try x=101 while shrinking. But we had already cached the ir tree [101] as corresponding to a large buffer above, so we're "blocked" from progressing any farther than x=102.

We could potentially solve a large number of these occurrences by removing discarded portions of the buffer before caching. A more general approach, which also handles "larger because drawing from different size buckets", is to not cache interesting buffer datas at all and to rely on the ir tree corresponding to the smallest buffer via forced. I think this is the right approach, given that the vast majority of datas with status.INTERESTING arrive on the ir via the shrinker and so are fine to cache. The overhead is one or two cache misses when trying the same interesting buffer data.
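
A minimal sketch of that policy (maybe_cache and from_ir_replay are assumed names, not the PR's actual structures):

from hypothesis.internal.conjecture.data import Status

def maybe_cache(cache, buffer, data, *, from_ir_replay):
    # an interesting buffer-based result can pin its ir tree to an oversized
    # buffer (failed probes, size buckets), blocking later shrinks that reach
    # the same tree through a smaller buffer, so don't key it by buffer.
    # ir-replayed results already correspond to the smallest buffer via
    # forced draws and are fine to cache.
    if data.status is Status.INTERESTING and not from_ir_replay:
        return
    cache[buffer] = data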

Reproducer:

import traceback

from hypothesis import given, settings, strategies as st

@given(st.integers(min_value=0))
@settings(database=None)
def f(x):
    assert x <= 100

for i in range(500):
    print("-" * 50, i)
    try:
        f()
    except Exception as e:
        assert "x=101" in "".join(traceback.format_exception(e))

@tybug (Member, Author) commented May 27, 2024

oops, I read the wrong number — shrinking is a mean of 2.12x faster, but a median of 1.38x faster. I fixed the release notes. I think the median is the more useful and realistic number to report, though getting big wins on the tail end is of course also important.

@tybug (Member, Author) commented May 29, 2024

I looked into the stateful regression and I'm confident we can improve it in the near future. The short of it is we'll need to support slips from the integer 0 to the boolean False, and realign pseudo choices, i.e. it should be allowed to draw an integer with min_value=0, max_value=0 from an ir tree whose node has value=2.
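
As a sketch, the pseudo-choice part could be a check along these lines (a hypothetical helper, integer case only; kwargs names follow draw_integer):

def is_pseudo_choice(ir_type, kwargs):
    # a draw that permits exactly one value can be satisfied regardless of
    # what the misaligned node holds, so it needn't force a mark_invalid
    return (
        ir_type == "integer"
        and kwargs["min_value"] is not None
        and kwargs["min_value"] == kwargs["max_value"]
    )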

I've completely skipped the problematic tests for now.

@Zac-HD (Member) left a review

:shipit: 🎊

@Zac-HD enabled auto-merge May 29, 2024 02:55
@Zac-HD merged commit 822e39d into HypothesisWorks:master May 29, 2024
54 checks passed
@tybug deleted the shrinker-ir branch May 29, 2024 03:45