provoke bugs that rely on iteration order of hash maps/sets #181
Conversation
Previously, pubgrub used an alias for HashMap so that everything used the same implementation. But it didn't do the same for HashSet, and indeed, there were some uses of FxHashSet and std::collections::HashSet.
This creates new types instead of aliases so that we have complete control over what is being done. In particular, we'd like to use this to shuffle iteration order to try and provoke a non-deterministic bug.
We'll need this so that we can shuffle iteration order. Note that we also include wrappers for iterator types, which is where we'll shuffle the ordering in the next commit. Our iterator types are of course not generally appropriate since they allocate everything up front, but it's good enough for provoking non-deterministic behavior.
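The wrapper described above can be sketched as follows. This is a hypothetical, std-only illustration (not pubgrub's actual wrapper code): a newtype over `HashMap` whose `iter` collects every entry up front and then shuffles it, so any caller that silently depends on iteration order should start failing.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{SystemTime, UNIX_EPOCH};

// Newtype wrapper: same API surface we need, but `iter` deliberately
// randomizes order to surface order-dependent bugs.
struct ShuffledMap<K, V>(HashMap<K, V>);

impl<K: Hash + Eq, V> ShuffledMap<K, V> {
    fn new() -> Self {
        Self(HashMap::new())
    }

    fn insert(&mut self, key: K, value: V) -> Option<V> {
        self.0.insert(key, value)
    }

    // Allocates all entries up front, then applies a Fisher-Yates
    // shuffle driven by a cheap time-seeded LCG (no external crates).
    fn iter(&self) -> impl Iterator<Item = (&K, &V)> {
        let mut entries: Vec<_> = self.0.iter().collect();
        let mut seed = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .subsec_nanos() as u64
            | 1;
        for i in (1..entries.len()).rev() {
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let j = (seed % (i as u64 + 1)) as usize;
            entries.swap(i, j);
        }
        entries.into_iter()
    }
}

fn main() {
    let mut map = ShuffledMap::new();
    for k in 0..10u32 {
        map.insert(k, k * 2);
    }
    // The set of entries is unchanged; only the order varies run to run.
    let mut keys: Vec<u32> = map.iter().map(|(k, _)| *k).collect();
    keys.sort();
    assert_eq!(keys, (0..10).collect::<Vec<u32>>());
}
```

Allocating everything into a `Vec` per iteration is wasteful, but as the commit message says, it's good enough for provoking non-deterministic behavior in tests.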
And finally, this commit is the pay-off of previous refactorings. We can now easily randomize iteration order over hash maps/sets. The idea here is to provoke bugs that depend on iteration order.
We need to make our wrapper implement the serde traits.
One thing we put some effort into when designing this lib is removing, as much as possible, anything that would produce non-reproducible effects. In that direction, I remember we chose hash functions that are explicitly deterministic. If we recently introduced one that is not deterministic, it may be an oversight. The test you highlight, "prop_same_on_repeated_runs", is there to catch such regressions. I'll let Jacob complement this info, as I haven't followed the recent changes as closely as I hoped.
Also, the error you see is, I believe, an error stating that the failure cases are not the same. So if you explicitly introduced hash maps with shuffled iteration order, it makes sense that the dependency trees in an error case can be different. Indeed, there can be different reasons why a set of dependencies might be incompatible, and different paths may produce different errors. So I interpret the error you got as normal behavior of that failsafe test, which detects that something non-reproducible is happening. But it doesn't mean the solver failed when it should have succeeded; it just means it failed differently. At least that's my interpretation.
I believe that interpretation roughly matches the kind of failure we're seeing on our non-public code. That is, something about how the solver failed can change from run to run, which in turn changes the error message presented to the end user.
First off, it's an honor to receive this PR/bug report from you! I have long been an admirer of your work. As you noticed, it is much harder to test code that has non-determinism, so for our own sanity we try to make sure our behavior stays deterministic. If any of the other tests were failing with this change, I would be much more concerned. We should not have the correctness of the output depend on the iteration order of hash maps. We had such a bug in one of our early releases, and the non-determinism was very helpful in sussing it out. It would be really nice to have the hasher available in our public API, so that we could have an "is deterministic" test with fxhash and an "error does not depend on hashing" test if it uses std's randomized hasher. But the complexity of the public API is not worth it. (I'm open to other suggestions about having our cake and eating it too.) If you are seeing nondeterministic behavior, then let's extend our tests to find it and keep us from reintroducing it.
I guess the way I look at this is that this PR exposes some kind of non-deterministic behavior that is already present in the library. It "just" makes it much easier to see. AFAIK, even if you're using a deterministic hash function, that doesn't give you any guarantees about the iteration order of a hash map/set. The order is still arbitrary. I believe there are some uses of
OK, I opened #182 for that last change.
I think your paragraphs are in contradiction with each other.
It's a shot-in-the-dark band-aid and makes the library consistently use the same hash types. To be clear, I have no clue whether the last remaining uses of
Places where I can easily see the order of the solver changing are:
If the first one is the cause of the problem, then you should have a look at the prioritization code in your private repo. I believed the iteration order should stay the same, but you're certainly right that a deterministic hasher does not mean a deterministic iteration order; I don't know if they are related. If that's indeed the problem, I think we could fix it by using an indexed map. It would be nice to confirm that's the source.
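The "indexed map" fix suggested here can be illustrated with a minimal std-only sketch of the idea behind crates like `indexmap` (this is not `indexmap`'s actual implementation): entries live in a `Vec`, so iteration order equals insertion order regardless of the hasher, while a `HashMap` of indices keeps lookups O(1).

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Minimal insertion-order-preserving map: iteration follows the Vec,
// so resolver traversal becomes reproducible across runs.
struct InsertionOrderMap<K, V> {
    index: HashMap<K, usize>,
    entries: Vec<(K, V)>,
}

impl<K: Hash + Eq + Clone, V> InsertionOrderMap<K, V> {
    fn new() -> Self {
        Self { index: HashMap::new(), entries: Vec::new() }
    }

    fn insert(&mut self, key: K, value: V) {
        // Copy the index out first so the borrow of `self.index` ends
        // before we mutate `self.entries`.
        match self.index.get(&key).copied() {
            Some(i) => self.entries[i].1 = value,
            None => {
                self.index.insert(key.clone(), self.entries.len());
                self.entries.push((key, value));
            }
        }
    }

    fn iter(&self) -> impl Iterator<Item = (&K, &V)> {
        self.entries.iter().map(|(k, v)| (k, v))
    }
}

fn main() {
    let mut m = InsertionOrderMap::new();
    m.insert("b", 2);
    m.insert("a", 1);
    m.insert("c", 3);
    m.insert("a", 10); // overwrite keeps the original position
    let keys: Vec<&str> = m.iter().map(|(k, _)| *k).collect();
    assert_eq!(keys, vec!["b", "a", "c"]);
}
```

In practice one would just use the `indexmap` crate rather than hand-rolling this, but the sketch shows why swapping it in removes the hasher from the ordering question entirely.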
Also @Eh2406, if the unit propagation buffer is turned into a hash map (with non-deterministic order), it may have the same effect.
If it's helpful, here's the diff for the snapshot for one of our flaky tests:

And because albatross==3.0.0 depends on bluebird==3.0.0 we can conclude that any of:␊
albatross<2.0.0␊
albatross>2.0.0␊
depends on one of:␊
- bluebird<=1.0.0␊
- bluebird>=3.0.0␊
+ bluebird==1.0.0␊
+ bluebird==3.0.0␊
␊
And because you require one of:␊
albatross<2.0.0␊
albatross>2.0.0␊
- and you require bluebird>=2.0.0,<3.0.0, we can conclude that the requirements are unsatisfiable.
+ and you require bluebird>=2.0.0,<3.0.0, we can conclude that the requirements are unsatisfiable.␊
I think you're technically correct, the best kind of correct. std probably makes no guarantees about its iteration order, and it's allowed to add a "shuffle before iterating" as an implementation detail without regard to which hasher is used. Given that we don't rely on iteration order for correctness, only for ease of testability, I feel happy leaving that in "theoretical problem land" until bigger harms are demonstrated or std actually adds that code. That is to say, if there is a reason to reduce our reliance on the iteration order (performance, an available bug demonstrating inconsistency, correctness, code clarity...), wonderful. But without one of those I'm still willing to trust that things are deterministic enough.
Can I suggest running repeatedly with extensive logging until you see two different results, and then comparing where the logs first diverge? Even more thoroughly you could record traces with
After many layers of type aliases the map is https://github.com/pubgrub-rs/pubgrub/blob/release/src/type_aliases.rs#L6 so that's unlikely to be the problem.
That is a very good suggestion: make a branch changing all Hash* to use IndexMap, and see if your nondeterminism goes away.
Yes. Definitely. If that code goes non-deterministic, testing is going to be very painful.
That is really interesting. Is it possible to diff the DerivationTree? That could help isolate whether the difference is happening in resolution or in reporting the error.
Yeah, that was the first thing I tried. But it is spectacularly difficult to reproduce, unfortunately. I've tried repeating the flaky tests (on our end) 10,000 times in multiple different ways, but can't force them to fail. They just randomly fail in CI or locally. In my experience, the only pattern I tend to see is that they fail when I run tests for the first time after I've "been away" for a while (e.g., when I start work in the morning). But that could be a phantom. That's when I switched gears to this method with the hope it would provoke something. My next step is probably indeed to add some kind of extra logging about the internal state of things when the test fails (with the derivation tree indeed being something I'd want to record). But this will require some non-trivial work on our end, I think, just because of how the test works. So I'm basically trying the "easier" things first to track it down. The other thing on my mind is to take the approach in this PR and make it possible to toggle at a finer level of granularity, i.e., replace individual hash maps one at a time with hash maps using explicitly random order until I provoke the test failure. For example, with this PR's change, I can provoke a test failure on our internal code reliably. (Although the diff in that case is larger than the one @zanieb shared above.)
I added a debug print of the derivation tree to our snapshots and it looks like it does not change! I think this suggests that it's either non-determinism in
It's good to hear that it's not in resolution proper. That's code I know pretty well, and I couldn't think of what could cause this. The new line is indeed intriguing. Nothing in this repo generates "we can conclude that", so I'm flying real blind. Do you have more than one code path that could generate that? Do you have code paths that generate or remove newlines? Are you joining the output from a vec of strings; maybe there's an empty string at the end? The changed version requirements make me suspicious of simplify. The requirements may or may not be the same depending on what versions are available. Are there versions below 1 or higher than 3? I know you made improvements to simplify, but I'm reasonably sure that this repo's implementation is deterministic as long as the inputs are. Could it be the iterator you provide to simplify; perhaps it's not sorted, or it's sorted differently between builds? Now that we know it's not the resolver proper, perhaps add logging to every function call involved in error reporting? Perhaps even log every argument to important functions? This sure is a tricky one.
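The "empty string at the end of a joined vec" suspicion is easy to demonstrate, and it matches a trailing-newline-only diff like the ␊ change in the snapshot above. This is an illustrative std-only sketch, not the actual reporting code:

```rust
fn main() {
    // Whether a trailing empty element sneaks into the Vec decides
    // whether the joined message ends with a newline.
    let with_trailing_empty = vec!["albatross<2.0.0", ""].join("\n");
    let without = vec!["albatross<2.0.0"].join("\n");
    assert_eq!(with_trailing_empty, "albatross<2.0.0\n");
    assert_eq!(without, "albatross<2.0.0");
    // The two messages differ only by the final newline.
    assert_ne!(with_trailing_empty, without);
}
```

If the vec of lines is built from something with unstable order or a conditional final element, the rendered error can gain or lose that trailing newline from run to run even when every substantive line is identical.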
This bug was internal to our resolver. |
This PR is not meant to be merged.
This PR hacks in a hash map/set wrapper that forcefully shuffles
iteration order with the goal of provoking bugs that result from
depending on a particular iteration order. Namely, hash maps/sets
in Rust's standard library don't have a defined iteration order.
With that said, I think that in practice, their order is somewhat
stable/predictable. But in some code we're working on at Astral (that
isn't public just yet), we've noticed some non-deterministic test
failures that we believe might be coming from pubgrub. This PR is
a test of that theory by forcing the issue and more aggressively
shuffling iteration order.
With this PR, I have been able to find one consistently failing test:
prop_same_on_repeated_runs. I'm hopeful it will show up in CI too. But
the failure output looks like this for me locally:
I don't know much about pubgrub internals, so I haven't really been
able to investigate the root cause of this failure. What do you think?