
duration_based_chunks algorithm produces empty groups, causing pytest to fail when collecting tests #26

Open
alexforencich opened this issue Jun 18, 2021 · 28 comments


@alexforencich

Something in the most recent set of changes has broken the splitting. Previously I have been running with --splits 20 and all of the groups had at least one test. Now, at least the last two groups are empty, which causes pytest to exit with code 5 (no tests collected) and GitHub Actions to kill all of the tests. Not sure what the precise cause is at the moment.

@alexforencich
Author

Yep, there is definitely something broken in duration_based_chunks: it is producing empty groups. On the flip side, it seems like a fairly weak algorithm anyway, since its result is highly dependent on the test ordering. The least_duration algorithm seems much more sensible, and I don't think it can produce empty groups so long as there are at least as many tests as there are groups.

@alexforencich alexforencich changed the title Getting empty groups causing pytest to fail duration_based_chunks algorithm produces empty groups, causing pytest to fail when collecting tests Jun 19, 2021
@jerry-git
Owner

Thanks for reporting, @alexforencich! Can you pinpoint which version started causing this, or can you share the repo in which this is happening?

@alexforencich
Author

I have not attempted to figure out which version caused this, but presumably this was introduced when the duration_based_chunks algorithm was added.

Repo in question: https://github.com/alexforencich/verilog-pcie

Commit that caused the workflow run to fail: alexforencich/verilog-pcie@ccc44d7

Workflow run: https://github.com/alexforencich/verilog-pcie/runs/2861569844?check_suite_focus=true

@jerry-git
Owner

Thanks @alexforencich! FYI @mbkroese

@mbkroese

@alexforencich thanks for reporting. Is it possible to run your test suite with version 0.2.3?
That is the version before the latest one, and it would help us pin down the problem.

@alexforencich
Author

alexforencich commented Jun 21, 2021

Yes, 0.2.3 is also affected.

@alexforencich
Author

Did some more testing: 0.1.6 is OK, but 0.2.0 is broken.

@mbkroese

mbkroese commented Jun 22, 2021

This is a small example that reproduces the issue:

from collections import namedtuple

from pytest_split import algorithms

item = namedtuple('item', 'nodeid')

durations = {}
durations['a'] = 2313.7016449670773
durations['b'] = 46.880724348986405
durations['c'] = 2196.7077018650016
durations['d'] = 379.9717799640057
durations['e'] = 1476.3481151770247
durations['f'] = 979.7326026459923
durations['g'] = 1876.5443489580794
durations['h'] = 1.3951316330058035

items = [item(x) for x in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']]

groups = algorithms.duration_based_chunks(7, items, durations)
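
Checking the result directly shows the problem. Assuming (as it did around 0.2.x) that the function returns one list of items per group, at least one of the seven groups comes back empty on the affected versions:

empty_groups = [idx for idx, group in enumerate(groups) if not group]
print(empty_groups)  # non-empty on affected versions, i.e. some groups got no tests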

@alexforencich
Author

The problem is that the algorithm is extremely greedy. Seems to me this case is likely to happen whenever a bunch of short tests are processed last. One crude fix would be to move on to the next group whenever there are fewer than splits - group_idx items left to process (sketched below); that would ensure that all of the groups have at least one test, but the algorithm is still extremely greedy, so you would still end up with a very lopsided split.
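
A minimal sketch of that guard (purely illustrative, not the plugin's code; it assumes items expose a nodeid and durations is keyed by nodeid, as in the reproducer above):

def duration_chunks_with_guard(splits, items, durations):
    # Greedy fill, but switch to one-item-per-group once the remaining
    # items are only just enough for the remaining groups.
    avg = sum(durations.get(i.nodeid, 0) for i in items) / splits
    groups = [[] for _ in range(splits)]
    group_idx = 0
    group_time = 0.0
    for pos, it in enumerate(items):
        items_left = len(items) - pos       # including this item
        groups_left = splits - group_idx    # including the current group
        over_budget = group_time >= avg
        must_move_on = items_left <= groups_left - 1
        if groups[group_idx] and group_idx < splits - 1 and (over_budget or must_move_on):
            group_idx += 1
            group_time = 0.0
        groups[group_idx].append(it)
        group_time += durations.get(it.nodeid, 0)
    return groups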

@jerry-git
Owner

Thanks for the investigation and sorry for pinging @mbkroese 😅 If it happened between 0.1.6 and 0.2.0 then it's most likely something in #14, FYI @sondrelg.

@jerry-git
Owner

Yeah, indeed, it sounds like we should have some mechanism which ensures that there's at least one test per group (and errors out if splits > len(tests)).
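
Something along these lines could run before the splitting algorithm is invoked (a rough sketch of the idea, not an existing pytest-split API):

def validate_group_count(splits, items):
    # Fail fast instead of letting an algorithm hand pytest an empty group.
    if splits < 1:
        raise ValueError("number of splits must be at least 1")
    if splits > len(items):
        raise ValueError(
            f"cannot split {len(items)} tests into {splits} groups: "
            "every group must contain at least one test"
        )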

@sondrelg
Contributor

Whoops, that's unfortunate. I could probably add a test and make sure it works again sometime this week if needed 🙂

Just for the sake of my own curiosity, @alexforencich, how small is your test sample size? I think I might have changed the behavior from stopping before a group exceeds the time limit to stopping after it exceeds it. My thinking was that front-loading tests a little bit makes sense 99% of the time, because at least in GitHub Actions, containers seem to spin up sequentially; i.e., group 1 starts considerably earlier than group 20 (at least for me).

Then again, I guess if there are a very small number of tests this could create some weird behavior. Is that what's happening here?

@alexforencich
Author

The other algorithm (least_duration) is fine, as it adds each test to the group with the smallest accumulated duration (see the sketch below). This means that if you have N splits, the first N tests get distributed across those splits. It is also a very simple greedy algorithm, but at least it's guaranteed not to produce any empty groups so long as you have at least as many tests as you have splits. I suppose what you could do if you have fewer tests than splits is duplicate one of the tests in all of the remaining splits so pytest doesn't crash.
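
As an illustration of that strategy (not the plugin's actual code), a min-heap keyed on accumulated duration captures the idea; the first N tests land one per group, and every later test goes to whichever group is currently lightest:

import heapq

def least_duration_sketch(splits, items, durations):
    # (accumulated duration, group index); the heap keeps the lightest group on top.
    heap = [(0.0, idx) for idx in range(splits)]
    heapq.heapify(heap)
    groups = [[] for _ in range(splits)]
    for it in items:
        total, idx = heapq.heappop(heap)  # lightest group so far
        groups[idx].append(it)
        heapq.heappush(heap, (total + durations.get(it.nodeid, 0), idx))
    return groups

Sorting the items longest-first before the loop would make the split more balanced, but even without it no group can stay empty as long as len(items) >= splits.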

I think there are two parts to this: 1) fix duration_based_chunks, and 2) add some safeguard somewhere else to ensure that it's not possible to ever produce any empty groups so that pytest doesn't crash no matter what the splitting algorithm does (unless, of course, there are no tests at all).

@alexforencich
Author

I want to say it's something like 170 tests split into 20 groups. But the spread of run-times is quite large: some of them run in a few seconds, some take 10 minutes. And that spread is probably what's screwing up the greedy algorithm that places tests into each group until it hits the average, then moves on to the next group. I think if you make a test case of, say, 10 tests of 1 second each plus 1 test of 100 seconds, and ask for 10 groups, you'll get 1 group with all of the tests and 9 empty groups.
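
With the same kind of reproducer as above, that case looks roughly like this (again assuming the algorithm returns one list of items per group):

from collections import namedtuple

from pytest_split import algorithms

item = namedtuple('item', 'nodeid')

durations = {f'test_{i}': 1.0 for i in range(10)}  # ten 1-second tests
durations['test_slow'] = 100.0                     # one 100-second test
items = [item(nodeid) for nodeid in durations]

groups = algorithms.duration_based_chunks(10, items, durations)
print([len(g) for g in groups])  # on affected versions, most of the 10 groups are empty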

@mbkroese

"sorry for pinging @mbkroese"

happy to help :)

@sondrelg
Contributor

Opened a quick PR. Please take a look when you have time 🙂

@jgopel

jgopel commented Jan 31, 2023

Any update on this? I seem to be encountering this issue in 0.8.0.

@sondrelg
Contributor

This is the last thing to happen AFAIK

@jgopel

jgopel commented Jan 31, 2023

Seems like the PR is abandoned - is there interest in reviving it?

@alexforencich
Author

Is using least_duration not an option for you?

@jgopel

jgopel commented Jan 31, 2023

The issue I have is that the tests were designed such that there are optimizations around data loading, etc. that are set up by the first test of a suite to run and then consumed by all tests that follow. Using least_duration ends up with inconsistency in .test_durations, because whether a given test is the first one in its suite to run in a particular group is not deterministic. I also get better performance from grouping similar tests together, for the same reason.

@alexforencich
Author

It almost sounds like you want to do something similar to what's detailed in the readme for nbval: shuffling things around to keep groups together and preserve the ordering. So perhaps something even more restrictive than what the default algorithm is doing, since the default one will, I think, split things up.

@jgopel

jgopel commented Feb 1, 2023

I had missed that part of the readme. That looks like pretty much exactly what I'd like to do in the ideal case. Is there a mode for that? Or is it automatically detected?

@alexforencich
Author

I suspect that it is linked to using nbval specifically. But it seems like it might be a useful mode to have in general.

Perhaps there is a general way to do this: instead of dealing with individual tests, handle groups of tests, with tests grouped according to some parameter, perhaps the file they live in. The current behavior would then just be the special case of putting each test in its own group. A rough sketch of what that could look like is below.
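
A hypothetical sketch (not an existing pytest-split option): derive a key from each test's nodeid, treat each file as one pseudo-item whose duration is the sum of its tests, and split the files instead of the individual tests:

from collections import defaultdict

def split_by_file(splits, items, durations):
    # Keep every test from a given file in the same group.
    by_file = defaultdict(list)
    for it in items:
        by_file[it.nodeid.split("::", 1)[0]].append(it)

    file_durations = {
        path: sum(durations.get(t.nodeid, 0) for t in tests)
        for path, tests in by_file.items()
    }

    # Assign files to the currently lightest group, longest files first
    # (the same idea as least_duration, just at file granularity).
    groups = [[] for _ in range(splits)]
    totals = [0.0] * splits
    for path in sorted(file_durations, key=file_durations.get, reverse=True):
        idx = totals.index(min(totals))
        groups[idx].extend(by_file[path])
        totals[idx] += file_durations[path]
    return groups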

@jgopel

jgopel commented Feb 1, 2023

Are you imagining that the user could provide a predicate for custom splitting? How would that be specified and passed in?

@alexforencich
Author

Not sure, I'm just tossing out ideas. But a simple one could be an option to keep all of the tests defined in each file together.

@jgopel

jgopel commented Feb 1, 2023

That makes sense to me as a starting point. Would there be an appetite to merge that if I were to do the work?

@Prodian0013

I ran into this same issue with duration_based_chunks: it tried to run empty groups ("0 selected"), so I had to switch to using least_duration.
