Get rid of test front-loading in the duration_based_chunks algo #27

Closed
wants to merge 2 commits

Conversation

sondrelg
Contributor

Should hopefully resolve #26

@alexforencich

Nope, still fails. And if I change the test durations in that test case to all 1 except the last one which gets 100, all of the tests end up in the first group and I get 6 empty groups.

@sondrelg
Contributor Author

How would you suggest changing the implementation @alexforencich? I'm not seeing the issue 🙂

@alexforencich

So I did not run the test correctly (have to actually install the package before you can run pytest). After fixing that problem, the test case in the PR passes. However, it fails if I change it to:

    durations = {}
    durations["a"] = 1
    durations["b"] = 1
    durations["c"] = 1
    durations["d"] = 1
    durations["e"] = 1
    durations["f"] = 1
    durations["g"] = 1
    durations["h"] = 100

In this case, only the first two groups are utilized and the rest are empty.

@alexforencich

It seems to me like the goal of this algorithm is to preserve the test ordering. So perhaps a better method would be to compute the cumulative sum into a new array, and then partition that on total_duration*index/slices. Then adjust the indices so they each correspond to at least one test.
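
For illustration, a rough sketch of that cumulative-sum idea might look like this (a hypothetical helper, not the plugin's code; the names test_ids, durations, and splits are assumptions):

    from itertools import accumulate

    def chunk_by_cumulative_sum(test_ids, durations, splits):
        """Split tests into `splits` ordered groups using cumulative durations."""
        cumsum = list(accumulate(durations[t] for t in test_ids))
        total = cumsum[-1]
        # Find the first index whose cumulative duration reaches each
        # boundary total * i / splits.
        boundaries = []
        idx = 0
        for i in range(1, splits):
            target = total * i / splits
            while idx < len(cumsum) and cumsum[idx] < target:
                idx += 1
            boundaries.append(idx)
        # Nudge the indices so every group gets at least one test
        # (assumes there are at least as many tests as groups).
        for j in range(len(boundaries)):
            prev = boundaries[j - 1] if j else 0
            boundaries[j] = max(boundaries[j], prev + 1)
        for j in reversed(range(len(boundaries))):
            nxt = boundaries[j + 1] if j + 1 < len(boundaries) else len(test_ids)
            boundaries[j] = min(boundaries[j], nxt - 1)
        edges = [0, *boundaries, len(test_ids)]
        return [test_ids[edges[k]:edges[k + 1]] for k in range(splits)]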

Incidentally, I think you can get slightly better performance out of least_duration by sorting the tests by duration first so they are processed from longest to shortest.

However, the best option might be to do something like this: assign each test an ID number based on the original order, then sort based on duration, use the algorithm in least_duration to group the tests, then sort each group based on the ID. This will preserve the test order, but produce more optimal grouping, and there will be no issue with empty groups unless there are more groups than tests (and in that case, what you can do is enable some random test in all of the otherwise-empty groups).
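
A minimal sketch of that ID-based variant (again hypothetical; it borrows the "put the next-longest test into the currently-lightest group" idea rather than the plugin's actual least_duration code):

    import heapq

    def balanced_chunks_preserving_order(test_ids, durations, splits):
        """Balance groups by duration, then restore the original test order."""
        # Tag each test with its original position, then process longest first.
        indexed = sorted(enumerate(test_ids),
                         key=lambda pair: durations[pair[1]], reverse=True)
        # Min-heap of (accumulated duration, group index): always fill the
        # currently-lightest group.
        heap = [(0.0, g) for g in range(splits)]
        heapq.heapify(heap)
        groups = [[] for _ in range(splits)]
        for original_index, test in indexed:
            group_time, g = heapq.heappop(heap)
            groups[g].append((original_index, test))
            heapq.heappush(heap, (group_time + durations[test], g))
        # Sort each group by the original ID so relative order is preserved.
        return [[test for _, test in sorted(group)] for group in groups]

As noted above, a group can only end up empty here if there are more groups than tests.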

@sondrelg
Contributor Author

So I did not run the test correctly (have to actually install the package before you can run pytest). After fixing that problem, the test case in the PR passes. However, it fails if I change it to:

    durations = {}
    durations["a"] = 1
    durations["b"] = 1
    durations["c"] = 1
    durations["d"] = 1
    durations["e"] = 1
    durations["f"] = 1
    durations["g"] = 1
    durations["h"] = 100

In this case, only the first two groups are utilized and the rest are empty.

That almost seems like a feature, not a bug; but more importantly, is that something that was changed in #14? We discussed fixing the change in behavior and the possibility of guaranteeing that every group has a test in #26, but to be clear - this PR just tries to fix the first point.

Would the test above behave differently in previous versions?

@alexforencich

Well, if it wasn't causing GitHub Actions to crash, I wouldn't care as much. But it is, so it's a bug. Looks like yes, the splitting algorithm was rewritten in #14. Looks like the previous version was maybe duplicating tests in this case (not explicitly, but due to a side effect of how it was written), so there weren't any empty groups, but this wasn't noticed since it didn't cause pytest to exit prematurely.

@mbkroese

Incidentally, I think you can get slightly better performance out of least_duration by sorting the tests by duration first so they are processed from longest to shortest.

@alexforencich you're right, that would be better. I just stuck with the current solution because that preserved relative order, which I thought was important for some. I'm happy to add this at a later stage.

@sondrelg
Contributor Author

Out of curiosity, why do people care about ordering? I really only want my test groups to be as even as possible, because the longest group sets the total time of my pipeline.

@mbkroese

Just imagine you work for a company that has a large suite of tests and the corresponding technical debt. The tests already have order dependency (by accident) but it's too big a problem to fix all at once.

@sondrelg
Contributor Author

Yep lack of test isolation is the only thing I can think of 🙂

@alexforencich

Well, you still might have ordering dependency issues as you're not going to run all of the tests.

As for relative order, you can restore it by assigning monotonically increasing IDs before doing the partitioning, then sorting each group by this ID before running the tests.

@sondrelg
Contributor Author

Updated the PR now @alexforencich @jerry-git

The duration_based_chunks algo should now have the following properties:

  • Guarantees that every group is assigned a test (as long as test count > splits)
  • Maintains order
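
For illustration, an order-preserving greedy chunker with those two properties could look roughly like this (a sketch only, not necessarily the code in this PR; names are assumptions):

    def duration_based_chunks_sketch(test_ids, durations, splits):
        """Order-preserving greedy chunking; no group is left empty as long
        as there are at least `splits` tests."""
        target = sum(durations[t] for t in test_ids) / splits
        groups = [[] for _ in range(splits)]
        group_times = [0.0] * splits
        g = 0
        for i, test in enumerate(test_ids):
            remaining_tests = len(test_ids) - i
            remaining_groups = splits - g - 1
            # Leave room: if only as many tests remain as there are later
            # groups, each remaining test must start a new group.
            must_advance = bool(groups[g]) and remaining_tests <= remaining_groups
            would_overflow = bool(groups[g]) and group_times[g] + durations[test] > target
            if (must_advance or would_overflow) and g < splits - 1:
                g += 1
            groups[g].append(test)
            group_times[g] += durations[test]
        return groups

For example, with the seven 1-second tests plus one 100-second test from earlier in the thread and 8 splits, every group ends up with exactly one test.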

@alexforencich

Definitely looks a lot better. Still greedy I think, but not sure if there is a nice way of improving that without making it a lot more complicated. At any rate, if you want the most even split, least_duration is probably a better choice.

Next step is to add some "safety" code to ensure that there is at least one test enabled in each group, even if there are fewer tests than groups, by enabling a test in empty groups. But that's not for this PR.

@sondrelg
Contributor Author

sondrelg commented Jul 9, 2021

@jerry-git this is ready to review if you get a chance 🙂

@jerry-git
Owner

Ahoy! Sorry this took a while, I was on vacation mode 🍹 I'll have a look.

tests/test_algorithms.py (outdated)
durations["g"] = 1
durations["h"] = 1

items = [item(x) for x in ["a", "b", "c", "d", "e", "f", "g", "h"]]
Owner

Would it be the same if the loop were over sorted(durations.keys())?

Contributor Author

What do you mean, sorry?

I think he's saying you can do items = sorted(durations.keys()) instead of duplicating all of the items.

Owner

yeah I'm basically asking if this would be the same:

items = [item(x) for x in sorted(durations.keys())]

IMO it would be better for readability, as then it's explicit that there's one item for each entry in durations

Contributor Author

I just copied in the example test from here and made sure we had one example that was extremely front-loaded and one that was extremely back-loaded. I can make it less explicit if you prefer.

tests/test_algorithms.py (outdated)
Comment on lines +74 to +77
(4, 1, "duration_based_chunks", ["test_1", "test_2", "test_3"]),
(4, 2, "duration_based_chunks", ["test_4", "test_5"]),
(4, 3, "duration_based_chunks", ["test_6"]),
(4, 4, "duration_based_chunks", ["test_7", "test_8", "test_9", "test_10"]),
Owner

Doesn't this mean that the expected durations for each group will be:

  • 1: 3s (1+1+1)
  • 2: 2s (1+1)
  • 3: 2s (2)
  • 4: 8s (2+2+2+2)

If so, it doesn't feel right 🤔

Contributor Author

Tests 1-5 are 1 second tests, while 6-10 are 2 second tests, so I believe the timings are

  • 3 seconds
  • 3 seconds
  • 2 seconds
  • 8 seconds

And yes this is weird, but it's also what you should expect if you want to stop adding tests before going over the threshold of 15/4 = 3.75 seconds while keeping absolute order 🤷

It's a very simple greedy algorithm; it's pretty much unavoidable that you're going to get strange far-from-optimal results like this. The least_durations algorithm is much better and IMO should be the default and this algorithm should simply be removed. But this algorithm does have the benefit of maintaining the sequence of tests, which I suppose could potentially be beneficial in some cases if there are ordering dependencies between tests.

alexforencich commented Jul 21, 2021

Actually, here's something to consider: exceed the threshold, but keep track of a running total overage and use that when determining whether you're over the threshold. (If you never exceed the threshold, then you're basically guaranteed to end up with the last group being larger than the rest.) There are probably a few other heuristics you could apply here to try to improve things a little bit. For instance, something like this: ignore the overage if the slot is empty and unconditionally assign; if it isn't empty, check whether the resulting overage is less than half of the time required for that test; if so, pack it, otherwise bump it to the next group.
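
A rough sketch of that overage idea (illustrative only, not code from this PR; it only covers the packing heuristic and would still need the "leave room for later groups" guard to rule out empty groups):

    def chunk_with_carried_overage(test_ids, durations, splits):
        """Order-preserving chunking that carries each group's overshoot
        (or surplus) forward into the next group's budget."""
        target = sum(durations[t] for t in test_ids) / splits
        groups = [[] for _ in range(splits)]
        group_time = 0.0
        overage = 0.0  # cumulative deviation of earlier groups from the ideal schedule
        g = 0
        for test in test_ids:
            d = durations[test]
            budget = target - overage
            overshoot = group_time + d - budget
            # Empty groups accept unconditionally; otherwise pack the test only
            # if the overshoot stays below half of the test's own duration.
            if groups[g] and overshoot > d / 2 and g < splits - 1:
                overage = group_time - budget  # carry the deviation forward
                g += 1
                group_time = 0.0
            groups[g].append(test)
            group_time += d
        return groups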

Owner

The least_durations algorithm is much better and IMO should be the default and this algorithm should simply be removed

I have a use case which requires this plugin to support maintaining the order of the tests, which is why we need to keep duration_based_chunks or something similar. However, I can consider changing the default but that'd probably bite some other users of this plugin as well so need to consider that carefully 🙂

Contributor Author

I'll close this for now then 👍

Well, without these changes, is duration_based_chunks still producing empty groups? Or did that get fixed in a separate PR?

Contributor Author

Yes, probably. I noticed there are a few conflicts, so perhaps there's been another fix merged, but that's what this PR was meant to fix.

OK, well my point was that even if the splitting isn't as perfect as it could be, at least this fix doesn't result in empty groups, so it should be merged and then possibly improved, instead of being closed.

Contributor Author

Sure thing. I've reopened it - feel free to make any additional changes you wish (cc @jerry-git) and merge. I'm on holiday now myself so likely won't be able to apply any code changes for a while.

Comment on lines +94 to +96
    groups = duration_based_chunks(8, items, durations)

    for i in range(7):
Owner

Why 8 above and range(7) here in the loop below? Could we somehow get rid of the magic numbers and use e.g. variables whose names would help the reader understand what's going on?

Maybe use for g in groups: instead?
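
Something like this, combining both suggestions (a sketch; the exact names are only illustrative):

    splits = 8
    groups = duration_based_chunks(splits, items, durations)

    assert len(groups) == splits
    for group in groups:
        assert group.selected != []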

Comment on lines +108 to +109
    for i in range(7):
        assert groups[i].selected != []
Owner

Is this the same as assert all(group.selected for group in groups)?

That looks like it would probably work; maybe just add a comment to indicate that we're making sure there aren't any empty groups. Or maybe explicitly check that the size of each group is nonzero.

Owner

we're making sure there aren't any empty groups ... check that the size of each group is nonzero

IMO all(group.selected for group in groups) does both and is explicit enough so no comments or further explanations would be needed. Just my opinion ofc 🙂

@jerry-git
Owner

Closing as stale, feel free to reopen 🙂

jerry-git closed this Apr 12, 2023
Successfully merging this pull request may close these issues:
duration_based_chunks algorithm produces empty groups, causing pytest to fail when collecting tests