-
-
Notifications
You must be signed in to change notification settings - Fork 3k
Switch to a more dynamic SCC processing logic #20053
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
Hm, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's exciting to see the foundation for parallel checking moving forward!
Not a full review -- did a quick pass only. I wonder if the missing errors might indicate that some modules aren't processed -- seems like worth double checking that no modules aren't silently ignored, just in case.
mypy/build.py
Outdated
if fresh: | ||
done = fresh | ||
else: | ||
done, processing = manager.get_done(graph) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment explaining that this processes some SCCs / waits for some SCCs to be done? When reading this for the first time, I thought this would just get whatever is already done right now, but it actually does some work. Or maybe the method name could be more descriptive.
Maybe rename processing
to has_more
or similar -- at first I was confused by it.
If we'd be doing parallel processing, would this perform a busy loop while waiting for things to happen? Or would get_done
wait until something has finished processing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we'd be doing parallel processing, would this perform a busy loop while waiting for things to happen? Or would
get_done
wait until something has finished processing?
I was thinking about something like select()
waiting for worker sockets. So this will block until some stale SCC(s) get finished (since we know there is nothing else we can progress). IIUC select()
should not busy loop (and also I guess there is a Windows equivalent for this if we would use named pipes).
Btw I think my initial implementation will be Linux/Mac only, I was reading more about it, and the situation there is a bit of a mess. There are already a bunch of other things to worry, so I guess it is OK to add Windows support later.
Yeah, I was going to clone that repo and do this. |
OK, so I checked and I think this is a bug in When I check:
IIUC the problem is that we are supposed to install
If you look at the
so |
This comment has been minimized.
This comment has been minimized.
@JukkaL it would be great if you can take a look at this rather sooner than later. I have two other PRs stacked on top of this that make cache meta files ~4x smaller. |
@JukkaL also ping on this one. I f there are no more comments, I will be merging this later today. |
I wanted to try test this on the big codebase at work, but I can test it after merging as well. I'll review this now (feel free to merge if I haven't left a review in the next few hours). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found two potential performance issues, otherwise looks good.
mypy/build.py
Outdated
if fresh: | ||
manager.trace(f"Queuing {fresh_msg} SCC ({scc_str})") | ||
manager.trace(f"Found {fresh_msg} SCC ({scc_str})") | ||
scc = order_ascc_ex(graph, ascc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be pretty slow for large SCCs. If there are no errors, we can skip the order_ascc_ex
call (which is most of the time, esp. for third-party deps). I.e. add a check if any module has errors, and only perform the ordering op if that's true. I think we'd only need to do the ordering if more than one module has errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point.
if missing_sccs: | ||
# Load missing SCCs from cache. | ||
fresh_sccs_to_load = [ | ||
manager.scc_by_id[sid] for sid in manager.top_order if sid in missing_sccs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is linear (does a for loop over all SCCs). I think this means that the incremental logic is O(n**2). I'm not sure if this will be a real issue, but at least it would be good to have a TODO comment about it. We could make this faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess an alternative would be making top_order
a dict, and doing something like sorted(missing_sccs, key=lambda scc: manager.top_order[scc])
, or do you have something else in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, I added a TODO for now, I can fix it in a follow up PR if needed.
Diff from mypy_primer, showing the effect of this PR on open source code: scipy-stubs (https://github.com/scipy/scipy-stubs)
- tests/datasets/test_utils.pyi:6: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:6: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:9: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:9: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:9: error: Module has no attribute "ascent" [attr-defined]
- tests/datasets/test_utils.pyi:10: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:10: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:10: error: Module has no attribute "ascent" [attr-defined]
- tests/datasets/test_utils.pyi:11: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:11: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:11: error: Module has no attribute "ascent" [attr-defined]
- tests/datasets/test_utils.pyi:14: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:14: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:14: error: Module has no attribute "electrocardiogram" [attr-defined]
- tests/datasets/test_utils.pyi:15: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:15: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:15: error: Module has no attribute "electrocardiogram" [attr-defined]
- tests/datasets/test_utils.pyi:16: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:16: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:16: error: Module has no attribute "electrocardiogram" [attr-defined]
- tests/datasets/test_utils.pyi:19: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:19: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:19: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_utils.pyi:20: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:20: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:20: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_utils.pyi:21: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:21: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:21: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_utils.pyi:24: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:24: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:24: error: Module has no attribute "ascent" [attr-defined]
- tests/datasets/test_utils.pyi:24: error: Module has no attribute "electrocardiogram" [attr-defined]
- tests/datasets/test_utils.pyi:24: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_utils.pyi:26: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_utils.pyi:26: error: Module has no attribute "clear_cache" [attr-defined]
- tests/datasets/test_utils.pyi:26: error: Module has no attribute "ascent" [attr-defined]
- tests/datasets/test_utils.pyi:26: error: Module has no attribute "electrocardiogram" [attr-defined]
- tests/datasets/test_utils.pyi:26: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_fetchers.pyi:7: error: Expression is of type "Any", not "ndarray[tuple[int, int], dtype[unsignedinteger[_8Bit]]]" [assert-type]
- tests/datasets/test_fetchers.pyi:7: error: Module has no attribute "ascent" [attr-defined]
- tests/datasets/test_fetchers.pyi:9: error: Expression is of type "Any", not "ndarray[tuple[int], dtype[float64]]" [assert-type]
- tests/datasets/test_fetchers.pyi:9: error: Module has no attribute "electrocardiogram" [attr-defined]
- tests/datasets/test_fetchers.pyi:11: error: Expression is of type "Any", not "ndarray[tuple[int, int, int], dtype[unsignedinteger[_8Bit]]]" [assert-type]
- tests/datasets/test_fetchers.pyi:11: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_fetchers.pyi:12: error: Expression is of type "Any", not "ndarray[tuple[int, int, int], dtype[unsignedinteger[_8Bit]]]" [assert-type]
- tests/datasets/test_fetchers.pyi:12: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_fetchers.pyi:13: error: Expression is of type "Any", not "ndarray[tuple[int, int], dtype[unsignedinteger[_8Bit]]]" [assert-type]
- tests/datasets/test_fetchers.pyi:13: error: Module has no attribute "face" [attr-defined]
- tests/datasets/test_download_all.pyi:6: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_download_all.pyi:6: error: Module has no attribute "download_all" [attr-defined]
- tests/datasets/test_download_all.pyi:7: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_download_all.pyi:7: error: Module has no attribute "download_all" [attr-defined]
- tests/datasets/test_download_all.pyi:8: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_download_all.pyi:8: error: Module has no attribute "download_all" [attr-defined]
- tests/datasets/test_download_all.pyi:9: error: Expression is of type "Any", not "None" [assert-type]
- tests/datasets/test_download_all.pyi:9: error: Module has no attribute "download_all" [attr-defined]
|
Ref #933
Instead of processing SCCs layer by layer, we will now process an SCC as soon as it is ready. This logic is easier to adapt for parallel processing, and should get us more benefit from parallelization (as more SCCs can be processed in parallel). I tried to make order with single worker stable and very similar (or maybe even identical) to the current order.
Note I already add some methods to the build manager to emulate parallel processing, but they are not parallel yet.