-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dask] Add type hints in Dask package #3866
Conversation
@ffineis if you have time, could you take a look at this diff too? Since you've recently worked with the internals of the Dask module |
Yep on it tonight |
python-package/lightgbm/dask.py
Outdated
y: _DaskCollection, | ||
sample_weight: Optional[_DaskCollection] = None, | ||
init_score: Optional[_DaskCollection] = None, | ||
group: Optional[_1DArrayLike] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this might be group: Optional[_DaskCollection]
, group array is still distributed at this point. Since group
can be so much smaller than both X
and y
, I think supporting a locally-defined group
input list or array is a noble cause, but this would need to be its own PR in which it was defined which parts of group
get sent where to accompany the X
and y
parts.
If this comment stands, then _1DArrayLike
can be removed at the top of this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooooo ok! I misunderstood what was happening. Awesome, I'll look again and then change that hint.
I'll also write up a feature request for this. I think it's something that's non-breaking and additive, that could be done after 3.2.0. But like you said, since group
is often fairly small, I think it could be a nice thing to be able to specify it as a list or lil numpy array.
python-package/lightgbm/dask.py
Outdated
params: Dict[str, Any], | ||
model_factory: Type[LGBMModel], | ||
sample_weight: Optional[_DaskCollection] = None, | ||
group: Optional[_1DArrayLike] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
group: Optional[_DaskCollection] = None,
Would be great if the docs also reflected that sample_weight
and group
are, up to this point, still distributed vectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update the docs in this PR too, I think that's directly related to this scope. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking through this, I realized that with the way we do doc inheritance, it will actually be a bit tricky to override the docs for group
.
I wrote up my thoughts in #3871, but I won't update the docs (except those on internal functions) in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the hint in https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv
python-package/lightgbm/dask.py
Outdated
model = model_factory(**self.get_params()) | ||
self._copy_extra_params(self, model) | ||
return model | ||
|
||
@staticmethod | ||
def _copy_extra_params(source, dest): | ||
def _copy_extra_params(source: "_DaskLGBMModel", dest: "_DaskLGBMModel") -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that dest
could refer to both an LGBMModel
or a _DaskLGBMModel
, referring to line to 473 (in _to_local
).
def _copy_extra_params(source: "_DaskLGBMModel", dest: Union["_DaskLGBMModel", LGBMModel]) -> None:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shoot, you're totally right. good eye
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in bfd9dc0
python-package/lightgbm/dask.py
Outdated
@@ -102,7 +107,7 @@ def _find_ports_for_workers(client: Client, worker_addresses: Iterable[str], loc | |||
return worker_ip_to_port | |||
|
|||
|
|||
def _concat(seq): | |||
def _concat(seq: Iterable[_DaskPart]) -> _DaskPart: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that technically any Iterable
(tuple/list/np.array/...) could work for seq
here, but what is the benefit to using Iterable
and not List
? I guess you would never use the List
type hint for an input parameter (i.e. always prefer Iterable
over List
) unless the function called some list
-specific method like .sort
or mutability...?
I only raise this because I came across this S/O comment: https://stackoverflow.com/a/52827511/14480058, not sure what to think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't have any special reason for choosing Iterable
instead of List
. I was just looking at the code and didn't see any list-specific stuff so I thought it made sense to use the more general thing.
But since this is a totally-internally function, where we control the input, and since Iterable
does weird stuff as that S/O post points out, I'll change this to List[]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in bfd9dc0
python-package/lightgbm/dask.py
Outdated
dtype=np.float32, **kwargs): | ||
def _predict( | ||
model: LGBMModel, | ||
data: _DaskCollection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The corresponding docstr for the data
parameter reads dask array of shape = [n_samples, n_features]
- consider changing it to data : dask array or dask DataFrame of shape = [n_samples, n_features]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup will do in this PR, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in bfd9dc0
@jameslamb |
oh sure, no problem. done in b20ac37 |
The type hints are currently shown in method signatures on the docs site: I think this makes the method signature really hard to read. I'm going to try disabling this by setting |
Don't forget to bump minimum required Sphinx version then https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#confval-autodoc_typehints Line 77 in b4b1b75
|
Oh thanks! Didn't know about that. I though the version was only managed in LightGBM/docs/requirements_rtd.txt Line 2 in b4b1b75
I'll update that. But happy to say the builds on RTD did pass, and the type hints were successfully hidden: https://lightgbm.readthedocs.io/en/docs-jlamb/pythonapi/lightgbm.DaskLGBMClassifier.html#lightgbm.DaskLGBMClassifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure my comments will be useful, but anyway: 🙂
Co-authored-by: Nikita Titov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except two comment below and one unresolved question from previous review: #3866 (comment).
Thanks a lot!
Co-authored-by: Nikita Titov <[email protected]>
@jameslamb I'm not re-running the failed job to let you collect all details from log you may want to get familiar. According to #3829 (comment) I'm going to increase |
Thanks. I moved the discussion over to #3829 and re-opened it. I pushed an empty commit here to re-run CI, since I don't have rights to re-run jobs in Azure. |
Ok since #3866 (comment) has been moved to #3881 , I think this PR can be merged. Thanks for the reviews! |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
As part of #3756 , this PR proposes type hints for the Dask package.
This PR attempts to document the current state of the package, and isn't a proposal about the final state for release 3.2.0. So, for example, right now
X
andy
in.predict()
have to be Dask collections, but I could see it being useful to allow other data types in the future. I'd like to defer that discussion to other issues / PRs.Notes for Reviewers
docs/jlamb
branch because that branch already has readthedocs build enabled, and I want to see what effect, if any, this change has on the documentationisinstance()
checks on the types of arguments tofit()
andpredict()
that should be Dask collections, as mentioned in Support DataTable in Dask #3830 (comment), but I'm proposing that that be a separate PR**kwargs
is notDict[str, Any]
. I've learned that the hint for*
and**
stuff is expected to be the set of allowable types for the VALUES. https://www.python.org/dev/peps/pep-0484/#arbitrary-argument-lists-and-default-argument-valuesType[]
in a hint says "this should be a class, not an instance". Sodf_class: Type[pd.DataFrame]
means you passpd.DataFrame
or a sub-class, wheredf: pd.DataFrame
means a constructed data frame is expected.I ran
mypy python-package/lightgbm/dask.py
. There are still somemypy
issues but I don't think any are a direct result of this PR and some come from LightGBM's dependencies. mypy can also get easily confused when you use mixin classes, which we do a lot in the Dask package and the sklearn classes it imports. So I'm not worrying too much about the mypy stuff below right now (other than the one I fixed in #3865 ).mypy results for the curious