-
-
Notifications
You must be signed in to change notification settings - Fork 257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added TfidfTransformer and TfidfVectorizer to feature_extraction.text #869
base: main
Are you sure you want to change the base?
Conversation
just the skeleton (no tests yet)
@@ -12,9 +12,13 @@ | |||
import scipy.sparse | |||
import sklearn.base | |||
import sklearn.feature_extraction.text | |||
import sklearn.preprocessing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import is needed for its normalize()
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does sklearn.processing.normalize eagerly return a NumPy array? Or does it operate lazily on Dask Arrays?
If it's eager, we would need to reimplement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I use sklearn.processing.normalize()
only through dask.array.Array.map_blocks()
, the operation is lazy even though sklearn.processing.normalize()
is itself not lazy.
I hope this is acceptable.
dask_ml/feature_extraction/text.py
Outdated
params = self.get_params() | ||
subclass_instance_params = self.get_params() | ||
excluded_keys = getattr(self, '_non_CountVectorizer_params', []) | ||
params = {key: subclass_instance_params[key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my "patch-up" solution to get params
to hold only parameters from CountVectorizer
and not its subclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about a parent class needing to know about the details of its subclasses.
Is it possible for each subclass to override get_params
to do the right thing?
-------- | ||
sklearn.feature_extraction.text.TfidfTransformer | ||
|
||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These examples have worked for me.
-------- | ||
sklearn.feature_extraction.text.TfidfVectorizer | ||
|
||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These examples have worked for me.
Tests have now been added. |
I have also just now added support for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started to review, it'll be a while before I can finish.
Can you share a bit about:
- How does this scale for large inputs
- Where all does computation occur during initialization, fitting, and transforming, and why can't it be done lazily?
.gitignore
Outdated
@@ -122,3 +122,5 @@ docs/source/auto_examples/ | |||
docs/source/examples/mydask.png | |||
|
|||
dask-worker-space | |||
/.project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend putting this in a global gitignore file: https://stackoverflow.com/questions/7335420/global-git-ignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for the recommendation. I was unaware of this trick.
@@ -12,9 +12,13 @@ | |||
import scipy.sparse | |||
import sklearn.base | |||
import sklearn.feature_extraction.text | |||
import sklearn.preprocessing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does sklearn.processing.normalize eagerly return a NumPy array? Or does it operate lazily on Dask Arrays?
If it's eager, we would need to reimplement it.
dask_ml/feature_extraction/text.py
Outdated
aggregate=np.sum, | ||
axis=0, | ||
concatenate=False, | ||
dtype=dtype).compute().astype(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the astype needed? Shouldn't passing dtype
to reduction ensure it's already the right type?
Also, do we need to compute
in this function? Or can it be done lazily (I haven't looked at how this is used yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I should have removed those.
dask_ml/feature_extraction/text.py
Outdated
params = self.get_params() | ||
subclass_instance_params = self.get_params() | ||
excluded_keys = getattr(self, '_non_CountVectorizer_params', []) | ||
params = {key: subclass_instance_params[key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about a parent class needing to know about the details of its subclasses.
Is it possible for each subclass to override get_params
to do the right thing?
dask_ml/feature_extraction/text.py
Outdated
*vocabularies.to_delayed() | ||
) | ||
vocabulary = vocabulary_for_transform = ( | ||
_merge_vocabulary( *vocabularies.to_delayed() )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it'll cause a linting error. The contributing docs should have some info about setting up pre-commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check the contributing docs. Does this repo execute Action Workflow Scripts that also lint PRs? If so, that would make it easier to standardize coding style.
dask_ml/feature_extraction/text.py
Outdated
result = raw_documents.map_partitions( | ||
_count_vectorizer_transform, vocabulary_for_transform, params) | ||
result = build_array(result, n_features, meta) | ||
result.compute_chunk_sizes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary? Ideally we avoid all unnecessary computation.
Thanks for your review. From your comments I realize that the dask programming paradigm is to delay all computations until such a time that the user executes |
If possible, yes. But sometimes intermediate computation is inevitable. |
True. I've cleaned things up a bit now — all unnecessary calls to Also, Currently all tests are passing. The one outstanding issue regards the 'worrying' side-effect of using |
0a20f5c
to
1fa55ca
Compare
By the way, I do not have access to a cluster, so I'm not sure how the code scales with the cluster-size. I merely presumed that if I wrote code similar to that of already existing dask-ml's If you know of any way I can test the code in a truly distributed environment, kindly let me know. |
get_CountVectorizer_params()
@@ -166,10 +215,35 @@ class CountVectorizer(sklearn.feature_extraction.text.CountVectorizer): | |||
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'] | |||
""" | |||
|
|||
def get_CountVectorizer_params(self, deep=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about a parent class needing to know about the details of its subclasses.
Is it possible for each subclass to override get_params to do the right thing?
How about this? I've instead added a new method to CountVectorizer
called .get_CountVectorizer_params()
whose implementation is a slight modification of the original .get_params()
of the sklearn.Base.BaseEstimator
class but which does what is expected. Subclasses do not need to override it. Moreover, CountVectorizer
does not get to "know" the parameters of its subclasses. I hope this is acceptable.
Hi,
Thanks to all dask-developers for your outstanding work!
In this PR, I have attempted to apply my rudimentary knowledge of dask to include dask-implementations of
TfidfTransformer
andTfidfVectorizer
(found in thesklearn.feature_extraction.text
module) in thedask-ml.feature_extraction.text
module. For now, just a minimal working code (no unit-test yet) is available. Though the examples I've hard-coded into their docstrings should be able to run without incident.I think it requires someone with proper dask-expertise to inspect it and give me some pointers. In the meantime, I'll draw-up the tests.
Hopefully, this will prove to be a useful extension to dask-ml. At least it would for me, if it would eventually be merged upstream.