-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Scatter Plot Graph and Scaling can handle metas #2699
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so much simpler. All widgets should use models for combos/lists of variables.
Orange/widgets/utils/scaling.py
Outdated
no_jit[index] /= 2 * len(attr.values) | ||
c *= 2 | ||
c += 1 | ||
c /= 2 * len(attr.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't (2x+1)/(2a) same as (x + 0.5)/a? Aren't the above three lines the same as
c += 0.5
c /= len(attr.values)
After rewriting it this way, I also understand what the code actually does. :) Moves the data into the middle of intervals and normalizes it to [0, 1].
I know this was here before, but only God knows why it was written like this.
Orange/widgets/utils/scaling.py
Outdated
|
||
def _compute_jittered_data(self): | ||
data = self.data | ||
self.jittered_data = self.scaled_data.copy() | ||
random = np.random.RandomState(seed=self.jitter_seed) | ||
for index, col in enumerate(self.jittered_data): | ||
domain = self.domain | ||
for attr in domain.attributes + domain.class_vars + domain.metas: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use chain(domain, domain.metas)
in such situations.
Orange/widgets/utils/scaling.py
Outdated
self.data = data | ||
domain = data.domain | ||
self.primitives = data.domain.attributes + data.domain.class_vars + \ | ||
tuple([v for v in data.domain.metas if v.is_primitive()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use chain(data.domain, (v for v in data.domain.metas if v.is_primitive()))
.
If you keep it as it is: don't use tuple([...])
: just use generators, tuple(...)
, without constructing a list first.
Orange/widgets/utils/scaling.py
Outdated
tuple([v for v in data.domain.metas if v.is_primitive()]) | ||
new_domain = Domain(attributes=domain.attributes, | ||
class_vars=domain.class_vars, | ||
metas=tuple([v for v in domain.metas if v.is_primitive()])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same - no need to construct a list first.
Orange/widgets/utils/scaling.py
Outdated
col = self.jittered_data.get_column_view(attr)[0] | ||
col = 1 - col | ||
col = self.scaled_data.get_column_view(attr)[0] | ||
col = 1 - col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These four lines have no effect, do they?
Did you mean something like this?
col *= -1
col += 1
Or (shorter, but maybe slower) col[:] = 1 - col
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the latter one can consume even 4 times more time.
|
||
attributes = self.data.domain.attributes + (self.variable_x, self.variable_y) + \ | ||
primitive_metas | ||
attributes = data.domain.attributes + (self.variable_x, self.variable_y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BlazZupan, do we want the columns that contain x and y coordinates from the MDS to appear as ordinary attributes or as metas?
They used to be in metas, but now that visualization widgets can show metas as well, I would tend to store x and y as metas to not pollute that feature set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this line here is not related to that (it's for constructing data for plotting).
Output data has new coordinates in metas as we want, see commit function below (L671)
Orange/widgets/utils/scaling.py
Outdated
attr = self.domain[index] | ||
getCached(self.data, DomainBasicStats, (self.data, True)) | ||
domain = self.domain | ||
for attr in domain.attributes + domain.class_vars + domain.metas: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe chain(domain, domain.metas)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be sincere, I would prefer domain.all_variables
or something similar. Like we have domain.variables = domain.attributes + domain.class_vars
we could have something that includes all three: attributes
, class_vars
and metas
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate this, too.
When one gets older, (s)he start adding historic notes to everything. So let me say that in the beginning, like 20 years ago, Orange used to have attributes and classes, and both together were called variables. Then I introduced meta attributes, and they didn't mean the same thing as today: meta attributes were (essentially) sparse, while attributes and classes were dense ("essentially", because there was no numpy, these were C++ structures). Since all code - before the introduction of metas - expected that variables were attributes + classes, I would break everything if I added metas there. It also made more sense than today since variables (=attributes + classes) were the dense part of the data.
The idea that any part can be dense or sparse, and that metas are not so different from attributes and so forth evolved gradually even in Orange 3. Hence we have not initially decided that variables
will be attributes + class + meta. To make things worse, we (most probably: I) extended this bad design to iteration over Domain
. @kernc (I think) correctly pointed out that since var in domain
returns True
also when var
is in domain.metas
, so for var in domain
should iterate over attributes, classes and also metas. So in all the places where you could use chain(domain, domain.metas)
in your code, just domain
should suffice instead. Alas, fixing the iteration over Domain
would break so much code that we gave up. This horrible design decision stays forever, I fear.
I don't like all_variables
. variables
should already be all, so all_variables
sounds almost like really_all_variables
. Plus, the next thing you'll want will be primitive_variables
. And maybe also continuous_variables
. I wouldn't like to multiply the number of domains properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of brainstorming - and if this is indeed necessary - what about having a method Domain.all(condition)
. If condition
is a type derived from Variable
, all
returns all variables of that type (and derived types). Hence , Domain.all(DiscreteVariable)
returns all discrete variables. We then add PrimitiveVariable
as a new common base class for DiscreteVariable
and ContinuousVariable
. With that, Domain.all(PrimitiveVariable)
returns all primitive variables. Finally, the default value for condition
is Variable
, so Domain.all()
returns attributes + class_vars + metas.
Method all
could also accept other types of conditions if we can think of any.
Just for brainstorming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, I implemented it.
Orange/widgets/utils/scaling.py
Outdated
""" | ||
Get array of 0 and 1 of len = len(self.data). If there is a missing | ||
value at any attribute in indices return 0 for that instance. | ||
""" | ||
if self.valid_data_array is None or len(self.valid_data_array) == 0: | ||
return np.array([], np.bool) | ||
domain = self.domain | ||
indices = [] | ||
for index, attr in enumerate(domain.attributes + domain.class_vars + domain.metas): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enumerate(chain(domain, domain.metas))
?
Orange/data/domain.py
Outdated
@@ -211,6 +211,9 @@ def variables(self): | |||
def metas(self): | |||
return self._metas | |||
|
|||
def all(self, condition=Variable): | |||
return [v for v in chain(self._variables, self._metas) if isinstance(v, condition)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.all()
is not good. The idiom means something completely different in Python/NumPy context. .filter()
would be better, .where()
would be best.
This change should require re-touching everywhere the former ([]+[]
or chain
) idiom was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like where
. :)
all
reads well: I want all discrete variables, hence domain.all(DiscreteVariable)
. But I agree the name is wrong wrt its usual meaning in Python.
domain.where(DiscreteVariable)
doesn't read so well and it also means two different things in numpy, one of which is similar to this Domain.where
(but with different arguments) and one which is completely different.
domain.filter(DiscreteVariable)
is sort of fine, but it gives an impression that it will return a new instance of Domain
with just discrete variables.
What about domain.collect(DiscreteVariable)
? Or domain.pick(DiscreteVariable)
-- though I prefer collect
?
Codecov Report
@@ Coverage Diff @@
## master #2699 +/- ##
==========================================
- Coverage 76.06% 76.05% -0.02%
==========================================
Files 338 338
Lines 59675 59670 -5
==========================================
- Hits 45391 45381 -10
- Misses 14284 14289 +5 |
Orange/data/domain.py
Outdated
@@ -211,6 +211,9 @@ def variables(self): | |||
def metas(self): | |||
return self._metas | |||
|
|||
def where(self, condition=Variable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you do this in a separate PR?
While I agree that this change is neatly hidden here, I would prefer to have modifications to Variables / Domain in a separate, clearly marked PR (where we can discuss different alternatives).
Orange/widgets/utils/scaling.py
Outdated
attr = self.domain[index] | ||
getCached(self.data, DomainBasicStats, (self.data, True)) | ||
domain = self.domain | ||
for attr in domain.where(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
cols = [] | ||
for var in attrs: | ||
cols.append(graph.jittered_data.get_column_view(var)[0][valid]) | ||
X = np.vstack(cols).T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X = np.column_stack(cols)
all_data = (data.X, Y, data.metas) | ||
else: | ||
all_data = (data.X, Y) | ||
all_data = np.hstack(all_data).T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can have dtype==object when stacking with metas which will cause np.isfinite to fail 2 lines below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here, but I get similar errors elsewhere (see sentry report).
Try fixing all metas dtype problems (converting all primitive metas to float might be enough, maybe something else is also needed). Then please test thoroughly! Use at least the workflow from the sentry report.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found another error caused by Corpus (issue: biolab/orange3-text#324 )
[ENH] Scatter Plot Graph and Scaling can handle metas
Issue
Scatter Plot moves primitive meas to attributes.
Description of changes
Includes