Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Merge data allows matching by multiple pairs of columns #3919

Merged
merged 11 commits into from
Jul 30, 2019

Conversation

janezd
Copy link
Contributor

@janezd janezd commented Jul 1, 2019

Issue

Fixes #2121 by allowing multiple pairs of columns to be matched.

As a consequence, we require that the combination of values is unique, so combos list all available attributes. These lists are unfortunately longer, yet the upside is that the widget does not show a warning every time when it gets data with non-unique attributes (which is ... always). One may argue that it would be nice to show to the user which columns are useful for merging, but the user probably has a very good idea of what (s)he wants to match.

Includes
  • Code changes
  • Tests
  • Documentation

@janezd janezd force-pushed the owmergedata-multiple-attrs branch from 66d363a to 1130ca1 Compare July 1, 2019 16:44
@janezd janezd changed the title Owmergedata multiple attrs [ENH] Merge data allows matching by multiple pairs of columns Jul 1, 2019
@janezd janezd force-pushed the owmergedata-multiple-attrs branch from 1130ca1 to 2338a2a Compare July 12, 2019 13:03
@codecov
Copy link

codecov bot commented Jul 12, 2019

Codecov Report

Merging #3919 into master will decrease coverage by 0.11%.
The diff coverage is 91.85%.

@@            Coverage Diff             @@
##           master    #3919      +/-   ##
==========================================
- Coverage   84.95%   84.83%   -0.12%     
==========================================
  Files         376      371       -5     
  Lines       65907    65631     -276     
==========================================
- Hits        55988    55677     -311     
- Misses       9919     9954      +35

@codecov
Copy link

codecov bot commented Jul 12, 2019

Codecov Report

Merging #3919 into master will increase coverage by 0.09%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3919      +/-   ##
==========================================
+ Coverage   85.03%   85.12%   +0.09%     
==========================================
  Files         378      378              
  Lines       66706    67117     +411     
==========================================
+ Hits        56722    57134     +412     
+ Misses       9984     9983       -1

@ajdapretnar
Copy link
Contributor

Merging by several variables works quite fine. There was one extreme case where my second data was bigger than the first one and it had missing values for both columns I matched on. Then the output contained duplicate columns, even though the matching was unique. Not such a big deal.

We can of course merge this and fix some stuff later. What would be nice is to try to match the column names. I have, say, one column Class in my first data and a column Class in my second data. When I set Class as the first variable, the widget could propose Class from the second data set to match on. Now it doesn't and reports an error Row index cannot be matched by Class. Well yeah, but I didn't yet set all my variables. Having so many errors while still being in the process of setting the variables is a bit intimidating.

Also, it is not entirely clear that the bottom part also applies to the two previous methods (Append columns and Find matching). Perhaps the gui could be made clearer? (not sure how, though)

Finally, to really make the most of this widget it would be nice to have Unique widget in Orange instead of in Prototypes. This widget enables the user to provide data with unique values only, which then can be used in Merge Data.

@PrimozGodec
Copy link
Contributor

The changes look and work great for me. I have not found any new issue beside things @ajdapretnar has found. In my opinion, it can be merged and thinks can be fixed later.

For me, the most critical thing is that it seems that attributes fields at the bottom look like they are only available for the last selection Concatenate tables. I would at least remove the indentation in front, but probably there is a better solution. Also, documentation will need to be updated later.

I would say that you and @ajdapretnar decide whether issues are addressed as a part of this PR or it is merged and issues are addressed in the next PR.

@janezd
Copy link
Contributor Author

janezd commented Jul 24, 2019

Fixed.

I implemented matching of names in the combos: if the other combo was left at default (well, if it contains Row Index or instance Id), and there is a matching variable (same name and compatible type), the other combo will change.

I moved the combos to a separate box. It looks much better. I also added tooltips and a hint to see the tooltips. It is intrusive, but the options are so difficult to understand that it's better to have it.

As for the problem described by @ajdapretnar, I'd have to see the data. If it requires more debugging, I'd prefer a separate issue, perhaps to be solved by Vesna. I'm a bit incapacitated right now; I prefer typing over reading, so debugging longer pieces of code is a bit of struggle. :)

@janezd janezd force-pushed the owmergedata-multiple-attrs branch from 0e5db2a to 6a2d31f Compare July 24, 2019 16:29
@ajdapretnar
Copy link
Contributor

Matching of names is fantastic! Love it!

I vote for merging this.

ajdapretnar
ajdapretnar previously approved these changes Jul 26, 2019
@ajdapretnar ajdapretnar self-requested a review July 26, 2019 12:44
@ajdapretnar
Copy link
Contributor

Sorry, but had to revoke the Approve. I found a fairly serious bug while writing the documentation.
Concatenate tables doesn't work as it should. See the screenshot below. Data Table reports 7 instances, 6 are shown.

Screen Shot 2019-07-26 at 14 40 53

I tried it with a data set, where I am adding coordinates to cities. In the Extra Data I have also Belgrade, which doesn't make it into the Data Table, even though it should with Concatenate tables. When connecting the output to Scatter Plot, it crashes.

in_data.X shows correct values, but in_data.ids doesn't (only 6 instances shown).

@janezd janezd force-pushed the owmergedata-multiple-attrs branch from 6a2d31f to 14d2cf0 Compare July 27, 2019 07:47
@janezd
Copy link
Contributor Author

janezd commented Jul 27, 2019

Wow, thanks for the hint: the problem were indeed the missing ids. You saved me so much time!

Interestingly, this was an old bug: line table.ids = self.data.ids (https://github.com/biolab/orange3/pull/3919/files#diff-a0b6ce572710e3beed1886b36308fc39L385) worked properly only for the first option (left join).

@janezd janezd force-pushed the owmergedata-multiple-attrs branch from 14d2cf0 to 480844e Compare July 27, 2019 08:25
@ajdapretnar
Copy link
Contributor

This solve the bug (crash), but does not yet work just as expected. Now the table looks as it should, but when merging on a column, the final city that is present only in the second data set gets a nan instead of a name.

Screen Shot 2019-07-29 at 11 50 16

@janezd janezd force-pushed the owmergedata-multiple-attrs branch from 480844e to 690c9d1 Compare July 30, 2019 11:25
@janezd
Copy link
Contributor Author

janezd commented Jul 30, 2019

This issue is old and moot.

If the same variable appears in both tables and you're not matching by this variable, it may have mismatching values. We ignore this and use the value from the primary table.

Your expectation is however reasonable, so I implemented a fix.

This will only apply to concatenated rows. If a city name in the primary table is missing (and you're matching by some other variable, obviously), the name will not be copied from the secondary table, because if it worked like this, it would also have to warn about mismatching values, but I wouldn't like to go there. At least not in this PR. :)

@ajdapretnar ajdapretnar merged commit 7c23b9a into biolab:master Jul 30, 2019
@ajdapretnar ajdapretnar mentioned this pull request Aug 1, 2019
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] OWMerge : Chose multiple variables in the 'where' condition
3 participants