[ENH] Merge data allows matching by multiple pairs of columns #3919

janezd · 2019-07-01T16:24:33Z

Issue

Fixes #2121 by allowing multiple pairs of columns to be matched.

As a consequence, we require that the combination of values is unique, so combos list all available attributes. These lists are unfortunately longer, yet the upside is that the widget does not show a warning every time when it gets data with non-unique attributes (which is ... always). One may argue that it would be nice to show to the user which columns are useful for merging, but the user probably has a very good idea of what (s)he wants to match.

Includes

Code changes
Tests
Documentation

codecov · 2019-07-12T13:03:36Z

Codecov Report

Merging #3919 into master will decrease coverage by 0.11%.
The diff coverage is 91.85%.

@@            Coverage Diff             @@
##           master    #3919      +/-   ##
==========================================
- Coverage   84.95%   84.83%   -0.12%     
==========================================
  Files         376      371       -5     
  Lines       65907    65631     -276     
==========================================
- Hits        55988    55677     -311     
- Misses       9919     9954      +35

codecov · 2019-07-12T13:03:36Z

Codecov Report

Merging #3919 into master will increase coverage by 0.09%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3919      +/-   ##
==========================================
+ Coverage   85.03%   85.12%   +0.09%     
==========================================
  Files         378      378              
  Lines       66706    67117     +411     
==========================================
+ Hits        56722    57134     +412     
+ Misses       9984     9983       -1

ajdapretnar · 2019-07-19T08:45:05Z

Merging by several variables works quite fine. There was one extreme case where my second data was bigger than the first one and it had missing values for both columns I matched on. Then the output contained duplicate columns, even though the matching was unique. Not such a big deal.

We can of course merge this and fix some stuff later. What would be nice is to try to match the column names. I have, say, one column Class in my first data and a column Class in my second data. When I set Class as the first variable, the widget could propose Class from the second data set to match on. Now it doesn't and reports an error Row index cannot be matched by Class. Well yeah, but I didn't yet set all my variables. Having so many errors while still being in the process of setting the variables is a bit intimidating.

Also, it is not entirely clear that the bottom part also applies to the two previous methods (Append columns and Find matching). Perhaps the gui could be made clearer? (not sure how, though)

Finally, to really make the most of this widget it would be nice to have Unique widget in Orange instead of in Prototypes. This widget enables the user to provide data with unique values only, which then can be used in Merge Data.

PrimozGodec · 2019-07-20T11:43:46Z

The changes look and work great for me. I have not found any new issue beside things @ajdapretnar has found. In my opinion, it can be merged and thinks can be fixed later.

For me, the most critical thing is that it seems that attributes fields at the bottom look like they are only available for the last selection Concatenate tables. I would at least remove the indentation in front, but probably there is a better solution. Also, documentation will need to be updated later.

I would say that you and @ajdapretnar decide whether issues are addressed as a part of this PR or it is merged and issues are addressed in the next PR.

janezd · 2019-07-24T13:12:03Z

Fixed.

I implemented matching of names in the combos: if the other combo was left at default (well, if it contains Row Index or instance Id), and there is a matching variable (same name and compatible type), the other combo will change.

I moved the combos to a separate box. It looks much better. I also added tooltips and a hint to see the tooltips. It is intrusive, but the options are so difficult to understand that it's better to have it.

As for the problem described by @ajdapretnar, I'd have to see the data. If it requires more debugging, I'd prefer a separate issue, perhaps to be solved by Vesna. I'm a bit incapacitated right now; I prefer typing over reading, so debugging longer pieces of code is a bit of struggle. :)

ajdapretnar · 2019-07-26T08:03:57Z

Matching of names is fantastic! Love it!

I vote for merging this.

Found bug.

ajdapretnar · 2019-07-26T12:47:41Z

Sorry, but had to revoke the Approve. I found a fairly serious bug while writing the documentation.
Concatenate tables doesn't work as it should. See the screenshot below. Data Table reports 7 instances, 6 are shown.

I tried it with a data set, where I am adding coordinates to cities. In the Extra Data I have also Belgrade, which doesn't make it into the Data Table, even though it should with Concatenate tables. When connecting the output to Scatter Plot, it crashes.

in_data.X shows correct values, but in_data.ids doesn't (only 6 instances shown).

…always strings

janezd · 2019-07-27T07:58:29Z

Wow, thanks for the hint: the problem were indeed the missing ids. You saved me so much time!

Interestingly, this was an old bug: line table.ids = self.data.ids (https://github.com/biolab/orange3/pull/3919/files#diff-a0b6ce572710e3beed1886b36308fc39L385) worked properly only for the first option (left join).

ajdapretnar · 2019-07-29T09:50:42Z

This solve the bug (crash), but does not yet work just as expected. Now the table looks as it should, but when merging on a column, the final city that is present only in the second data set gets a nan instead of a name.

… StringVariable)

janezd · 2019-07-30T11:38:01Z

This issue is old and moot.

If the same variable appears in both tables and you're not matching by this variable, it may have mismatching values. We ignore this and use the value from the primary table.

Your expectation is however reasonable, so I implemented a fix.

This will only apply to concatenated rows. If a city name in the primary table is missing (and you're matching by some other variable, obviously), the name will not be copied from the secondary table, because if it worked like this, it would also have to warn about mismatching values, but I wouldn't like to go there. At least not in this PR. :)

janezd force-pushed the owmergedata-multiple-attrs branch from 66d363a to 1130ca1 Compare July 1, 2019 16:44

janezd changed the title ~~Owmergedata multiple attrs~~ [ENH] Merge data allows matching by multiple pairs of columns Jul 1, 2019

janezd force-pushed the owmergedata-multiple-attrs branch from 1130ca1 to 2338a2a Compare July 12, 2019 13:03

janezd force-pushed the owmergedata-multiple-attrs branch from 2338a2a to f4461a6 Compare July 12, 2019 13:03

lanzagar assigned PrimozGodec Jul 19, 2019

PrimozGodec assigned ajdapretnar Jul 20, 2019

janezd force-pushed the owmergedata-multiple-attrs branch from dbf5ac4 to 0e5db2a Compare July 24, 2019 13:03

janezd force-pushed the owmergedata-multiple-attrs branch from 0e5db2a to 6a2d31f Compare July 24, 2019 16:29

ajdapretnar previously approved these changes Jul 26, 2019

View reviewed changes

ajdapretnar self-requested a review July 26, 2019 12:44

janezd added 9 commits July 27, 2019 09:47

OWMergeData: Allow merging by multiple columns

c12ba34

OWMergeData: Tests for uniqueness of combinations, not single columns

9852019

OWMergeData: Fix report, other minor fixes

969f043

OWMergData: Increase test coverage

1572677

OWMergData: Fix reloading schema settings

81b8db0

OWMergeData: Sync combos if the attribute with the same name exists

1fd20ba

OWMergeData: In searching for the best match, don't assume metas are …

ba38e73

…always strings

OWMergeData: Improve gui, add tooltips

4f6ac7d

OWMergeData: Improve test coverage

30c9e67

janezd force-pushed the owmergedata-multiple-attrs branch from 6a2d31f to 14d2cf0 Compare July 27, 2019 07:47

janezd force-pushed the owmergedata-multiple-attrs branch from 14d2cf0 to 480844e Compare July 27, 2019 08:25

janezd added 2 commits July 30, 2019 13:23

OWMergeData: Fix ids and duplicated columns in outer join

2bc5b69

Table: Properly iitialize missing values (empty strings, not nans for…

690c9d1

… StringVariable)

janezd force-pushed the owmergedata-multiple-attrs branch from 480844e to 690c9d1 Compare July 30, 2019 11:25

ajdapretnar merged commit 7c23b9a into biolab:master Jul 30, 2019

ajdapretnar mentioned this pull request Aug 1, 2019

Merge Data: extend documentation #3958

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Merge data allows matching by multiple pairs of columns #3919

[ENH] Merge data allows matching by multiple pairs of columns #3919

janezd commented Jul 1, 2019

codecov bot commented Jul 12, 2019

codecov bot commented Jul 12, 2019 •

edited

Loading

ajdapretnar commented Jul 19, 2019

PrimozGodec commented Jul 20, 2019

janezd commented Jul 24, 2019

ajdapretnar commented Jul 26, 2019

ajdapretnar commented Jul 26, 2019

janezd commented Jul 27, 2019

ajdapretnar commented Jul 29, 2019

janezd commented Jul 30, 2019

[ENH] Merge data allows matching by multiple pairs of columns #3919

[ENH] Merge data allows matching by multiple pairs of columns #3919

Conversation

janezd commented Jul 1, 2019

Issue

Includes

codecov bot commented Jul 12, 2019

Codecov Report

codecov bot commented Jul 12, 2019 • edited Loading

Codecov Report

ajdapretnar commented Jul 19, 2019

PrimozGodec commented Jul 20, 2019

janezd commented Jul 24, 2019

ajdapretnar commented Jul 26, 2019

ajdapretnar commented Jul 26, 2019

janezd commented Jul 27, 2019

ajdapretnar commented Jul 29, 2019

janezd commented Jul 30, 2019

codecov bot commented Jul 12, 2019 •

edited

Loading