-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Merge data allows matching by multiple pairs of columns #3919
[ENH] Merge data allows matching by multiple pairs of columns #3919
Conversation
66d363a
to
1130ca1
Compare
1130ca1
to
2338a2a
Compare
Codecov Report
@@ Coverage Diff @@
## master #3919 +/- ##
==========================================
- Coverage 84.95% 84.83% -0.12%
==========================================
Files 376 371 -5
Lines 65907 65631 -276
==========================================
- Hits 55988 55677 -311
- Misses 9919 9954 +35 |
Codecov Report
@@ Coverage Diff @@
## master #3919 +/- ##
==========================================
+ Coverage 85.03% 85.12% +0.09%
==========================================
Files 378 378
Lines 66706 67117 +411
==========================================
+ Hits 56722 57134 +412
+ Misses 9984 9983 -1 |
2338a2a
to
f4461a6
Compare
Merging by several variables works quite fine. There was one extreme case where my second data was bigger than the first one and it had missing values for both columns I matched on. Then the output contained duplicate columns, even though the matching was unique. Not such a big deal. We can of course merge this and fix some stuff later. What would be nice is to try to match the column names. I have, say, one column Class in my first data and a column Class in my second data. When I set Class as the first variable, the widget could propose Class from the second data set to match on. Now it doesn't and reports an error Row index cannot be matched by Class. Well yeah, but I didn't yet set all my variables. Having so many errors while still being in the process of setting the variables is a bit intimidating. Also, it is not entirely clear that the bottom part also applies to the two previous methods (Append columns and Find matching). Perhaps the gui could be made clearer? (not sure how, though) Finally, to really make the most of this widget it would be nice to have Unique widget in Orange instead of in Prototypes. This widget enables the user to provide data with unique values only, which then can be used in Merge Data. |
The changes look and work great for me. I have not found any new issue beside things @ajdapretnar has found. In my opinion, it can be merged and thinks can be fixed later. For me, the most critical thing is that it seems that attributes fields at the bottom look like they are only available for the last selection I would say that you and @ajdapretnar decide whether issues are addressed as a part of this PR or it is merged and issues are addressed in the next PR. |
dbf5ac4
to
0e5db2a
Compare
Fixed. I implemented matching of names in the combos: if the other combo was left at default (well, if it contains Row Index or instance Id), and there is a matching variable (same name and compatible type), the other combo will change. I moved the combos to a separate box. It looks much better. I also added tooltips and a hint to see the tooltips. It is intrusive, but the options are so difficult to understand that it's better to have it. As for the problem described by @ajdapretnar, I'd have to see the data. If it requires more debugging, I'd prefer a separate issue, perhaps to be solved by Vesna. I'm a bit incapacitated right now; I prefer typing over reading, so debugging longer pieces of code is a bit of struggle. :) |
0e5db2a
to
6a2d31f
Compare
Matching of names is fantastic! Love it! I vote for merging this. |
Sorry, but had to revoke the Approve. I found a fairly serious bug while writing the documentation. I tried it with a data set, where I am adding coordinates to cities. In the Extra Data I have also Belgrade, which doesn't make it into the Data Table, even though it should with Concatenate tables. When connecting the output to Scatter Plot, it crashes. in_data.X shows correct values, but in_data.ids doesn't (only 6 instances shown). |
6a2d31f
to
14d2cf0
Compare
Wow, thanks for the hint: the problem were indeed the missing ids. You saved me so much time! Interestingly, this was an old bug: line |
14d2cf0
to
480844e
Compare
480844e
to
690c9d1
Compare
This issue is old and moot. If the same variable appears in both tables and you're not matching by this variable, it may have mismatching values. We ignore this and use the value from the primary table. Your expectation is however reasonable, so I implemented a fix. This will only apply to concatenated rows. If a city name in the primary table is missing (and you're matching by some other variable, obviously), the name will not be copied from the secondary table, because if it worked like this, it would also have to warn about mismatching values, but I wouldn't like to go there. At least not in this PR. :) |
Issue
Fixes #2121 by allowing multiple pairs of columns to be matched.
As a consequence, we require that the combination of values is unique, so combos list all available attributes. These lists are unfortunately longer, yet the upside is that the widget does not show a warning every time when it gets data with non-unique attributes (which is ... always). One may argue that it would be nice to show to the user which columns are useful for merging, but the user probably has a very good idea of what (s)he wants to match.
Includes