Set Operations on two tables #761

san-r · 2021-11-28T07:23:37Z

This is about the set operations available in R, shown on the second page of https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf under the section "Combine Data Sets", sub-section "Set Operations".

Suppose we have the following two tables:

y.csv

x1,x2
A,1
B,2
C,3

z.csv

x1,x2
B,2
C,3
D,4

Now the Set Operations:

intersect (Rows that appear in both y and z)

R command:
intersect(y, z)

x1	x2
B	2
C	3

Equivalent Miller command:
?

union (Rows that appear in either or both y and z)

R command:
union(y, z)

x1	x2
A	1
B	2
C	3
D	4

Equivalent Miller command:
?

setdiff (Rows that appear in y but not z)

R command:
setdiff(y, z)

x1	x2
A	1

Equivalent Miller command:
?

Can anyone please post the Miller equivalents of these operations?

aborruso · 2021-11-28T09:22:46Z

Hi @san-r,
I don't think there is a direct equivalent, but here are approaches based on join and uniq:

intersect

mlr --c2p --barred join -j x1,x2 -f y.csv then unsparsify z.csv

+----+----+
| x1 | x2 |
+----+----+
| B  | 2  |
| C  | 3  |
+----+----+

union

mlr --c2p --barred uniq -a y.csv z.csv

+----+----+
| x1 | x2 |
+----+----+
| A  | 1  |
| B  | 2  |
| C  | 3  |
| D  | 4  |
+----+----+

setdiff

mlr --c2p --barred join --np --ul -j x1,x2 -f y.csv then unsparsify z.csv

+----+----+
| x1 | x2 |
+----+----+
| A  | 1  |
+----+----+

san-r · 2021-11-28T12:29:18Z

Thanks for the quick reply. Here are my observations:

intersect

We can use the inner-join method to get the result. Every column name has to be listed with the -j parameter, which may be difficult for tables with hundreds of columns. Also, I think "then unsparsify" is not needed here since both tables have the same column names.
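
For wide tables, one way around typing every column name might be to read the list straight from the header line. This is only a minimal sketch, assuming both files share an identical header of plain, unquoted column names (no embedded commas, no BOM, Unix line endings). The `cols` variable is just an illustrative name, and the mlr call is skipped if Miller is not on the PATH:

```shell
# Recreate the sample files from the question.
printf 'x1,x2\nA,1\nB,2\nC,3\n' > y.csv
printf 'x1,x2\nB,2\nC,3\nD,4\n' > z.csv

# Take the join-field list from the header line instead of typing it.
# Assumption: plain, unquoted column names and identical headers in both files.
cols=$(head -n 1 y.csv)
echo "$cols"

# Inner join on every column (intersect); skipped when mlr is not installed.
if command -v mlr >/dev/null 2>&1; then
  mlr --c2p --barred join -j "$cols" -f y.csv z.csv
fi
```

With the sample files this should give the same rows as the intersect table posted above, without listing x1,x2 by hand.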

union

Running "uniq -a" over both CSV files does the job:
mlr --c2p --barred uniq -a y.csv z.csv

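Another way, which does the same thing in two steps, is:
mlr --csv cat y.csv z.csv | mlr --c2p --barred uniq -a
A pipe is used here because the file names must come after the whole verb chain, so "cat y.csv z.csv then uniq -a" will not work; "cat then uniq -a y.csv z.csv" in a single mlr call should.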

setdiff

We can use the anti-join method to get the result. Every column name has to be listed with the -j parameter, which may be difficult for tables with hundreds of columns. Also, I think "then unsparsify" is not needed here since both tables have the same column names.
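
The anti-join can also be written without a hand-typed key list, as long as the two headers really are the same. A rough sketch under that assumption (plain, unquoted column names; `cols` and `same_header` are illustrative names, not anything Miller defines), which checks the headers first and only calls mlr if it is installed:

```shell
# Recreate the sample files from the question.
printf 'x1,x2\nA,1\nB,2\nC,3\n' > y.csv
printf 'x1,x2\nB,2\nC,3\nD,4\n' > z.csv

# Only reuse the header as the key list when both headers are identical.
if [ "$(head -n 1 y.csv)" = "$(head -n 1 z.csv)" ]; then
  same_header=yes
  cols=$(head -n 1 y.csv)
else
  same_header=no
fi
echo "same header: $same_header"

# Anti-join on every column: unpaired rows from the left (-f) file, i.e. y minus z.
# Skipped when the headers differ or mlr is not installed.
if [ "$same_header" = yes ] && command -v mlr >/dev/null 2>&1; then
  mlr --c2p --barred join --np --ul -j "$cols" -f y.csv z.csv
fi
```

Swapping --ul for --ur should give the opposite difference, rows in z but not in y.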

So now we know the workarounds. Thanks.
