-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support intersect all and except distinct/all in DataFrame API #3537
base: main
Are you sure you want to change the base?
feat: Support intersect all and except distinct/all in DataFrame API #3537
Conversation
CodSpeed Performance ReportMerging #3537 will improve performances by 31.36%Comparing Summary
Benchmarks breakdown
|
@universalmind303 @kevinzwang would you mind to take a look at this when you have some time? This PR might be split into smaller one and need some docs and test refinements. I'd like to get some feedback early before going further. BTW, sorry for being away for quite long time as I'm kind busy theses weeks and it took me some time to figure out how to add the |
.zip(left_cols.iter()) | ||
.map(|(r, l)| r.alias(l.name())) | ||
.collect::<Vec<ExprRef>>(); | ||
let virtual_col_l = "__v_col_l"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the suffix __
is added to avoid potential name conflicts with names in Dataframe.
@@ -1657,6 +1657,7 @@ class LogicalPlanBuilder: | |||
) -> LogicalPlanBuilder: ... | |||
def concat(self, other: LogicalPlanBuilder) -> LogicalPlanBuilder: ... | |||
def intersect(self, other: LogicalPlanBuilder, is_all: bool) -> LogicalPlanBuilder: ... | |||
def except_(self, other: LogicalPlanBuilder, is_all: bool) -> LogicalPlanBuilder: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there's a better name for this, it would be really appreciated.
tests/conftest.py
Outdated
@@ -183,6 +183,16 @@ def assert_df_equals( | |||
print(f"Failed assertion for col: {col}") | |||
raise | |||
|
|||
def check_answer(df: daft.DataFrame, expected_answer: Dict[str, Any], is_sorted: bool = False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need this kind of test helper to reduce boilerplate code.
b5e4e7c
to
0edf691
Compare
@universalmind303 Hi Cory, I think it's ready for the first round of review. Really appreciate it if you can take a look at this and let me know what changes need to be made. |
0d81ac4
to
cae1b42
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3537 +/- ##
==========================================
+ Coverage 77.84% 77.99% +0.15%
==========================================
Files 718 720 +2
Lines 88250 88778 +528
==========================================
+ Hits 68696 69246 +550
+ Misses 19554 19532 -22
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few minor things, but overall looks good! Let me know once these are addressed and I can go ahead and merge this in. Thanks again @advancedxy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we add a couple tests for listfill in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, PTAL.
} | ||
|
||
fn name(&self) -> &'static str { | ||
"fill" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
"fill" | |
"list_fill" |
let one_lit = lit(1); | ||
let left_v_cnt = col(virtual_col_l) | ||
.count(CountMode::Valid) | ||
.alias("__v_l_cnt"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we change these static column names into consts, and put them all next to eachother.
src/daft-core/src/array/ops/list.rs
Outdated
@@ -255,6 +255,31 @@ fn list_sort_helper_fixed_size( | |||
.collect() | |||
} | |||
|
|||
fn general_list_fill_helper(element: &Series, num_array: &Int64Array) -> DaftResult<Vec<Series>> { | |||
let num_iter = create_iter(num_array, element.len()); | |||
let mut result = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we preallocate the capacity here?
let mut result = Vec::with_capacity(...)
hmm, I'm not sure why code speed is affected, it should not be related. |
This commits adds complete support for intersect and except operations. To support that, it's consisted of:
intersect_all
,except_distinct
andexcept_all
in python side's DataFrame APIlist_fill(num, val)
which will create a list of num elements with the providedval
The intersect/except all logic is a bit complex and it will require list_fill and explode to correctly compute the intersected or excepted rows.