Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support intersect all and except distinct/all in DataFrame API #3537

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

advancedxy
Copy link
Contributor

This commits adds complete support for intersect and except operations. To support that, it's consisted of:

  1. Expose new methods such as intersect_all, except_distinct and except_all in python side's DataFrame API
  2. Add Except operator in set_operations and convert corresponding logical plans
  3. Complete intersect all branch with similar logic with except all
  4. Add a new scalar function: list_fill(num, val) which will create a list of num elements with the provided val

The intersect/except all logic is a bit complex and it will require list_fill and explode to correctly compute the intersected or excepted rows.

@advancedxy advancedxy changed the title [FEAT]: Support intersect all and except distinct/all in DataFrame API feat: Support intersect all and except distinct/all in DataFrame API Dec 10, 2024
@github-actions github-actions bot added the feat label Dec 10, 2024
Copy link

codspeed-hq bot commented Dec 10, 2024

CodSpeed Performance Report

Merging #3537 will improve performances by 31.36%

Comparing advancedxy:intersect_all_and_except_support (5a8ee7b) with main (b87e0a3)

Summary

⚡ 1 improvements
✅ 26 untouched benchmarks

Benchmarks breakdown

Benchmark main advancedxy:intersect_all_and_except_support Change
test_iter_rows_first_row[100 Small Files] 257.3 ms 195.9 ms +31.36%

@advancedxy
Copy link
Contributor Author

@universalmind303 @kevinzwang would you mind to take a look at this when you have some time? This PR might be split into smaller one and need some docs and test refinements. I'd like to get some feedback early before going further.

BTW, sorry for being away for quite long time as I'm kind busy theses weeks and it took me some time to figure out how to add the list_fill scalar function. Daft has too many encapsulations around the series, arrays, arrow_array and etc.

@universalmind303 universalmind303 self-requested a review December 10, 2024 18:12
.zip(left_cols.iter())
.map(|(r, l)| r.alias(l.name()))
.collect::<Vec<ExprRef>>();
let virtual_col_l = "__v_col_l";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the suffix __ is added to avoid potential name conflicts with names in Dataframe.

@@ -1657,6 +1657,7 @@ class LogicalPlanBuilder:
) -> LogicalPlanBuilder: ...
def concat(self, other: LogicalPlanBuilder) -> LogicalPlanBuilder: ...
def intersect(self, other: LogicalPlanBuilder, is_all: bool) -> LogicalPlanBuilder: ...
def except_(self, other: LogicalPlanBuilder, is_all: bool) -> LogicalPlanBuilder: ...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there's a better name for this, it would be really appreciated.

@@ -183,6 +183,16 @@ def assert_df_equals(
print(f"Failed assertion for col: {col}")
raise

def check_answer(df: daft.DataFrame, expected_answer: Dict[str, Any], is_sorted: bool = False):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need this kind of test helper to reduce boilerplate code.

@advancedxy advancedxy force-pushed the intersect_all_and_except_support branch from b5e4e7c to 0edf691 Compare December 19, 2024 15:44
@advancedxy
Copy link
Contributor Author

@universalmind303 Hi Cory, I think it's ready for the first round of review. Really appreciate it if you can take a look at this and let me know what changes need to be made.

@advancedxy advancedxy force-pushed the intersect_all_and_except_support branch from 0d81ac4 to cae1b42 Compare December 19, 2024 16:12
Copy link

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 92.55583% with 30 lines in your changes missing coverage. Please review.

Project coverage is 77.99%. Comparing base (ae74c10) to head (5a8ee7b).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-logical-plan/src/ops/set_operations.rs 87.04% 25 Missing ⚠️
src/daft-functions/src/list/list_fill.rs 96.47% 5 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3537      +/-   ##
==========================================
+ Coverage   77.84%   77.99%   +0.15%     
==========================================
  Files         718      720       +2     
  Lines       88250    88778     +528     
==========================================
+ Hits        68696    69246     +550     
+ Misses      19554    19532      -22     
Files with missing lines Coverage Δ
daft/dataframe/dataframe.py 85.47% <100.00%> (+0.18%) ⬆️
daft/logical/builder.py 90.22% <100.00%> (+0.53%) ⬆️
src/daft-core/src/array/ops/list.rs 95.32% <100.00%> (+0.22%) ⬆️
src/daft-core/src/series/ops/list.rs 63.47% <100.00%> (+0.66%) ⬆️
src/daft-logical-plan/src/builder.rs 92.22% <100.00%> (+0.09%) ⬆️
src/daft-functions/src/list/list_fill.rs 96.47% <96.47%> (ø)
src/daft-logical-plan/src/ops/set_operations.rs 72.82% <87.04%> (+37.78%) ⬆️

... and 12 files with indirect coverage changes

Copy link
Contributor

@universalmind303 universalmind303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor things, but overall looks good! Let me know once these are addressed and I can go ahead and merge this in. Thanks again @advancedxy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a couple tests for listfill in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, PTAL.

}

fn name(&self) -> &'static str {
"fill"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
"fill"
"list_fill"

let one_lit = lit(1);
let left_v_cnt = col(virtual_col_l)
.count(CountMode::Valid)
.alias("__v_l_cnt");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we change these static column names into consts, and put them all next to eachother.

@@ -255,6 +255,31 @@ fn list_sort_helper_fixed_size(
.collect()
}

fn general_list_fill_helper(element: &Series, num_array: &Int64Array) -> DaftResult<Vec<Series>> {
let num_iter = create_iter(num_array, element.len());
let mut result = vec![];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we preallocate the capacity here?

let mut result = Vec::with_capacity(...)

@advancedxy
Copy link
Contributor Author

hmm, I'm not sure why code speed is affected, it should not be related.

@advancedxy advancedxy closed this Dec 20, 2024
@advancedxy advancedxy reopened this Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants