Have head() traverse all partitions #419

Merged
merged 6 commits into from
Apr 4, 2024
Collaborator

@wilsonbb wilsonbb commented Mar 28, 2024

As pointed out in #380, it is not uncommon for partitions to be empty. This can lead to unintuitive results for users, since head() by default only looks at the first partition.

To make head(n) more intuitive, we can have it search across all partitions, stopping only once there are n rows to return. This differs from Dask's behavior, where a user must explicitly choose how many partitions to search across.
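To illustrate why the default can surprise users, here is a simplified pure-Python model of the Dask-style semantics. It is a sketch, not Dask's actual code: partitions are modeled as plain lists of row dicts, and `head_first_partitions` is a hypothetical name. It mirrors the `npartitions` argument of Dask's `head`, which defaults to 1 and accepts -1 for all partitions.

```python
from typing import List, Sequence


def head_first_partitions(
    partitions: Sequence[List[dict]], n: int = 5, npartitions: int = 1
) -> List[dict]:
    """Simplified model: only the first `npartitions` partitions are searched."""
    if npartitions == -1:
        # -1 means "use every partition".
        npartitions = len(partitions)
    rows: List[dict] = []
    for part in partitions[:npartitions]:
        rows.extend(part)
    return rows[:n]


# The first partition is empty, as described in #380.
partitions = [[], [{"id": 1}, {"id": 2}], [{"id": 3}]]

default_result = head_first_partitions(partitions, n=2)
all_result = head_first_partitions(partitions, n=2, npartitions=-1)

print(default_result)  # no rows, even though the data is not empty
print(all_result)      # the first two rows across all partitions
```

With the default, the caller gets nothing back unless they already know to widen the search.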

Warning: this removes the npartitions argument previously available in EnsembleFrame.head

  • My PR includes a link to the issue that I am addressing

Solution Description

Here we iterate over the partitions one at a time, collecting rows until we have found enough, similar to the implementation of head() in LSDB.
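The approach described above can be sketched roughly as follows. This is not the code from this PR: it is a minimal pure-Python illustration in which each partition is a hypothetical zero-argument callable standing in for a lazy partition compute, and `head_across_partitions` and `lazy` are names made up for the example.

```python
from typing import Callable, Iterable, List

Partition = List[dict]  # stand-in for one computed partition


def head_across_partitions(
    lazy_partitions: Iterable[Callable[[], Partition]], n: int = 5
) -> List[dict]:
    """Collect up to n rows, computing partitions in order and stopping early."""
    if n <= 0:
        return []
    rows: List[dict] = []
    for compute in lazy_partitions:
        remaining = n - len(rows)
        # Take only as many rows as are still needed from this partition.
        rows.extend(compute()[:remaining])
        if len(rows) >= n:
            break  # later partitions are never computed
    return rows


computed: List[int] = []  # records which partitions were actually computed


def lazy(index: int, rows: Partition) -> Callable[[], Partition]:
    """Wrap rows in a callable that logs when it is computed."""
    def compute() -> Partition:
        computed.append(index)
        return rows
    return compute


lazy_partitions = [
    lazy(0, []),
    lazy(1, [{"id": 1}, {"id": 2}]),
    lazy(2, []),
    lazy(3, [{"id": 3}, {"id": 4}]),
    lazy(4, [{"id": 5}]),
]

result = head_across_partitions(lazy_partitions, n=3)
print(result)
print(computed)
```

Empty partitions are skipped transparently, and the last partition is never computed because three rows have already been found by then.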

Code Quality

  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)


github-actions bot commented Mar 28, 2024

| Before [d0af3ce] | After [a693f5d] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 33.5±0.4ms | 33.6±0.5ms | 1 | benchmarks.time_batch |
| 34.7±0.4ms | 33.9±0.2ms | 0.98 | benchmarks.time_prune_sync_workflow |


codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.82%. Comparing base (d0af3ce) to head (cd026ca).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #419      +/-   ##
==========================================
+ Coverage   95.78%   95.82%   +0.03%     
==========================================
  Files          25       25              
  Lines        1755     1771      +16     
==========================================
+ Hits         1681     1697      +16     
  Misses         74       74              


@wilsonbb wilsonbb marked this pull request as ready for review March 28, 2024 23:45
@wilsonbb wilsonbb requested a review from dougbrn March 28, 2024 23:45
Collaborator

@dougbrn dougbrn left a comment

This looks good!

Review thread on src/tape/ensemble_frame.py (outdated, resolved)
@wilsonbb wilsonbb merged commit d72d1e7 into main Apr 4, 2024
10 checks passed
@wilsonbb wilsonbb deleted the new_head branch April 4, 2024 22:55
2 participants