Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support using general compression for numeric array #3020

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

niyue
Copy link
Contributor

@niyue niyue commented Oct 18, 2024

This PR aims to address issue #3019.

In this PR, a general compression encoder can now be selected for numeric arrays. I’ve reused the existing CompressedBufferEncoder struct from the codebase, which currently does not appear to be in use elsewhere.

@niyue niyue changed the title Support using general compression for numeric array feat: support using general compression for numeric array Oct 18, 2024
@github-actions github-actions bot added the enhancement New feature or request label Oct 18, 2024
compressed_bit_width as usize,
data_type.clone(),
)));
} else if let Some(compression) = field_meta.and_then(Self::get_field_compression) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When bitpacking is not applicable, check if field itself specifies some compression

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 93.33333% with 8 lines in your changes missing coverage. Please review.

Project coverage is 78.24%. Comparing base (f9024ce) to head (bb88df3).

Files with missing lines Patch % Lines
...st/lance-encoding/src/encodings/physical/binary.rs 61.53% 5 Missing ⚠️
...-encoding/src/encodings/physical/block_compress.rs 75.00% 2 Missing ⚠️
rust/lance-encoding/src/encoder.rs 98.98% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3020      +/-   ##
==========================================
+ Coverage   78.19%   78.24%   +0.04%     
==========================================
  Files         239      239              
  Lines       76782    76870      +88     
  Branches    76782    76870      +88     
==========================================
+ Hits        60043    60148     +105     
+ Misses      13669    13633      -36     
- Partials     3070     3089      +19     
Flag Coverage Δ
unittests 78.24% <93.33%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

let mut encoder: Option<Box<dyn ArrayEncoder>> = None;

if version >= LanceFileVersion::V2_1 {
if arrays[0].data_type() == data_type {
Copy link
Contributor Author

@niyue niyue Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace

I definitely think we should bitpack string offsets

If this is desirable, maybe the simplest change is to remove the condition if arrays[0].data_type() == data_type here so that bit packing would be applied. But this condition seems to be added prevously intentionally and I am not sure if this should be simply removed or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants