
Improve performance of casting DictionaryArray to StringViewArray #5871

Merged Jun 13, 2024 · 6 commits
87 changes: 87 additions & 0 deletions arrow-cast/src/cast/dictionary.rs
@@ -85,10 +85,97 @@ pub(crate) fn dictionary_cast<K: ArrowDictionaryKeyType>(

Ok(new_array)
}
Utf8View => {
// `unpack_dictionary` can handle Utf8View/BinaryView types, but it incurs an
// unnecessary copy of the value buffer.
// We handle these cases here to avoid that copy.
let dict_array = array
.as_dictionary::<K>()
.downcast_dict::<StringArray>()
.unwrap();

let string_values = dict_array.values();
let value_offsets = string_values.value_offsets();
let value_buffer = string_values.values().clone();

let view_buffer =
view_from_dict_values(value_offsets, &value_buffer, dict_array.keys());

// Safety:
// the buffer is from StringArray which is utf8.
let string_view = unsafe {
StringViewArray::new_unchecked(
view_buffer,
vec![value_buffer],
dict_array.nulls().cloned(),
Contributor:
I think calling nulls() doesn't handle the case where the dictionary value itself (rather than the key) is null:

fn nulls(&self) -> Option<&NullBuffer> {

I think this should call logical_nulls() instead:

fn logical_nulls(&self) -> Option<NullBuffer> {

Also, it would be good to create a test case that covers this.

Contributor (Author):

Nice catch! Added a new test to cover this.

Contributor:

(FYI: there seem to be no new commits on this PR.)

Contributor (Author):

Added now!

)
};
Ok(Arc::new(string_view))
}
BinaryView => {
// `unpack_dictionary` can handle Utf8View/BinaryView types, but it incurs an
// unnecessary copy of the value buffer.
// We handle these cases here to avoid that copy.
let dict_array = array
.as_dictionary::<K>()
.downcast_dict::<BinaryArray>()
.unwrap();

let binary_values = dict_array.values();
let value_offsets = binary_values.value_offsets();
let value_buffer = binary_values.values().clone();

let view_buffer =
view_from_dict_values(value_offsets, &value_buffer, dict_array.keys());
let binary_view = unsafe {
BinaryViewArray::new_unchecked(
view_buffer,
vec![value_buffer],
dict_array.nulls().cloned(),
)
};
Ok(Arc::new(binary_view))
}
_ => unpack_dictionary::<K>(array, to_type, cast_options),
}
}
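The `nulls()` vs `logical_nulls()` distinction raised in the review comments can be modeled in plain Rust (a minimal sketch with illustrative names, not arrow-rs APIs): a dictionary element is logically null if either its key is null, or the value the key points at is null. Looking only at the key validity misses the second case.

```rust
// Sketch: arrow's NullBuffer stores *validity* (true = valid), so this
// returns a validity vector. `dictionary_logical_validity` is a
// hypothetical helper name, not an arrow-rs function.
fn dictionary_logical_validity(
    keys: &[Option<usize>],
    value_valid: &[bool],
) -> Vec<bool> {
    keys.iter()
        .map(|k| match k {
            None => false,              // key itself is null
            Some(i) => value_valid[*i], // inherit the value's validity
        })
        .collect()
}

fn main() {
    // values: ["a", null, "c"]; keys: [0, 1, null, 2]
    let value_valid = [true, false, true];
    let keys = [Some(0), Some(1), None, Some(2)];
    let validity = dictionary_logical_validity(&keys, &value_valid);
    // element 1 points at a null *value*: only logical nulls catch it
    assert_eq!(validity, vec![true, false, false, true]);
}
```

Calling `nulls()` on the dictionary here would only report element 2 as null, which is why the fix switches to `logical_nulls()`.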

fn view_from_dict_values<K: ArrowDictionaryKeyType>(
value_offsets: &[i32],
value_buffer: &arrow_buffer::Buffer,
keys: &PrimitiveArray<K>,
) -> ScalarBuffer<u128> {
let mut view_builder = BufferBuilder::<u128>::new(keys.len());
for i in keys.iter() {
Contributor:

Another potential optimization is a separate loop for the case where there are no nulls in the keys (so we can avoid the branch).

Another potential idea is to use value_offsets.windows(2) as an iterator, to avoid the bounds checks on value_offsets.

However, I think we should merge this basic PR as is, then add a benchmark and optimize this kernel in a follow-on PR (if we care). I can file a ticket if @tustvold agrees.

Contributor (Author):

Oops, I checked in an append_view_unchecked.

But if we use try_append_view, it is even slower than the unpack_dictionary approach.

Contributor (Author):

I can also move append_view_unchecked to a follow-up PR and potentially clean up other places where ByteViews are manually constructed.

match i {
Some(v) => {
let idx = v.to_usize().unwrap();
let offset = value_offsets[idx];
let end = value_offsets[idx + 1];
let length = end - offset;
let value_buf = &value_buffer[offset as usize..end as usize];

if length <= 12 {
Contributor:

Instead of creating the views directly, what do you think about using try_append_view and try_append_block, added in this PR: #5796?

As a bonus, you could add a benchmark for this casting operation and see whether adding append_view_unchecked makes any difference.

Contributor (Author):

It turns out that adding append_view_unchecked improved the performance by 50%:

dict to view            time:   [38.116 µs 38.122 µs 38.127 µs]
                        change: [-50.618% -50.455% -50.290%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

I think it justifies adding append_view_unchecked, what do you think?

Also, should I add the benchmark to the repo? It's a bit tricky to set up two versions of the cast implementation.

Contributor:

> I think it justifies adding append_view_unchecked, what do you think?

Makes sense to me.

> Also, should I add the benchmark to the repo? It's a bit tricky to set up two versions of the cast implementation.

I do think we should add the benchmark to the repo (so we can use it for future optimizations).

In terms of justifying append_view_unchecked, I think running the benchmark on a local checkout that calls append_view, then reverting and running the same benchmark, is fine (which is presumably what you did).

let mut view_buffer = [0; 16];
view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
view_buffer[4..4 + value_buf.len()].copy_from_slice(value_buf);
view_builder.append(u128::from_le_bytes(view_buffer));
} else {
let view = ByteView {
length: length as u32,
prefix: u32::from_le_bytes(value_buf[0..4].try_into().unwrap()),
buffer_index: 0,
offset: offset as u32,
};
view_builder.append(view.into());
}
}
None => {
view_builder.append_n_zeroed(1);
}
}
}
ScalarBuffer::new(view_builder.finish(), 0, keys.len())
}
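The u128 view encoding built above can be sketched stdlib-only (a minimal model based on the Arrow byte-view layout; `make_view` is an illustrative helper, not an arrow-rs API): the low 4 bytes hold the length; values of at most 12 bytes are stored inline in the remaining 12 bytes; longer values store a 4-byte prefix, the buffer index, and the offset into that buffer.

```rust
// Little-endian layout: [length: u32][inline data (<=12B)]
//                    or [length: u32][prefix: 4B][buffer_index: u32][offset: u32]
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let length = value.len() as u32;
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&length.to_le_bytes());
    if value.len() <= 12 {
        // short value: data is inlined, no buffer lookup needed at read time
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        view[4..8].copy_from_slice(&value[0..4]); // prefix for fast compares
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(view)
}

fn main() {
    let short = make_view(b"hello", 0, 0);
    assert_eq!(short as u32, 5); // length lives in the low 32 bits

    let long = make_view(b"large payload over 12 bytes", 0, 42);
    assert_eq!(long as u32, 27);
    assert_eq!((long >> 96) as u32, 42); // offset lives in the high 32 bits
}
```

This is why the cast above can reuse the dictionary's value buffer directly: long values only need a (buffer_index, offset, length) reference, so no bytes are copied.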

// Unpack a dictionary where the keys are of type <K> into a flattened array of type to_type
pub(crate) fn unpack_dictionary<K>(
array: &dyn Array,
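The reviewer's `windows(2)` idea can be sketched stdlib-only (illustrative, not the arrow-rs kernel): iterating adjacent offset pairs, instead of indexing `value_offsets[idx]` and `value_offsets[idx + 1]`, lets the compiler elide per-element bounds checks when walking all values in order.

```rust
// Hypothetical helper: precompute value lengths from an Arrow-style
// offsets slice (offsets[i]..offsets[i + 1] spans value i).
fn value_lengths(value_offsets: &[i32]) -> Vec<i32> {
    value_offsets
        .windows(2)                    // one bounds check per window, not two per index
        .map(|pair| pair[1] - pair[0])
        .collect()
}

fn main() {
    // offsets for values of lengths 5, 5, 27
    let offsets = [0, 5, 10, 37];
    assert_eq!(value_lengths(&offsets), vec![5, 5, 27]);
}
```

Since dictionary keys access the offsets in arbitrary order, this applies to a precomputation pass rather than the keyed lookup itself; whether it pays off is exactly what the proposed follow-up benchmark would answer.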
66 changes: 34 additions & 32 deletions arrow-cast/src/cast/mod.rs
@@ -5203,19 +5203,19 @@ mod tests {
_test_string_to_view::<i64>();
}

const VIEW_TEST_DATA: [Option<&str>; 5] = [
Some("hello"),
Some("world"),
None,
Some("large payload over 12 bytes"),
Some("lulu"),
];

fn _test_string_to_view<O>()
where
O: OffsetSizeTrait,
{
let data = vec![
Some("hello"),
Some("world"),
None,
Some("large payload over 12 bytes"),
Some("lulu"),
];

let string_array = GenericStringArray::<O>::from(data.clone());
let string_array = GenericStringArray::<O>::from_iter(VIEW_TEST_DATA);

assert!(can_cast_types(
string_array.data_type(),
@@ -5225,7 +5225,7 @@
let string_view_array = cast(&string_array, &DataType::Utf8View).unwrap();
assert_eq!(string_view_array.data_type(), &DataType::Utf8View);

let expect_string_view_array = StringViewArray::from(data);
let expect_string_view_array = StringViewArray::from_iter(VIEW_TEST_DATA);
assert_eq!(string_view_array.as_ref(), &expect_string_view_array);
}

@@ -5239,15 +5239,7 @@
where
O: OffsetSizeTrait,
{
let data: Vec<Option<&[u8]>> = vec![
Some(b"hello"),
Some(b"world"),
None,
Some(b"large payload over 12 bytes"),
Some(b"lulu"),
];

let binary_array = GenericBinaryArray::<O>::from(data.clone());
let binary_array = GenericBinaryArray::<O>::from_iter(VIEW_TEST_DATA);

assert!(can_cast_types(
binary_array.data_type(),
@@ -5257,10 +5249,30 @@
let binary_view_array = cast(&binary_array, &DataType::BinaryView).unwrap();
assert_eq!(binary_view_array.data_type(), &DataType::BinaryView);

let expect_binary_view_array = BinaryViewArray::from(data);
let expect_binary_view_array = BinaryViewArray::from_iter(VIEW_TEST_DATA);
assert_eq!(binary_view_array.as_ref(), &expect_binary_view_array);
}

#[test]
fn test_dict_to_view() {
let string_view_array = StringViewArray::from_iter(VIEW_TEST_DATA);
let string_dict_array: DictionaryArray<Int8Type> = VIEW_TEST_DATA.into_iter().collect();
Contributor:

Can you please update this test to have a dictionary array that has:

  1. Repeated use of dictionary values
  2. Keys that are not all increasing
  3. Nulls in the values (as well as the keys)

Perhaps you could create a StringArray from VIEW_TEST_DATA and then create the indexes manually, like this example: https://docs.rs/arrow/latest/arrow/array/struct.DictionaryArray.html#example-from-existing-arrays

Does that make sense?

let expected_string_array_type = string_view_array.data_type();
let casted_string_array = cast(&string_dict_array, expected_string_array_type).unwrap();
assert_eq!(casted_string_array.data_type(), expected_string_array_type);
assert_eq!(casted_string_array.as_ref(), &string_view_array);

let binary_view_array = BinaryViewArray::from_iter(VIEW_TEST_DATA);
let binary_dict_array = string_dict_array.downcast_dict::<StringArray>().unwrap();
let binary_buffer = cast(&binary_dict_array.values(), &DataType::Binary).unwrap();
let binary_dict_array =
DictionaryArray::<Int8Type>::new(binary_dict_array.keys().clone(), binary_buffer);
let expected_binary_array_type = binary_view_array.data_type();
let casted_binary_array = cast(&binary_dict_array, expected_binary_array_type).unwrap();
assert_eq!(casted_binary_array.data_type(), expected_binary_array_type);
assert_eq!(casted_binary_array.as_ref(), &binary_view_array);
}
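What this test verifies can be modeled in plain Rust (a sketch with illustrative names, not arrow-rs APIs): casting a dictionary must yield the same logical sequence as looking each key up in the values array, including repeated and non-increasing keys and nulls in both the keys and the values — the properties the reviewer asked the test to cover.

```rust
// Hypothetical model of dictionary expansion: a null key, or a key
// pointing at a null value, both yield a null element.
fn expand_dictionary<'a>(
    keys: &[Option<usize>],
    values: &[Option<&'a str>],
) -> Vec<Option<&'a str>> {
    keys.iter()
        .copied()
        .map(|k| k.and_then(|i| values[i]))
        .collect()
}

fn main() {
    // values: ["hello", null, "lulu"]
    let values = [Some("hello"), None, Some("lulu")];
    // repeated + non-increasing keys, plus a null key
    let keys = [Some(2), Some(0), Some(2), None, Some(1)];
    assert_eq!(
        expand_dictionary(&keys, &values),
        vec![Some("lulu"), Some("hello"), Some("lulu"), None, None]
    );
}
```

A correct cast to `StringViewArray` must match this expansion element for element, which is what the `assert_eq!` against `string_view_array` checks.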

#[test]
fn test_view_to_string() {
_test_view_to_string::<i32>();
@@ -5271,24 +5283,15 @@
where
O: OffsetSizeTrait,
{
let data: Vec<Option<&str>> = vec![
Some("hello"),
Some("world"),
None,
Some("large payload over 12 bytes"),
Some("lulu"),
];

let view_array = {
// ["hello", "world", null, "large payload over 12 bytes", "lulu"]
let mut builder = StringViewBuilder::new().with_block_size(8); // multiple buffers.
for s in data.iter() {
for s in VIEW_TEST_DATA.iter() {
builder.append_option(*s);
}
builder.finish()
};

let expected_string_array = GenericStringArray::<O>::from(data);
let expected_string_array = GenericStringArray::<O>::from_iter(VIEW_TEST_DATA);
let expected_type = expected_string_array.data_type();

assert!(can_cast_types(view_array.data_type(), expected_type));
@@ -5318,7 +5321,6 @@ mod tests {
];

let view_array = {
// ["hello", "world", null, "large payload over 12 bytes", "lulu"]
let mut builder = BinaryViewBuilder::new().with_block_size(8); // multiple buffers.
for s in data.iter() {
builder.append_option(*s);