
Slow scrypto_decode::<ScryptoValue> for wasm source code #1708

Open
bbarwik opened this issue Feb 2, 2024 · 4 comments

Comments

@bbarwik (Contributor) commented Feb 2, 2024

Hey,

I was experimenting with fuzzing, but it was slow. I spent some time tracking down the cause, and found it.
The problem is this code:

let value: ScryptoValue = scrypto_decode(&value).unwrap();

especially in system_struct_to_node_substates.

It is called on WASM code, which is usually huge and sometimes contains over 1M elements. The payload contains a Vec that is parsed with:

            ValueKind::Array => {
                let element_value_kind = decoder.read_value_kind()?;
                let length = decoder.read_size()?;
                let mut elements = Vec::with_capacity(if length <= 1024 { length } else { 1024 });
                for _ in 0..length {
                    elements.push(decoder.decode_deeper_body_with_value_kind(element_value_kind)?);
                }
                Ok(Value::Array {
                    element_value_kind,
                    elements,
                })
            }

So decoder.decode_deeper_body_with_value_kind(element_value_kind) can be called up to 1M times, which takes a lot of time, 50-100 ms. I tried to optimize it with:

            ValueKind::Array => {
                let element_value_kind = decoder.read_value_kind()?;
                match element_value_kind {
                    ValueKind::U8 => {
                        let length = decoder.read_size()?;
                        let slice = decoder.read_slice(length as usize)?;
                        Ok(Value::Array {
                            element_value_kind,
                            elements: slice.iter().map(|v| Value::U8 { value: *v }).collect(),
                        })
                    }
                    // ... other element kinds fall back to the generic per-element loop
                }
            }

which makes it 4.5x faster, though it's not an elegant solution, and an elegant one is hard to come by. A Value::U8 occupies 48 bytes of memory, so allocating 1M * 48 bytes probably dominates the time. I don't see how to make it faster than that; maybe you'll find a better idea.
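For intuition on why those 48 bytes hurt: in Rust, every variant of an enum occupies the size of the largest variant, so each decoded byte pays the full Value footprint. A minimal, hypothetical stand-in (not the real sbor Value definition, which is generic and has many more variants) demonstrates the effect:

```rust
use std::mem;

// Hypothetical, simplified stand-in for the SBOR Value enum. The largest
// variant dictates the size of the whole enum, so even a one-byte U8
// element occupies the full enum footprint once stored in a Vec<Value>.
#[allow(dead_code)]
enum Value {
    U8 { value: u8 },
    U128 { value: u128 },
    Array { element_value_kind: u8, elements: Vec<Value> },
}

fn main() {
    let per_element = mem::size_of::<Value>();
    // Each decoded byte costs the full enum size, not one byte, so a
    // 1M-element byte array allocates roughly per_element megabytes.
    assert!(per_element > mem::size_of::<u8>());
    println!("{per_element} bytes per array element");
}
```

The real sbor Value reaches the 48 bytes measured above the same way: its largest variant sets the per-element cost for every array element.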

I also include a log from perf showing this issue:
[perf log screenshot]

@iamyulong (Member)

Thank you @bbarwik for the detailed analysis and report!

Yes, u8 array decoding is indeed very slow, and your attempted improvement is a nice approach. We applied a similar optimization for "regular" Vec<u8> decoding.

To further reduce the memory footprint, we will likely need to split Value::Array into Value::U8Array and Value::NonU8Array. This would create a much more compact in-memory representation of bytes. That said, it isn't a trivial change, given how widely the Value model is used.
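A rough sketch of that split (illustrative only, not the actual sbor types): byte arrays get their own variant backed by a plain Vec<u8>, shrinking storage from one full Value per element to one byte per element.

```rust
// Illustrative sketch, not the real sbor Value model: a dedicated variant
// stores byte arrays compactly as Vec<u8>, while other arrays keep the
// existing Vec<Value> representation.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum ValueKind {
    U8,
    U32,
}

#[allow(dead_code)]
enum Value {
    U8 { value: u8 },
    U8Array { elements: Vec<u8> },
    NonU8Array {
        element_value_kind: ValueKind,
        elements: Vec<Value>,
    },
}

fn main() {
    let bytes = Value::U8Array { elements: vec![1, 2, 3] };
    if let Value::U8Array { elements } = bytes {
        // One byte per element, instead of one full Value per element.
        assert_eq!(elements.len(), 3);
    }
}
```

The trade-off, as noted, is that every consumer of the Value model would then have to handle two array variants.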

@bbarwik (Contributor, Author) commented Feb 2, 2024

@iamyulong

Adding U8Array to the enum would be hard to do without breaking compatibility.

Also, a small correction: the 4.5x figure came from benchmarking system_struct_to_node_substates. When I benchmarked only scrypto_decode, it was ~10x faster for 400k elements. I also tried an unsafe approach with manual memory allocation:

                        unsafe {
                            let align = mem::align_of::<Value<X, Y>>();
                            let val_size = mem::size_of::<Value<X, Y>>();
                            let layout = std::alloc::Layout::from_size_align(length * val_size, align).unwrap();
                            let raw_memory = std::alloc::alloc(layout) as *mut Value<X, Y>;
                            for i in 0..length as isize {
                                let byte_ptr = raw_memory.offset(i) as *mut u8;
                                ptr::write(byte_ptr, 0x06); // it's U8
                                ptr::write(byte_ptr.offset(1), slice[i as usize]);
                            }
                            Ok(Value::Array {
                                element_value_kind,
                                elements: Vec::from_raw_parts(raw_memory, length, length),
                            })
                        }

and it wasn't faster. So I think slice.iter().map(|v| Value::U8 { value: *v }).collect() is a good enough solution for this case; it just needs a nicer implementation, perhaps a macro covering every kind from I8 to U128 without a huge match. The same should be done for other Vec<T> element types, not just Vec<u8>.
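A macro along those lines might look like the following minimal sketch (hypothetical stand-in types, not the real sbor decoder): it stamps out one fast decode path per fixed-width integer kind, reading little-endian chunks straight from the slice.

```rust
// Hypothetical sketch of the suggested macro: one fast decode path per
// fixed-width integer kind, without writing out a huge match by hand.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum Value {
    U8(u8),
    U16(u16),
    U32(u32),
}

macro_rules! decode_int_array {
    ($slice:expr, $ty:ty, $variant:ident) => {{
        const WIDTH: usize = std::mem::size_of::<$ty>();
        $slice
            .chunks_exact(WIDTH)
            .map(|chunk| Value::$variant(<$ty>::from_le_bytes(chunk.try_into().unwrap())))
            .collect::<Vec<Value>>()
    }};
}

fn main() {
    // Two little-endian u16 values: 1 and 2.
    let raw = [1u8, 0, 2, 0];
    let values = decode_int_array!(&raw[..], u16, U16);
    assert_eq!(values, vec![Value::U16(1), Value::U16(2)]);
}
```

Each invocation expands to a tight loop over fixed-size chunks, so the per-kind fast paths stay one line each at the call site.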

The other, smaller problem is encoding Value::Array: it takes around 1 ms for 400k elements.

            Value::Array {
                element_value_kind,
                elements,
            } => {
                encoder.write_value_kind(*element_value_kind)?;
                encoder.write_size(elements.len())?;
                for item in elements {
                    if item.get_value_kind() != *element_value_kind {
                        return Err(EncodeError::MismatchingArrayElementValueKind {
                            element_value_kind: element_value_kind.as_u8(),
                            actual_value_kind: item.get_value_kind().as_u8(),
                        });
                    }
                    encoder.encode_deeper_body(item)?;
                }
            }

I tested many solutions and this one seems best; it's ~4x faster:

                    ValueKind::U8 => {
                        let buffer: &mut Vec<u8> = encoder.get_buffer();
                        buffer.reserve(elements.len() + 1024); // extra headroom, since this array is rarely the whole payload
                        for item in elements {
                            match item {
                                Value::U8 { value } => {
                                    buffer.push(*value);
                                }
                                _ => {
                                    return Err(EncodeError::MismatchingArrayElementValueKind {
                                        element_value_kind: element_value_kind.as_u8(),
                                        actual_value_kind: item.get_value_kind().as_u8(),
                                    });
                                }
                            }
                        }
                    }

With these two optimizations, system_struct_to_node_substates is a few times faster than the original for a simple package.

@iamyulong (Member)

Sounds great! We will add this to the implementation in due course.

@dhedey (Contributor) commented Jul 23, 2024

Hi @bbarwik - thanks for raising this!

I've only just seen this - but you raise an excellent point, and it's a reason I've been pushing against ScryptoValue's use in the engine at all (and certainly for critical paths with arbitrary payloads).

Instead, I created RawScryptoValue for these cases, which validates the payload, but leaves it "raw" as bytes (and not even copying the bytes if it doesn't need to). So I expect the function would get a very big speed-up, just swapping out ScryptoValue for RawScryptoValue.
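The idea can be illustrated with a toy payload format (this is only a sketch; the real RawScryptoValue and the SBOR wire format are more involved): validate that the bytes are well-formed, but keep them as bytes instead of materializing a Value tree.

```rust
// Toy illustration of "validate but stay raw": check well-formedness and
// hold on to the original byte slice, with zero per-element allocation.
// This is NOT the real RawScryptoValue API or the SBOR wire format.
struct RawValue<'a> {
    bytes: &'a [u8],
}

// Toy format: [length: u32 little-endian][length payload bytes].
fn validate_raw(bytes: &[u8]) -> Option<RawValue<'_>> {
    let len = u32::from_le_bytes(bytes.get(..4)?.try_into().ok()?) as usize;
    if bytes.len() == 4 + len {
        Some(RawValue { bytes })
    } else {
        None
    }
}

fn main() {
    let payload = [3u8, 0, 0, 0, 10, 20, 30];
    let raw = validate_raw(&payload).expect("well-formed payload");
    // The payload is still just bytes; nothing was decoded into a tree.
    assert_eq!(raw.bytes.len(), 7);
    // A truncated payload fails validation.
    assert!(validate_raw(&payload[..5]).is_none());
}
```

The key property is that validation cost is a single pass over the bytes, while the expensive per-element Vec<Value> allocation never happens.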

We'll also consider special-casing U8 decoding for the 4.5x speed-up (although it will need a slight tweak to get the depth correct; I've been planning a change to depth tracking for a while which will work well alongside your proposal).
