Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Non-Human-Readable Deserialization for Felt Using Serde #76

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jbcaron
Copy link
Collaborator

@jbcaron jbcaron commented Jun 21, 2024

Add Non-Human-Readable Deserialization for Felt Using Serde

Pull Request type

  • Feature

What is the current behavior?

Currently, the Felt type does not support custom serialization and deserialization for bincode and other formats that differentiate between human-readable and binary forms.

What is the new behavior?

  • Implemented Deserialize for Felt to correctly parse hex strings and sequences of 32 bytes, ensuring compatibility with bincode.

Does this introduce a breaking change?

No

Other information

Copy link
Collaborator

@0xLucqs 0xLucqs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a few tests ?

@jbcaron
Copy link
Collaborator Author

jbcaron commented Jun 21, 2024

Can you add a few tests ?

I added deserialization tests

@jbcaron jbcaron requested a review from 0xLucqs June 21, 2024 11:41
Copy link
Collaborator

@0xLucqs 0xLucqs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length of the sequence is always 32? is it enforced somewhere ? if not can you add a test where the length is < 32 and one where it is > 32 ?

@jbcaron
Copy link
Collaborator Author

jbcaron commented Jun 21, 2024

I've implemented error handling for sequence lengths that are not exactly 32 bytes during deserialization. To ensure we provide accurate and efficient error messages, I considered two different approaches. I would appreciate your feedback on which approach is better.

Approach 1:

if let Some(extra) = seq.size_hint() {
    if extra > 0 {
        return Err(de::Error::invalid_length(32 + extra, &"32 bytes"));
    }
}
  • Pros: Uses size_hint to quickly estimate the number of extra elements, potentially reducing the number of iterations.
  • Cons: size_hint might not always be accurate, and its behavior can vary based on the iterator implementation.

Approach 2:

if seq.next_element::<u8>()?.is_some() {
    let mut count = 33;
    while seq.next_element::<u8>()?.is_some() {
        count += 1;
    }
    return Err(de::Error::invalid_length(count, &"32 bytes"));
}
  • Pros: Directly iterates through elements to count extra ones, ensuring accurate length calculation.
  • Cons: Might be less efficient due to additional iterations, especially with many extra elements.

Which approach do you think is more suitable for this case?

@0xLucqs
Copy link
Collaborator

0xLucqs commented Jun 21, 2024

why do we want to enforce 32 bytes ? I think we should allow <= 32. I tend to prefer the 2nd approach as the 1st one is very unreliable

@jbcaron
Copy link
Collaborator Author

jbcaron commented Jun 21, 2024

why do we want to enforce 32 bytes ? I think we should allow <= 32. I tend to prefer the 2nd approach as the 1st one is very unreliable

if we allow <= 32, I think it's better to use little-endian representation instead of big-endian

@0xLucqs
Copy link
Collaborator

0xLucqs commented Jun 21, 2024

What would little endian add here ?

for (i, byte) in bytes.iter_mut().enumerate() {
*byte = seq
.next_element()?
.ok_or_else(|| de::Error::invalid_length(i, &"32 bytes"))?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For extra compactness, can't we accept sequences with fewer elements than 32?
Eg: Felt::ZERO could be serialized as [0], rather than [0; 32].
And all values under 255 on a single element, and so on.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are kinda doing this already with the human-readable form. Felt::ZERO is 0x0 not 0x000000000..0000

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I can see @LucasLvy already suggested this.
I agree with him

Token::U8(0),
Token::U8(0),
Token::Seq { len: Some(2) },
Token::U8(1 | 0x80),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this first element containing len and the compressed flag?
can't we just encode Felt::ONE as

&[
Token::Seq { len: Some(1) },
Token::U8(1),
]

And know that because we are using big endian repr, missing bytes are 0s and just be happy with it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serialization can be:

let be_bytes = self.to_bytes_be();
let mut i = match be_bytes.iter().find(|b| b != 0) {
 Some(i) => i,
 None => { // handle Felt::ZERO and return }
}
let mut seq = serializer.serialize_seq(Some(32 - i))?;
while i < 32 {
    seq.serialize_element(be_bytes[i])?;
    i += 1;
}
seq.end();

And deserialization:

let mut buffer = [0; 32];
let mut i = 32;
while i > 0 {
   if let Some(b) = seq.next_element()? {
         buffer[i-1] = b;
   } else {
           break;
   }
  i -= 1;
}
if i == 0 && seq.next_element()?.is_some() {
  // Return error invalid len
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't this way simpler impl work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using seq.next_element()? to infer lengths and handle missing bytes as zeros isn’t reliable for all serializers. Some serializers may not support this implicit handling, leading to incorrect deserialization. Explicitly encoding the length and compressed flag ensures that the deserializer knows exactly how many bytes to expect and process, making the format self-descriptive and robust across different implementations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't what we are already doing with Some(32 - i)?

And how exactly could it fail?

Copy link

@cchudant cchudant Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here we had with @jbcaron was to make a binary format that never exceeds 32 bytes in the worse case (abitrary hash), and compresses it down to a few bytes in the best cases (small felt)

To do that, we think of the input as a byte stream and we use the felt unused bits in the first byte to encode a flag telling us if the felt is compressed or not

When the felt is compressed we treat that first byte as the length, so we read len more bytes from the wire -
when it's not compressed, we treat it as the first byte of the hash and complete it by reading 31 mores bytes.

However, we found out that there is an issue: using serde serialize_seq (or serialize_bytes), bincode will prepend the length of the sequence before the felt, and we end up with >32 bytes felts :(
This is not bincode's problem though it's fundamental to how serde works

Taking inspiration from fixed-sized arrays implementation in serde, they are serialized using serialize_tuple, which requires the length to be known beforehand... and we don't know the length before the first byte.

That gave me an idea though - which I tried here https://gist.github.com/cchudant/9bbb025c60a9fe23ca56b895fdcff13c
It works by treating the Felt as a tuple of (u8, <rest>) with <rest> being another tuple that is DeserializeSeeded with the length
Nesting tuples like this has no effect on the actual bytes on the wire at the end, so we are free to do that
The resulting encoding never exceeds 32 bytes as intended and compresses stuff correctly

@tdelabro what do you think of that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I understand better.
I will redo my review keeping this in mind

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, why do you need the flag for?
Get rid of it and if you have 32 elements, it's not compressed. If you have less it is compressed.
The flag doesn't add anything there

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need the byte that contain flag+len only when you serialize [Felt].
Then, because the size in bytes of a Felt can change you have to do:
[flag+len, byte, .., byte, flag+len, byte, flag+len, byte, .., byte, ...]

Copy link
Collaborator

@tdelabro tdelabro Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my recomandation would be:

  1. impl Serde on Felt the way I described
  2. Create public method serialize_sequence_of_felt_packed() and deserialize_sequence_of_felt_packed() that can be called inside #[serde(de/serialize_with = "path")] and implement your self descriptive packed representation here

This way values that are a single felt will always be between 1 and 32 bytes.
Value that are a sequence of felt will be serialized as:

  • by default:
[
  SerdeLen, // Number of felt in sequence
  [SerdeLen, u8, u8, ...], // felt 1
  [SerdeLen, u8, u8, ...], // felt 2
  [SerdeLen, u8, u8, ...], // felt 3
]
  • using our custom methods:
[
  SerdeLen, // Number of felt in sequence
  OurLenAndFlagOrUnflaggedFirstByte // First Byte of the first felt
  u8,
  ...,
  u8, // Last byte of the first felt 
  OurLenAndFlagOrUnflaggedFirstByte // First Byte of the second felt
  ...,
]

This makes you spare one byte in the case of felts that are 32 bytes long and nothing for all the others. Not gonna lie, it's not a huge optimization but may still be worth for some people.

let first_significant_byte = bytes.iter().position(|&b| b != 0).unwrap_or(31);
if first_significant_byte > 1 {
let len = 32 - first_significant_byte;
let mut seq = serializer.serialize_seq(Some(len + 1))?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we are already storing the size. We should call serialize_eq with None in order to spare the extra byte which contains the len at the beginning.
Otherway there is no use to store the len a second time ourself

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes our worst case 33 bytes and our best 3 bytes (value, our len, their len). We want 32 and 2.

let mut seq = serializer.serialize_seq(Some(32))?;
for b in self.to_bytes_be() {
seq.serialize_element(&b)?;
let bytes = self.to_bytes_be();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the if else nesting of if first_significant_byte > 1 doing the following:

let be_bytes = self.to_bytes_be();
let mut i = match be_bytes.iter().find(|b| b != 0) {
 Some(i) => i,
 None => { // handle Felt::ZERO and return }
}
let mut seq = serializer.serialize_seq(None)?;
if i != 0 {
  seq.serialize_element(32 - i | COMPRESSED)?;
}
while i < 32 {
    seq.serialize_element(be_bytes[i])?;
    i += 1;
}
seq.end();

Token::U8(0),
Token::U8(0),
Token::Seq { len: Some(2) },
Token::U8(1 | 0x80),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, why do you need the flag for?
Get rid of it and if you have 32 elements, it's not compressed. If you have less it is compressed.
The flag doesn't add anything there

@jbcaron jbcaron force-pushed the feat/deserialize branch from 311747f to fd2299f Compare June 27, 2024 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants