# Document dense encoding of invalid pushdata in EOFv0 #98

Open · wants to merge 16 commits into base: `main`
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
/.idea
__pycache__
/corpus
/venv
121 changes: 121 additions & 0 deletions spec/eofv0_verkle.md
@@ -105,6 +105,127 @@
The same as above except encode the values as 6-bit numbers
(minimum number of bits needed for encoding `32`).
Such encoding lowers the size overhead from 3.1% to 2.3%.

### Encode only invalid jumpdests (dense encoding)

An alternative option is, instead of encoding all valid `JUMPDEST` locations, to encode only the invalid ones.
By an invalid `JUMPDEST` we mean a `0x5b` byte occurring in any pushdata.

This is beneficial because most contracts contain only a limited number of offending cases.
Our initial analysis of the top 1000 bytecodes used in the last year confirms this:
only 0.07% of bytecode bytes are invalid jumpdests.
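
As a minimal illustration, a `0x5b` byte that is part of a push immediate looks like a `JUMPDEST` to a naive scan but is not a valid jump target (the `code` variable here is hypothetical):

```python
# Assembled bytecode for: PUSH1 0x5b; JUMPDEST
code = bytes.fromhex("605b5b")
# code[1] == 0x5b but is pushdata (PUSH1's immediate): an invalid jumpdest.
# code[2] == 0x5b and is a real instruction: a valid JUMPDEST.
```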

Let's create a map of `invalid_jumpdests[chunk_index] = first_instruction_offset`. We can densely encode this
map using techniques similar to *run-length encoding*: skipping distances and delta-encoding indexes.
This map is always fully loaded prior to execution, so it is important to ensure the encoded
version is as dense as possible (without sacrificing complexity).

> **Review comment:** Note to self: see how much of those costs could be covered by the 21000 gas.

We propose an encoding which uses [VLQ](https://en.wikipedia.org/wiki/Variable-length_quantity):

For each entry `index, first_instruction_offset` in `invalid_jumpdests`:

- Compute the chunk index distance to the previously encoded chunk: `delta = index - last_chunk_index - 1`.
- Combine the two numbers into a single unsigned integer: `entry = delta * 33 + first_instruction_offset`.
  This is reversible because `first_instruction_offset < 33`.
- Encode `entry` into a sequence of bytes using VLQ (e.g. LEB128), as sketched below.
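
A minimal sketch of the pack-and-encode step (the `uleb128_encode` helper is illustrative; the value `1225 -> c9 09` matches the worked example further down):

```python
def uleb128_encode(value: int) -> bytes:
    # Emit 7 bits per byte, least significant group first;
    # the high bit of each byte marks continuation.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Packing is reversible because first_instruction_offset < 33.
delta, offset = 37, 4
entry = delta * 33 + offset             # 1225
assert divmod(entry, 33) == (37, 4)     # unpacking recovers both numbers
assert uleb128_encode(entry) == bytes.fromhex("c909")
```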

In the worst case, where every chunk contains an invalid `JUMPDEST`, the encoding length equals
the number of chunks in the code, i.e. a size overhead of 3.1%.

| code size limit | code chunks | encoding chunks |
|-----------------|-------------|-----------------|
| 24576 | 768 | 24 |
| 32768 | 1024 | 32 |
| 65536 | 2048 | 64 |
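
A minimal sketch of the arithmetic behind this table, assuming 32-byte chunks and, in the worst case, one single-byte VLQ entry per code chunk (`delta = 0` gives `entry < 33`, which always fits in one byte):

```python
def worst_case_encoding_chunks(code_size_limit: int, chunk_size: int = 32) -> int:
    code_chunks = -(-code_size_limit // chunk_size)  # ceiling division
    # Worst case: every code chunk holds an invalid JUMPDEST and each
    # entry occupies exactly one byte of the encoding.
    encoding_bytes = code_chunks
    return -(-encoding_bytes // chunk_size)

for limit in (24576, 32768, 65536):
    print(limit, worst_case_encoding_chunks(limit))  # 24, 32, 64
```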

Our current hunch is that for average contracts this results in ~0.1% overhead, while the worst case is 3.1%.
This is strictly better than the 3.2% overhead of the current Verkle code chunking.

Stats from the "top 1000 bytecodes used in the last year" dataset:

```
total code length: 11785831
total encoding length: 11693 (0.099%)
encoding chunks distribution:
0: 109 (10.9%)
1: 838 (83.8%)
2: 49 ( 4.9%)
3: 4 ( 0.4%)
```

#### Encoding example

The top used bytecode: [0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2](https://etherscan.io/address/0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2) (WETH).

```
length: 3124
chunks: 98

chunks with invalid jumpdests:
chunk_index delta first_instruction_offset entry leb128
37 37 4 1225 c909
49 11 12 375 f702
50 0 14 14 0e
87 36 13 1201 b109
```

#### Header location

It is possible to place the above encoding as part of the "EOFv0" header, but given the upper bound on the number of chunks occupied is low (33 vs 21),
it is also possible to make it part of the Verkle account header.
> **Review comment:** Yeah, but if we want to increase the maximum code size to 64k, there won't be enough space left for it in the header.
>
> **Member Author:** With scheme 1 it is still 56 verkle leafs for 64k code in worst case. That should still easily fit into the 128 "special" first header leafs.
>
> **Member:** I think we definitely need a variadic length of this section because the average case (1–2 chunks) is much different from the worst case (20–30 chunks). I.e. you don't want to reserve ~60 chunks in the tree just to use 2 on average.


This second option also allows the `code_size` value to be simplified, as it does not need to change.
> **Review comment:** By "second option", you mean "adding it to the account header", not "Scheme 2", right? I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.
>
> **Member Author:**
> > By "second option", you mean "adding it to the account header", not "Scheme 2", right?
>
> Yes.
>
> > I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.
>
> No, because I'd imagine the account header (i.e. not code leafs/keys) would be handled separately, so the actual EVM code remains verbatim.


#### Runtime after Verkle

During execution of a jump, two checks must be done in this order:

1. Check that the jump destination is the `JUMPDEST` opcode.
2. Check whether the jump destination's chunk is in the `invalid_jumpdests` map.
   If it is, the jumpdest analysis of the chunk must be performed
   to confirm the jump destination is not pushdata (see the sketch below).
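
A minimal sketch of this two-step check, assuming 32-byte chunks (the `is_valid_jump_target` helper and its signature are illustrative, not part of the spec):

```python
def is_valid_jump_target(code: bytes, dest: int,
                         invalid_jumpdests: dict[int, int]) -> bool:
    # 1. The jump destination must hold the JUMPDEST opcode (0x5b).
    if dest >= len(code) or code[dest] != 0x5B:
        return False
    # 2. If the destination's chunk is not flagged, the 0x5b cannot be pushdata.
    chunk = dest // 32
    if chunk not in invalid_jumpdests:
        return True
    # Otherwise, re-run jumpdest analysis inside the chunk, starting from the
    # first instruction boundary, to confirm dest is not inside pushdata.
    pos = chunk * 32 + invalid_jumpdests[chunk]
    while pos < dest:
        op = code[pos]
        if 0x60 <= op <= 0x7F:        # PUSH1..PUSH32
            pos += (op - 0x5F) + 1    # skip the opcode and its immediate data
        else:
            pos += 1
    return pos == dest  # valid iff dest lands on an instruction boundary
```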

Alternatively, it is possible to reconstruct the sparse account code prior to execution from all the chunks
submitted with the transaction, and to perform `JUMPDEST`-validation to build up a map of relevant
*valid `JUMPDEST` locations* instead.

#### Reference encoding implementation

```python
import io

import leb128  # https://pypi.org/project/leb128/


class VLQM33:
    """Codec for the invalid_jumpdests map: entry = delta * 33 + offset."""

    VALUE_MOD = 33  # first_instruction_offset is always < 33

    def encode(self, chunks: dict[int, int]) -> tuple[bytes, int]:
        # `chunks` maps chunk_index -> first_instruction_offset and must be
        # iterated in ascending chunk_index order.
        ops = b''
        last_chunk_index = 0
        for index, value in chunks.items():
            assert 0 <= value < self.VALUE_MOD
            # Distance from the chunk following the previously encoded one.
            delta = index - last_chunk_index
            # Reversible combination, because value < VALUE_MOD.
            e = delta * self.VALUE_MOD + value
            ops += leb128.u.encode(e)
            last_chunk_index = index + 1
        return ops, 8 * len(ops)  # the encoding and its length in bits

    def decode(self, ops: bytes) -> dict[int, int]:
        stream = io.BytesIO(ops)
        stream.seek(0, 2)  # locate the end of the stream
        end = stream.tell()
        stream.seek(0, 0)

        m = {}
        index = 0
        while stream.tell() != end:
            e, _ = leb128.u.decode_reader(stream)
            delta = e // self.VALUE_MOD
            value = e % self.VALUE_MOD
            index += delta
            m[index] = value
            index += 1
        return m
```
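
As a usage sketch, the WETH entries from the encoding example above round-trip through this codec (values taken from the table in that example):

```python
codec = VLQM33()
entries = {37: 4, 49: 12, 50: 14, 87: 13}
ops, bits = codec.encode(entries)
assert ops.hex() == "c909f7020eb109"
assert bits == 56
assert codec.decode(ops) == entries
```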


## Backwards Compatibility

EOF-packaged code execution is fully compatible with legacy code execution.