# Document dense encoding of invalid pushdata in EOFv0 #98

Status: Open. Wants to merge 16 commits into `main`.
1 change: 1 addition & 0 deletions in `.gitignore`:

```
@@ -1,3 +1,4 @@
/.idea
__pycache__
/corpus
/venv
```
139 changes: 139 additions & 0 deletions in `spec/eofv0_verkle.md` (hunk `@@ -105,6 +105,145 @@`):

The same as above except encode the values as 6-bit numbers
(minimum number of bits needed for encoding `32`).
Such encoding lowers the size overhead from 3.1% to 2.3%.

### Encode only invalid jumpdests (dense encoding)

An alternative option is, instead of encoding all valid `JUMPDEST` locations, to encode only the invalid ones.
By an invalid `JUMPDEST` we mean a `0x5b` byte in any pushdata.
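For example (an illustrative fragment of ours, not taken from either analyzed contract), the same `0x5b` byte value can appear both as pushdata and as a real `JUMPDEST`:

```
0x60 0x5b   PUSH1 0x5b   ; the 0x5b at offset 1 is pushdata: an invalid JUMPDEST
0x5b        JUMPDEST     ; the 0x5b at offset 2 starts an instruction: a valid JUMPDEST
```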

This is beneficial if our assumption is correct that most contracts only contain a limited number
of offending cases. Our initial analysis of the top 1000 used bytecodes suggests this is the case:
only 0.07% of bytecode bytes are invalid jumpdests.

Let's create a map of `invalid_jumpdests[chunk_index] = first_instruction_offset`. We can densely encode this
map using techniques similar to *run-length encoding* to skip distances and delta-encode indexes.
This map is always fully loaded prior to execution, and so it is important to ensure the encoded
version is as dense as possible (without sacrificing on complexity).

> **Reviewer:** Note to self: see how much of those costs could be covered by the 21000 gas.
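A minimal sketch of how this map could be built (Python; the function name is ours, and we use `CHUNK_SIZE` itself to mark chunks in which no instruction starts, consistent with the 33 possible values implied by the `s * 33` factor below):

```python
CHUNK_SIZE = 32

def build_invalid_jumpdests(code: bytes) -> dict[int, int]:
    """Build {chunk_index: first_instruction_offset} for every 32-byte chunk
    that contains a 0x5b byte inside pushdata (an invalid JUMPDEST).

    first_instruction_offset is 0..31, or 32 (CHUNK_SIZE) when no
    instruction starts within the chunk.
    """
    first_offset: dict[int, int] = {}  # chunk -> offset of first instruction
    invalid_chunks: set[int] = set()
    pc = 0
    while pc < len(code):
        first_offset.setdefault(pc // CHUNK_SIZE, pc % CHUNK_SIZE)
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32
            push_size = op - 0x5F
            for i in range(pc, min(pc + push_size, len(code))):
                if code[i] == 0x5B:  # 0x5b inside pushdata
                    invalid_chunks.add(i // CHUNK_SIZE)
            pc += push_size
    return {c: first_offset.get(c, CHUNK_SIZE) for c in sorted(invalid_chunks)}
```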

We propose an encoding using fixed-size 8-bit elements.
For each entry in `invalid_jumpdests`:

- 1-bit mode (`skip`, `value`)
- For skip-mode:
  - 7-bit number of chunks to skip
- For value-mode:
  - 7-bit number combining the number of chunks to skip `s` and `first_instruction_offset`,
    produced as `s * 33 + first_instruction_offset`
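A sketch of an encoder for these elements (the exact bit layout, with the mode flag in the top bit, is our assumption; the description above only fixes the element and payload sizes):

```python
def encode_invalid_jumpdests(invalid_jumpdests: dict[int, int]) -> bytes:
    """Encode the map as fixed-size 8-bit elements.
    Top bit: mode (0 = skip, 1 = value); low 7 bits: payload."""
    out = bytearray()
    next_chunk = 0  # first chunk not yet covered by the encoding
    for chunk, first_instruction_offset in sorted(invalid_jumpdests.items()):
        s = chunk - next_chunk  # chunks to skip before this entry
        # Emit skip elements until the remainder fits in a value payload.
        while s * 33 + first_instruction_offset > 0x7F:
            skip = min(s, 0x7F)
            out.append(skip)  # skip element (mode bit 0)
            s -= skip
        out.append(0x80 | (s * 33 + first_instruction_offset))  # value element
        next_chunk = chunk + 1
    return bytes(out)
```

A decoder walks the stream symmetrically, accumulating skip counts and splitting each value payload with `divmod(payload, 33)`.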

For the worst case where each chunk contains an invalid `JUMPDEST` the encoding length is equal
to the number of chunks in the code. I.e. the size overhead is 3.1%.

| code size limit | code chunks | encoding chunks |
|-----------------|-------------|-----------------|
| 24576 | 768 | 24 |
| 32768 | 1024 | 32 |
| 65536 | 2048 | 64 |
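As a sanity check, the table rows follow directly from the chunk arithmetic:

```python
for code_size_limit in (24576, 32768, 65536):
    code_chunks = code_size_limit // 32      # 32-byte code chunks
    worst_case_encoding_bytes = code_chunks  # one 8-bit element per chunk
    encoding_chunks = worst_case_encoding_bytes // 32
    print(code_size_limit, code_chunks, encoding_chunks)
# Worst-case overhead: 1 byte per 32-byte chunk = 1/32 = 3.125%, i.e. ~3.1%.
```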

Our current hunch is that for average contracts this results in a sub-1% overhead, while the worst case is 3.1%.
This is strictly better than the 3.2% overhead of the current Verkle code chunking.

#### Header location

It is possible to place the above as part of the "EOFv0" header, but given the upper bound on the number of chunks occupied is low (33 vs 21),
it is also possible to make this part of the Verkle account header.
> **Reviewer:** Yeah, but if we want to increase the maximum code size to 64k, there won't be enough space left for it in the header.
>
> **Member Author:** With scheme 1 it is still 56 verkle leafs for 64k code in the worst case. That should still easily fit into the 128 "special" first header leafs.
>
> **Member:** I think we definitely need a variadic length for this section because the average case (1–2 chunks) is much different from the worst case (20–30 chunks). I.e. you don't want to reserve ~60 chunks in the tree just to use 2 on average.

This second option (placing it in the Verkle account header) allows for the simplification of the `code_size` value, as it does not need to change.
> **Reviewer:** By "second option", you mean "adding it to the account header", not "Scheme 2", right?
>
> I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.

> **Member Author:**
> > By "second option", you mean "adding it to the account header", not "Scheme 2", right?
>
> Yes.
>
> > I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.
>
> No, because I'd imagine the account header (i.e. not code leafs/keys) would be handled separately, so the actual EVM code remains verbatim.


#### Runtime after Verkle

During execution of a jump, two checks must be done in this order:

1. Check if the jump destination is the `JUMPDEST` opcode.
2. Check if the jump destination's chunk is in the `invalid_jumpdests` map.
   If it is, the jumpdest analysis of the chunk must be performed
   to confirm the jump destination is not pushdata (see the sketch below).
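A sketch of this check sequence (all names are ours; `invalid_jumpdests` is the decoded map from above):

```python
def is_valid_jump_target(code: bytes, invalid_jumpdests: dict[int, int],
                         dest: int) -> bool:
    # 1. The destination byte must be the JUMPDEST opcode.
    if dest >= len(code) or code[dest] != 0x5B:
        return False
    # 2. If the destination's chunk is flagged, re-run jumpdest analysis from
    #    the chunk's first instruction to confirm dest is not pushdata.
    chunk = dest // 32
    if chunk in invalid_jumpdests:
        pc = chunk * 32 + invalid_jumpdests[chunk]  # first instruction in chunk
        while pc < dest:
            op = code[pc]
            pc += 1
            if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32: skip over pushdata
                pc += op - 0x5F
        return pc == dest  # valid only if dest starts an instruction
    return True
```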

Alternatively, it is possible to reconstruct the sparse account code prior to execution from all the chunks submitted with the transaction,
and to perform `JUMPDEST`-validation to build up a map of the relevant *valid `JUMPDEST` locations* instead.

#### Analysis

We have analyzed two contracts, Arbitrum validator and Uniswap router.

Arbitrum (2147 bytes long):
```
(absolute offset, chunk number, offset within chunk)
malicious push byte: 85 2 21
malicious push byte: 95 2 31
malicious push byte: 116 3 20
malicious push byte: 135 4 7
malicious push byte: 216 6 24
malicious push byte: 1334 41 22
```

> **Member:** This analysis is wrong because we have to encode the first instruction offset instead of the first invalid jumpdest offset. I think we should remove this section, or at least mark it as incorrect, until I come up with a proper analysis.

Encoding with *scheme 1*:
```
[skip, 2]
[value, 21]
[value, 31]
[skip, 1]
[value, 20]
[skip, 1]
[value, 7]
[skip, 2]
[value, 24]
[skip, 35]
[value, 22]
```

Encoding size: `5 skips (5 * 11 bits) + 6 values (6 * 7 bits)` = 13-byte header (0.605%)

Encoding with *scheme 2*:
```
[skip, 2]
[value, 0, 21]
[value, 0, 31]
[value, 1, 20]
[value, 1, 7]
[value, 2, 24]
[skip, 35, 22]
```

Encoding size: `2 skips (2 * 11 bits) + 5 values (5 * 11 bits)` = 10-byte header (0.465%)

Uniswap router contract (17958 bytes):

```
(absolute offset, chunk number, offset within chunk)
malicious push byte: 1646 51 14
malicious push byte: 1989 62 5
malicious push byte: 4239 132 15
malicious push byte: 4533 141 21
malicious push byte: 7043 220 3
malicious push byte: 8036 251 4
malicious push byte: 8604 268 28
malicious push byte: 12345 385 25
malicious push byte: 15761 492 17
```

Encoding using *scheme 2*:
```
[skip, 51]
[value, 0, 14]
[value, 11, 5]
[skip, 70]
[value, 0, 15]
[value, 9, 21]
[skip, 79]
[value, 0, 3]
[skip, 31]
[value, 0, 4]
[skip, 17]
[value, 0, 28]
[skip, 117]
[value, 0, 25]
[skip, 107]
[value, 0, 17]
```

Encoding size: `7 skips (7 * 11 bits) + 9 values (9 * 11 bits)` = 22-byte header (0.122%)

Our current hunch is that for average contracts this results in a sub-1% overhead, while the worst case is 4.1%.
> **Reviewer:** Those are good results, although I would like to see a full analysis, including contracts that are close to the 24kb limit. And, ideally, contracts with 64kb code size.
>
> **Member:** Note to myself: we will make a table with worst-case values for code size limits of 24k, 32k and 64k.

This compares against the constant 3.2% overhead of the current Verkle code chunking.

## Backwards Compatibility

EOF-packaged code execution is fully compatible with the legacy code execution.