# Document dense encoding of invalid pushdata in EOFv0 #98

Status: Open. Wants to merge 16 commits into `main`.
1 change: 1 addition & 0 deletions in `.gitignore`:

```
@@ -1,3 +1,4 @@
/.idea
__pycache__
/corpus
/venv
```
139 changes: 139 additions & 0 deletions in `spec/eofv0_verkle.md` (hunk `@@ -105,6 +105,145 @@`):

The same as above except encode the values as 6-bit numbers
(minimum number of bits needed for encoding `32`).
Such encoding lowers the size overhead from 3.1% to 2.3%.

### Encode only invalid jumpdests (dense encoding)

An alternative option is, instead of encoding all valid `JUMPDEST` locations, to encode only the invalid ones.
By an invalid `JUMPDEST` we mean a `0x5b` byte in any pushdata.
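For example (an illustrative fragment of ours, not taken from either analyzed contract), the same `0x5b` byte value can appear both as pushdata and as a real `JUMPDEST`:

```
0x60 0x5b   PUSH1 0x5b   ; the 0x5b at offset 1 is pushdata: an invalid JUMPDEST
0x5b        JUMPDEST     ; the 0x5b at offset 2 starts an instruction: a valid JUMPDEST
```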

This is beneficial if our assumption is correct that most contracts only contain a limited number
of offending cases. Our initial analysis of the top 1000 used bytecodes suggests this is the case:
only 0.07% of bytecode bytes are invalid jumpdests.

Let's create a map of `invalid_jumpdests[chunk_index] = first_instruction_offset`. We can densely encode this
map using techniques similar to *run-length encoding* to skip distances and delta-encode indexes.
This map is always fully loaded prior to execution, and so it is important to ensure the encoded
version is as dense as possible (without sacrificing on complexity).

> **Reviewer:** Note to self: see how much of those costs could be covered by the 21000 gas.
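A minimal sketch of how this map could be built (Python; the function name is ours, and we use `CHUNK_SIZE` itself to mark chunks in which no instruction starts, consistent with the 33 possible values implied by the `s * 33` factor below):

```python
CHUNK_SIZE = 32

def build_invalid_jumpdests(code: bytes) -> dict[int, int]:
    """Build {chunk_index: first_instruction_offset} for every 32-byte chunk
    that contains a 0x5b byte inside pushdata (an invalid JUMPDEST).

    first_instruction_offset is 0..31, or 32 (CHUNK_SIZE) when no
    instruction starts within the chunk.
    """
    first_offset: dict[int, int] = {}  # chunk -> offset of first instruction
    invalid_chunks: set[int] = set()
    pc = 0
    while pc < len(code):
        first_offset.setdefault(pc // CHUNK_SIZE, pc % CHUNK_SIZE)
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32
            push_size = op - 0x5F
            for i in range(pc, min(pc + push_size, len(code))):
                if code[i] == 0x5B:  # 0x5b inside pushdata
                    invalid_chunks.add(i // CHUNK_SIZE)
            pc += push_size
    return {c: first_offset.get(c, CHUNK_SIZE) for c in sorted(invalid_chunks)}
```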

We propose an encoding using fixed-size 8-bit elements.
For each entry in `invalid_jumpdests`:

- 1-bit mode (`skip`, `value`)
- For skip-mode:
  - 7-bit number of chunks to skip
- For value-mode:
  - 7-bit number combining the number of chunks to skip `s` and `first_instruction_offset`,
    produced as `s * 33 + first_instruction_offset`
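A sketch of an encoder for these elements (the exact bit layout, with the mode flag in the top bit, is our assumption; the description above only fixes the element and payload sizes):

```python
def encode_invalid_jumpdests(invalid_jumpdests: dict[int, int]) -> bytes:
    """Encode the map as fixed-size 8-bit elements.
    Top bit: mode (0 = skip, 1 = value); low 7 bits: payload."""
    out = bytearray()
    next_chunk = 0  # first chunk not yet covered by the encoding
    for chunk, first_instruction_offset in sorted(invalid_jumpdests.items()):
        s = chunk - next_chunk  # chunks to skip before this entry
        # Emit skip elements until the remainder fits in a value payload.
        while s * 33 + first_instruction_offset > 0x7F:
            skip = min(s, 0x7F)
            out.append(skip)  # skip element (mode bit 0)
            s -= skip
        out.append(0x80 | (s * 33 + first_instruction_offset))  # value element
        next_chunk = chunk + 1
    return bytes(out)
```

A decoder walks the stream symmetrically, accumulating skip counts and splitting each value payload with `divmod(payload, 33)`.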

For the worst case where each chunk contains an invalid `JUMPDEST` the encoding length is equal
to the number of chunks in the code. I.e. the size overhead is 3.1%.

| code size limit | code chunks | encoding chunks |
|-----------------|-------------|-----------------|
| 24576 | 768 | 24 |
| 32768 | 1024 | 32 |
| 65536 | 2048 | 64 |
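As a sanity check, the table rows follow directly from the chunk arithmetic:

```python
for code_size_limit in (24576, 32768, 65536):
    code_chunks = code_size_limit // 32      # 32-byte code chunks
    worst_case_encoding_bytes = code_chunks  # one 8-bit element per chunk
    encoding_chunks = worst_case_encoding_bytes // 32
    print(code_size_limit, code_chunks, encoding_chunks)
# Worst-case overhead: 1 byte per 32-byte chunk = 1/32 = 3.125%, i.e. ~3.1%.
```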

Our current hunch is that for average contracts this results in a sub-1% overhead, while the worst case is 3.1%.
This is strictly better than the 3.2% overhead of the current Verkle code chunking.

#### Header location

It is possible to place the above as part of the "EOFv0" header, but given the upper bound on the number of chunks occupied is low (33 vs 21),
it is also possible to make this part of the Verkle account header.
> **Reviewer:** Yeah, but if we want to increase the maximum code size to 64k, there won't be enough space left for it in the header.
>
> **Member Author:** With scheme 1 it is still 56 verkle leafs for 64k code in the worst case. That should still easily fit into the 128 "special" first header leafs.
>
> **Member:** I think we definitely need a variadic length for this section because the average case (1–2 chunks) is much different from the worst case (20–30 chunks). I.e. you don't want to reserve ~60 chunks in the tree just to use 2 on average.

This second option (placing it in the Verkle account header) allows for the simplification of the `code_size` value, as it does not need to change.
> **Reviewer:** By "second option", you mean "adding it to the account header", not "Scheme 2", right?
>
> I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.

> **Member Author:**
> > By "second option", you mean "adding it to the account header", not "Scheme 2", right?
>
> Yes.
>
> > I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.
>
> No, because I'd imagine the account header (i.e. not code leafs/keys) would be handled separately, so the actual EVM code remains verbatim.


#### Runtime after Verkle

During execution of a jump, two checks must be done in this order:

1. Check if the jump destination is the `JUMPDEST` opcode.
2. Check if the jump destination's chunk is in the `invalid_jumpdests` map.
   If it is, the jumpdest analysis of the chunk must be performed
   to confirm the jump destination is not pushdata (see the sketch below).
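A sketch of this check sequence (all names are ours; `invalid_jumpdests` is the decoded map from above):

```python
def is_valid_jump_target(code: bytes, invalid_jumpdests: dict[int, int],
                         dest: int) -> bool:
    # 1. The destination byte must be the JUMPDEST opcode.
    if dest >= len(code) or code[dest] != 0x5B:
        return False
    # 2. If the destination's chunk is flagged, re-run jumpdest analysis from
    #    the chunk's first instruction to confirm dest is not pushdata.
    chunk = dest // 32
    if chunk in invalid_jumpdests:
        pc = chunk * 32 + invalid_jumpdests[chunk]  # first instruction in chunk
        while pc < dest:
            op = code[pc]
            pc += 1
            if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32: skip over pushdata
                pc += op - 0x5F
        return pc == dest  # valid only if dest starts an instruction
    return True
```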

Alternatively, it is possible to reconstruct the sparse account code prior to execution from all the chunks submitted with the transaction,
and to perform `JUMPDEST`-validation to build up a map of the relevant *valid `JUMPDEST` locations* instead.

#### Analysis

We have analyzed two contracts, Arbitrum validator and Uniswap router.

Arbitrum (2147 bytes long):
```
(absolute offset, chunk number, offset within chunk)
malicious push byte: 85 2 21
malicious push byte: 95 2 31
malicious push byte: 116 3 20
malicious push byte: 135 4 7
malicious push byte: 216 6 24
malicious push byte: 1334 41 22
```

> **Member:** This analysis is wrong because we have to encode the first instruction offset instead of the first invalid jumpdest offset. I think we should remove this section, or at least mark it as incorrect, until I come up with a proper analysis.

Encoding with *scheme 1*:
```
[skip, 2]
[value, 21]
[value, 31]
[skip, 1]
[value, 20]
[skip, 1]
[value, 7]
[skip, 2]
[value, 24]
[skip, 35]
[value, 22]
```

Encoding size: `5 skips (5 * 11 bits) + 6 values (6 * 7 bits)` = 13-byte header (0.605%)

Encoding with *scheme 2*:
```
[skip, 2]
[value, 0, 21]
[value, 0, 31]
[value, 1, 20]
[value, 1, 7]
[value, 2, 24]
[skip, 35, 22]
```

Encoding size: `2 skips (2 * 11 bits) + 5 values (5 * 11 bits)` = 10-byte header (0.465%)

Uniswap router contract (17958 bytes):

```
(absolute offset, chunk number, offset within chunk)
malicious push byte: 1646 51 14
malicious push byte: 1989 62 5
malicious push byte: 4239 132 15
malicious push byte: 4533 141 21
malicious push byte: 7043 220 3
malicious push byte: 8036 251 4
malicious push byte: 8604 268 28
malicious push byte: 12345 385 25
malicious push byte: 15761 492 17
```

Encoding using *scheme 2*:
```
[skip, 51]
[value, 0, 14]
[value, 11, 5]
[skip, 70]
[value, 0, 15]
[value, 9, 21]
[skip, 79]
[value, 0, 3]
[skip, 31]
[value, 0, 4]
[skip, 17]
[value, 0, 28]
[skip, 117]
[value, 0, 25]
[skip, 107]
[value, 0, 17]
```

Encoding size: `7 skips (7 * 11 bits) + 9 values (9 * 11 bits)` = 22-byte header (0.122%)

Our current hunch is that for average contracts this results in a sub-1% overhead, while the worst case is 4.1%.
> **Reviewer:** Those are good results, although I would like to see a full analysis, including contracts that are close to the 24kb limit. And, ideally, contracts with 64kb code size.
>
> **Member:** Note to myself: we will make a table with worst-case values for code size limits of 24k, 32k and 64k.

This compares against the constant 3.2% overhead of the current Verkle code chunking.

## Backwards Compatibility

EOF-packaged code execution is fully compatible with the legacy code execution.