# Document dense encoding of invalid pushdata in EOFv0 #98

Open · wants to merge 16 commits into base: `main`
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
/.idea
__pycache__
/corpus
/venv
121 changes: 121 additions & 0 deletions spec/eofv0_verkle.md
@@ -105,6 +105,127 @@
The same as above except encode the values as 6-bit numbers
(minimum number of bits needed for encoding `32`).
Such encoding lowers the size overhead from 3.1% to 2.3%.

### Encode only invalid jumpdests (dense encoding)

An alternative option is, instead of encoding all valid `JUMPDEST` locations, to encode only the invalid ones.
By an invalid `JUMPDEST` we mean a `0x5b` byte occurring in any pushdata.

This is beneficial because most contracts contain only a limited number of offending cases.
Our initial analysis of the top 1000 bytecodes used in the last year confirms this:
only 0.07% of bytecode bytes are invalid jumpdests.
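
As a minimal illustration, a `0x5b` byte that is part of a push immediate looks like a `JUMPDEST` to a naive scan but is not a valid jump target (the `code` variable here is hypothetical):

```python
# Assembled bytecode for: PUSH1 0x5b; JUMPDEST
code = bytes.fromhex("605b5b")
# code[1] == 0x5b but is pushdata (PUSH1's immediate): an invalid jumpdest.
# code[2] == 0x5b and is a real instruction: a valid JUMPDEST.
```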

Let's create a map of `invalid_jumpdests[chunk_index] = first_instruction_offset`. We can densely encode this
map using techniques similar to *run-length encoding*: skipping distances and delta-encoding indexes.
This map is always fully loaded prior to execution, so it is important to ensure the encoded
version is as dense as possible (without sacrificing complexity).

> **Review comment:** Note to self: see how much of those costs could be covered by the 21000 gas.

We propose an encoding which uses [VLQ](https://en.wikipedia.org/wiki/Variable-length_quantity):

For each entry `index, first_instruction_offset` in `invalid_jumpdests`:

- Compute the chunk index distance to the previously encoded chunk: `delta = index - last_chunk_index - 1`.
- Combine the two numbers into a single unsigned integer: `entry = delta * 33 + first_instruction_offset`.
  This is reversible because `first_instruction_offset < 33`.
- Encode `entry` into a sequence of bytes using VLQ (e.g. LEB128), as sketched below.
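
A minimal sketch of the pack-and-encode step (the `uleb128_encode` helper is illustrative; the value `1225 -> c9 09` matches the worked example further down):

```python
def uleb128_encode(value: int) -> bytes:
    # Emit 7 bits per byte, least significant group first;
    # the high bit of each byte marks continuation.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Packing is reversible because first_instruction_offset < 33.
delta, offset = 37, 4
entry = delta * 33 + offset             # 1225
assert divmod(entry, 33) == (37, 4)     # unpacking recovers both numbers
assert uleb128_encode(entry) == bytes.fromhex("c909")
```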

In the worst case, where every chunk contains an invalid `JUMPDEST`, the encoding length equals
the number of chunks in the code, i.e. a size overhead of 3.1%.

| code size limit | code chunks | encoding chunks |
|-----------------|-------------|-----------------|
| 24576 | 768 | 24 |
| 32768 | 1024 | 32 |
| 65536 | 2048 | 64 |
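
A minimal sketch of the arithmetic behind this table, assuming 32-byte chunks and, in the worst case, one single-byte VLQ entry per code chunk (`delta = 0` gives `entry < 33`, which always fits in one byte):

```python
def worst_case_encoding_chunks(code_size_limit: int, chunk_size: int = 32) -> int:
    code_chunks = -(-code_size_limit // chunk_size)  # ceiling division
    # Worst case: every code chunk holds an invalid JUMPDEST and each
    # entry occupies exactly one byte of the encoding.
    encoding_bytes = code_chunks
    return -(-encoding_bytes // chunk_size)

for limit in (24576, 32768, 65536):
    print(limit, worst_case_encoding_chunks(limit))  # 24, 32, 64
```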

Our current hunch is that for average contracts this results in ~0.1% overhead, while the worst case is 3.1%.
This is strictly better than the 3.2% overhead of the current Verkle code chunking.

Stats from the "top 1000 bytecodes used in the last year" dataset:

```
total code length: 11785831
total encoding length: 11693 (0.099%)
encoding chunks distribution:
0: 109 (10.9%)
1: 838 (83.8%)
2: 49 ( 4.9%)
3: 4 ( 0.4%)
```

#### Encoding example

The top used bytecode: [0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2](https://etherscan.io/address/0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2) (WETH).

```
length: 3124
chunks: 98

chunks with invalid jumpdests:
chunk_index delta first_instruction_offset entry leb128
37 37 4 1225 c909
49 11 12 375 f702
50 0 14 14 0e
87 36 13 1201 b109
```

#### Header location

It is possible to place the above encoding as part of the "EOFv0" header, but given the upper bound on the number of chunks occupied is low (33 vs 21),
it is also possible to make it part of the Verkle account header.
> **Review comment:** Yeah, but if we want to increase the maximum code size to 64k, there won't be enough space left for it in the header.
>
> **Member Author:** With scheme 1 it is still 56 verkle leafs for 64k code in worst case. That should still easily fit into the 128 "special" first header leafs.
>
> **Member:** I think we definitely need a variadic length of this section because the average case (1–2 chunks) is much different from the worst case (20–30 chunks). I.e. you don't want to reserve ~60 chunks in the tree just to use 2 on average.


This second option also allows the `code_size` value to be simplified, as it does not need to change.
> **Review comment:** By "second option", you mean "adding it to the account header", not "Scheme 2", right? I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.
>
> **Member Author:**
> > By "second option", you mean "adding it to the account header", not "Scheme 2", right?
>
> Yes.
>
> > I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.
>
> No, because I'd imagine the account header (i.e. not code leafs/keys) would be handled separately, so the actual EVM code remains verbatim.


#### Runtime after Verkle

During execution of a jump, two checks must be done in this order:

1. Check that the jump destination is the `JUMPDEST` opcode.
2. Check whether the jump destination's chunk is in the `invalid_jumpdests` map.
   If it is, the jumpdest analysis of the chunk must be performed
   to confirm the jump destination is not pushdata (see the sketch below).
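
A minimal sketch of this two-step check, assuming 32-byte chunks (the `is_valid_jump_target` helper and its signature are illustrative, not part of the spec):

```python
def is_valid_jump_target(code: bytes, dest: int,
                         invalid_jumpdests: dict[int, int]) -> bool:
    # 1. The jump destination must hold the JUMPDEST opcode (0x5b).
    if dest >= len(code) or code[dest] != 0x5B:
        return False
    # 2. If the destination's chunk is not flagged, the 0x5b cannot be pushdata.
    chunk = dest // 32
    if chunk not in invalid_jumpdests:
        return True
    # Otherwise, re-run jumpdest analysis inside the chunk, starting from the
    # first instruction boundary, to confirm dest is not inside pushdata.
    pos = chunk * 32 + invalid_jumpdests[chunk]
    while pos < dest:
        op = code[pos]
        if 0x60 <= op <= 0x7F:        # PUSH1..PUSH32
            pos += (op - 0x5F) + 1    # skip the opcode and its immediate data
        else:
            pos += 1
    return pos == dest  # valid iff dest lands on an instruction boundary
```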

Alternatively, it is possible to reconstruct the sparse account code prior to execution from all the chunks
submitted with the transaction, and to perform `JUMPDEST`-validation to build up a map of relevant
*valid `JUMPDEST` locations* instead.

#### Reference encoding implementation

```python
import io

import leb128  # https://pypi.org/project/leb128/


class VLQM33:
    """Codec for the invalid_jumpdests map: entry = delta * 33 + offset."""

    VALUE_MOD = 33  # first_instruction_offset is always < 33

    def encode(self, chunks: dict[int, int]) -> tuple[bytes, int]:
        # `chunks` maps chunk_index -> first_instruction_offset and must be
        # iterated in ascending chunk_index order.
        ops = b''
        last_chunk_index = 0
        for index, value in chunks.items():
            assert 0 <= value < self.VALUE_MOD
            # Distance from the chunk following the previously encoded one.
            delta = index - last_chunk_index
            # Reversible combination, because value < VALUE_MOD.
            e = delta * self.VALUE_MOD + value
            ops += leb128.u.encode(e)
            last_chunk_index = index + 1
        return ops, 8 * len(ops)  # the encoding and its length in bits

    def decode(self, ops: bytes) -> dict[int, int]:
        stream = io.BytesIO(ops)
        stream.seek(0, 2)  # locate the end of the stream
        end = stream.tell()
        stream.seek(0, 0)

        m = {}
        index = 0
        while stream.tell() != end:
            e, _ = leb128.u.decode_reader(stream)
            delta = e // self.VALUE_MOD
            value = e % self.VALUE_MOD
            index += delta
            m[index] = value
            index += 1
        return m
```
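
As a usage sketch, the WETH entries from the encoding example above round-trip through this codec (values taken from the table in that example):

```python
codec = VLQM33()
entries = {37: 4, 49: 12, 50: 14, 87: 13}
ops, bits = codec.encode(entries)
assert ops.hex() == "c909f7020eb109"
assert bits == 56
assert codec.decode(ops) == entries
```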


## Backwards Compatibility

EOF-packaged code execution is fully compatible with legacy code execution.