EOFv0 for packaging legacy code in Verkle Trees #58

Merged · 5 commits · Apr 25, 2024
107 changes: 107 additions & 0 deletions spec/eofv0_verkle.md
# EOFv0 for packaging legacy code in Verkle Trees

A design draft proposing the use of EOF
for storing code in Verkle Trees,
as an alternative to the existing method of executing
31-byte code chunks accompanied by 1 byte of metadata.

## Goal

Member:

I'd summarise the main goal as the following:

Have a unified way of handling chunking (by not having chunking).

By reusing basic EOF constructs, this allows a simpler verkle implementation supporting both "eof0 legacy" and eof1.

Member:

Secondary objective: can this result in a better "code-to-data" ratio (by avoiding chunking)?

Member Author:

I'm confused by your term "not having chunking" / "avoiding chunking". Chunking will still be present. Do you mean the chunking scheme with a 31-byte code payload and an additional metadata byte?

Member:

> simplified verkle implementation supporting both "eof0 legacy" and eof1.

I agree that "simplifying the verkle implementation" and "better code-to-data ratio" (to be verified with data!) are the ultimate goals and benefits here, and they should be listed in the doc to keep it focused on that.

Provide the result of the jumpdest analysis of deployed code as an EOF section.
During code execution the jumpdest analysis is then already available,
and the answer to the question "is this jump target valid?" can be looked up
in the section. This allows using 32-byte Verkle Tree code chunks
(instead of 31 bytes of code + 1 byte of metadata).
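To make the sizes concrete, a rough back-of-the-envelope comparison of chunk counts for both schemes (the 7-byte header size is an invented placeholder, not part of this draft; the jumpdest bitmap costs 1 bit per code byte, as discussed in the Rationale):

```python
import math

def legacy_chunk_count(code_len: int) -> int:
    # Current scheme: 31 bytes of code + 1 metadata byte per 32-byte leaf.
    return math.ceil(code_len / 31)

def eofv0_chunk_count(code_len: int, header_len: int = 7) -> int:
    # EOFv0 scheme: full 32-byte code chunks, plus a jumpdest bitmap
    # (1 bit per code byte) and a small header (size assumed here).
    bitmap_len = math.ceil(code_len / 8)
    return math.ceil((header_len + bitmap_len + code_len) / 32)

# For the EIP-170 maximum code size of 24576 bytes:
print(legacy_chunk_count(24576))  # 793
print(eofv0_chunk_count(24576))   # 865
```

The bitmap makes the total footprint somewhat larger; the benefit is full 32-byte code chunks and random-access validity lookups, not size.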

## Specification Draft

1. Put the code in the single *code* EOF section.
2. Use the EOF container format proposed by [EIP-3540](https://eips.ethereum.org/EIPS/eip-3540) with
version 0 and following modifications to "Changes to execution semantics":
1. `CODECOPY`/`CODESIZE`/`EXTCODECOPY`/`EXTCODESIZE`/`EXTCODEHASH` operate on the *code*
section only.

Member: As discussed in our meeting, this is an important change. This is a separate set of semantics from what EOFv1 proposes.
2. `JUMP`/`JUMPI`/`PC` relate code positions to the *code* section only.
3. Perform the jumpdest analysis of the code at deploy time (during contract creation).
Member:
This actually makes a big semantic change: we are locking in the EVM version at the time of contract deployment (or verkle transition).

Currently on mainnet we rely on the fact that the semantics of contracts can change. This is both a negative (and maybe a positive?).

Example: if a new opcode is introduced, the jumpdest analysis result of a contract may change. This is not the case after this proposal.

Member:
I've not been following 100%, but could we mitigate this:

> Example: if a new opcode is introduced, the jumpdest analysis result of a contract may change. This is not the case after this proposal.

by versioning the jumpdest analysis result and updating it on first use after the new opcode was introduced? Similar to how we do "packaging" (point 5), we would do "re-packaging".

Member:
Not sure what you mean -- I mean that if the jumpdest analysis result is locked in at the time of contract creation / verkle transition, then the introduction of new legacy opcodes will need special considerations for such contracts.

Currently we can do an on-chain analysis to see what effect a new opcode may have (does it change the semantics of contracts?), but with the addition of the jumpdest table this analysis is different.

Member:
I mean that when you store the jumpdest analysis result during verkle transition, you can prepend a version 1 to it. If there's a new opcode introduced you would need to re-do the analysis, overwrite the jumpdest result and bump the version. That would happen the next time the code is touched.

Member (@axic, Feb 13, 2024):
That is unfeasible. We would need to update the entire chain at every hardfork. This is one of the reasons any transition like verkle, or the flat-tree proposed earlier, hits a roadblock.

Member Author:
Interesting topic. Locking the jumpdest analysis allows introducing instructions with immediate data. But this only works for deployed code, not for initcode (where the analysis runs before execution).

4. Store the result of the jumpdest analysis in the *jumpdest* EOF section as proposed
by [EIP-3690](https://eips.ethereum.org/EIPS/eip-3690),
but with the jumpdests encoding changed to a bitmap.
5. The packaging process is performed for every deployed code during the Verkle Tree migration
and also for every contract creation afterwards
(i.e. it becomes part of the consensus forever).
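The deploy-time analysis of steps 3–4 could be sketched as follows (a minimal illustration; the LSB-first bit order within bitmap bytes is an assumption, not fixed by this draft):

```python
JUMPDEST = 0x5B
PUSH1, PUSH32 = 0x60, 0x7F

def jumpdest_bitmap(code: bytes) -> bytes:
    """Run the legacy jumpdest analysis once and encode the result as a bitmap."""
    bitmap = bytearray((len(code) + 7) // 8)
    i = 0
    while i < len(code):
        op = code[i]
        if op == JUMPDEST:
            bitmap[i >> 3] |= 1 << (i & 7)  # LSB-first bit order (assumed)
        if PUSH1 <= op <= PUSH32:
            i += op - PUSH1 + 1  # skip push immediate data
        i += 1
    return bytes(bitmap)

# JUMPDEST at 0; PUSH1 0x5B (the 0x5B at offset 2 is data, not a jumpdest);
# JUMPDEST at 3 -> bits 0 and 3 set.
assert jumpdest_bitmap(bytes([0x5B, 0x60, 0x5B, 0x5B])) == bytes([0b00001001])
```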

Member:
We discussed special handling for data contracts: one could check if the first byte is a terminating instruction (STOP, REVERT, etc.) or an unassigned instruction.

However, if we mark something as a data contract based on an unassigned instruction, then the "problem" listed in https://github.com/ipsilon/eof/pull/58/files#r1484267443 comes up. The only "use case" I can see here is that someone deploys a contract with a soon-to-be-introduced instruction, and that will not work. Think about those merge NFTs.

Contributor (@gumb0, Feb 16, 2024):
So what happens to initcode? Do the creation transaction and CREATE/CREATE2 accept the container? I assume you didn't intend that, as it's a big change, so they accept only the stripped-down "code section". And the deployed container is also not returned from initcode as a container. I think this should be mentioned.

Personally, I would prefer not to package it into an EOF container and not to frame it as a "container" or anything related to EOF at all. Just prepend the bytecode with a jumptable in the code field of the account. None of the instructions change their semantics; we only change how bytecode is fetched from the trie.

Contributor:
It's a fix to the Verkle trie structure; it's not like we're inventing yet another EOF to make Verkle compatible with EOFv1.

## Backwards Compatibility

EOF-packaged code execution is fully compatible with legacy code execution.
This is achieved by prepending the legacy code with the EOF header and the section containing
the jumpdest metadata. The contents of the code section are identical to the legacy code.

Moreover, the wrapping process is bidirectional: the wrapping can be created from the legacy code,
and the legacy code can be extracted from the wrapping without any information loss.
Implementations may consider keeping the legacy code in the database without modifications
and only constructing the EOF wrapping when loading the code from the database.
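A minimal sketch of such a lossless round trip. The header layout below (magic, version, two big-endian section sizes) is invented purely for illustration; the real encoding would follow the EIP-3540 container format:

```python
import struct

MAGIC = b"\xEF\x00"   # EOF magic per EIP-3540
VERSION = 0           # hypothetical "version 0" for wrapped legacy code

def wrap(code: bytes, jumpdests: bytes) -> bytes:
    # Illustrative 7-byte header: magic, version, jumpdest size, code size.
    header = MAGIC + bytes([VERSION]) + struct.pack(">HH", len(jumpdests), len(code))
    return header + jumpdests + code

def unwrap(container: bytes) -> bytes:
    # Extract the unmodified legacy code back out of the wrapping.
    assert container[:2] == MAGIC and container[2] == VERSION
    jd_len, code_len = struct.unpack(">HH", container[3:7])
    return container[7 + jd_len : 7 + jd_len + code_len]

code = bytes([0x5B, 0x00])  # JUMPDEST; STOP
assert unwrap(wrap(code, b"\x01")) == code  # lossless round trip
```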

It can also be noted that the information in the *jumpdest* section is redundant with the `JUMPDEST`
instructions. However, we cannot remove these instructions from the code because
that would break at least *dynamic* jumps (where we would not be able to adjust their jump targets).

## Rationale

### Jumpdests encoding

EIP-3690 originally proposes delta encoding for the elements of the *jumpdest* section.
This should be efficient for an average contract but behaves badly in the worst case
(every instruction in the code is a `JUMPDEST`).
The delta encoding also has another disadvantage for Verkle Tree code chunking:
the whole (?) section must be loaded and preprocessed to check the validity of a jump target.

We propose to use a bitmap to encode jumpdests instead.
Such an encoding does not need pre-processing and provides random access.
It has a constant 12.5% size overhead (1 bit per code byte),
but does not have the two disadvantages mentioned above.
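The difference in lookup cost can be illustrated as follows. The delta encoding shown is a simplification of the EIP-3690 scheme (first element absolute, the rest relative), and the LSB-first bitmap bit order is an assumption:

```python
def is_valid_jumpdest_bitmap(bitmap: bytes, pc: int) -> bool:
    # O(1): read one bit at position pc (LSB-first bit order assumed).
    return bool(bitmap[pc >> 3] & (1 << (pc & 7)))

def is_valid_jumpdest_delta(deltas: list[int], pc: int) -> bool:
    # O(n): must walk the prefix of the section to reconstruct positions.
    pos = 0
    for d in deltas:
        pos += d
        if pos == pc:
            return True
        if pos > pc:
            return False
    return False

# jumpdests at offsets 2 and 7, in both encodings
assert is_valid_jumpdest_bitmap(bytes([0b10000100]), 7)
assert is_valid_jumpdest_delta([2, 5], 7)
```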

## Extensions

### Data section
Member Author:
I think it doesn't make sense for the given example heuristic. What it does is search for unreachable code at the end, without any valid jumpdests in it. We can achieve the same or better effect by trimming the jumpdest section to the last set bit.


Let's try to identify a segment at the end of the code where a contract stores data.
We require a heuristic that does not generate any false positives.
This arrangement ensures that the instructions inspecting the code
work without modifications on the continuous *code*+*data* area.
Having a *data* section makes the *code* section and therefore the *jumpdest* section smaller.

Example heuristic:

1. Decode instructions.
2. Traverse instructions in reverse order.
3. If during traversal a terminating instruction (`STOP`, `INVALID`, etc.)
or the code beginning is encountered,
then the *data* section starts just after the current position.
End here.
4. If during traversal a `JUMPDEST` instruction is encountered,
then there is no *data* section.
End here.
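The heuristic above could be sketched like this, assuming the instructions are already decoded into (offset, opcode) pairs; the terminating-opcode set is illustrative:

```python
# Illustrative set: STOP, RETURN, REVERT, INVALID, SELFDESTRUCT
TERMINATING = {0x00, 0xF3, 0xFD, 0xFE, 0xFF}
JUMPDEST = 0x5B

def data_section_start(instrs: list[tuple[int, int]]):
    """Return the offset where the *data* section starts, or None if none."""
    for pos, op in reversed(instrs):
        if op == JUMPDEST:
            return None        # a jumpdest before any terminator: no data section
        if op in TERMINATING:
            return pos + 1     # data starts just after the terminator
    return 0                   # reached the code beginning

# PUSH1 0 at 0; STOP at 2; then two trailing data bytes at 3 and 4
instrs = [(0, 0x60), (2, 0x00), (3, 0xAA), (4, 0xBB)]
assert data_section_start(instrs) == 3
```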

### Prove all jump targets are valid

If we can prove that all jump targets in the code are valid,
then there is no need for the *jumpdest* section.

In Solidity-generated code all `JUMPI` instructions are "static"
(preceded by a `PUSH` instruction).
Only some `JUMP` instructions are not "static", because they are used to implement
returns from functions.

The Erigon project had an analysis tool that was able to prove the validity of all jumps
for over 90% of contracts.
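A rough sketch of detecting such "static" jumps. This is a simplification: it only pairs a jump with an immediately preceding `PUSH`, which is what the Solidity pattern looks like; a real prover would need data-flow analysis for the remaining jumps:

```python
JUMP, JUMPI = 0x56, 0x57
PUSH1, PUSH32 = 0x60, 0x7F

def static_jump_targets(code: bytes) -> list[tuple[int, int]]:
    """Return (jump_offset, target) for jumps directly preceded by a PUSH."""
    targets = []
    prev = None  # (offset, opcode, push_immediate) of the previous instruction
    i = 0
    while i < len(code):
        op = code[i]
        if op in (JUMP, JUMPI) and prev is not None and PUSH1 <= prev[1] <= PUSH32:
            targets.append((i, prev[2]))  # target is the PUSH immediate
        if PUSH1 <= op <= PUSH32:
            n = op - PUSH1 + 1
            prev = (i, op, int.from_bytes(code[i + 1 : i + 1 + n], "big"))
            i += n
        else:
            prev = (i, op, None)
        i += 1
    return targets

# PUSH1 4; JUMP; INVALID; JUMPDEST -> one static jump at offset 2, target 4
assert static_jump_targets(bytes([0x60, 0x04, 0x56, 0xFE, 0x5B])) == [(2, 4)]
```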

### Super-dense metadata encoding (6-bit numbers)

Contributor: Another variant of this could be storing a chunk_id => offset map only for those chunks that have a non-zero offset ("number of leading pushdata bytes in a chunk"). Intuitively, it seems that most chunks have 0 there in practice.

Member Author: I'm not sure this is actually true; PUSH instructions are quite frequent. But we can measure this, of course.

The main disadvantage, however, is that you need to preprocess the section to use it.

I did some quick calculations: for the maximum number of chunks, 768, the size of the 5-bit encoded section is 480 bytes. Assuming 3 bytes per entry of the map encoding, that encoding brings savings only if at most ~20% of the chunks have a non-zero entry.

Follow the original Verkle Tree idea to provide the metadata of the
"number of leading pushdata bytes in a chunk". However, instead of including
this metadata as a single byte in the chunk itself, place the value as a 6-bit
encoded number in the *metadata* EOF section. This provides the following benefits:

1. The code executes in full 32-byte chunks.
2. The *metadata* overhead is smaller (2.3% instead of 3.2%).
3. The *metadata* lookup is only needed for jumps
(not needed when falling through to the next chunk).
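A sketch of the 6-bit packing (the value range 0..32 fits in 6 bits; the bit order within bytes is an assumption). For the maximum of 768 chunks the section is 768 × 6 / 8 = 576 bytes, i.e. about 2.3% of the 24576-byte maximum code size:

```python
def pack_6bit(values: list[int]) -> bytes:
    """Pack 6-bit values densely, low bits first (bit order assumed)."""
    acc = nbits = 0
    out = bytearray()
    for v in values:
        assert 0 <= v < 64
        acc |= v << nbits
        nbits += 6
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)
    return bytes(out)

def unpack_6bit(data: bytes, count: int) -> list[int]:
    """Inverse of pack_6bit for a known element count."""
    acc = nbits = 0
    out = []
    it = iter(data)
    for _ in range(count):
        while nbits < 6:
            acc |= next(it) << nbits
            nbits += 8
        out.append(acc & 0x3F)
        acc >>= 6
        nbits -= 6
    return out

vals = [0, 31, 1, 32]
assert unpack_6bit(pack_6bit(vals), len(vals)) == vals
assert len(pack_6bit([0] * 768)) == 576  # 2.3% of 24576 bytes
```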