
Experiment with direct threading for Wasmi's execution #1516

Open
@Robbepop

Description

Blocked by: #1433

Currently Wasmi uses wasmi_ir's Instruction enum for its simple instruction dispatch:

```rust
loop {
    match *self.ip.get() {
```

This is among the simplest solutions for implementing an interpreter's execution loop. IIRC this is also called "indirect token threading", since under the hood Wasmi dispatches on the enum discriminant of the Instruction enum, which acts as the "token".
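For illustration, here is a minimal, self-contained sketch of this indirect token threading style; the `Instruction` variants, fields, and register model below are made up for the example and are not Wasmi's actual definitions:

```rust
/// Simplified stand-in for `wasmi_ir::Instruction`; the real enum has many
/// hundreds of variants and a more compact parameter encoding.
#[derive(Copy, Clone)]
enum Instruction {
    /// `regs[result] = regs[lhs] + regs[rhs]`
    I32Add { result: usize, lhs: usize, rhs: usize },
    /// Stop execution.
    Return,
}

struct Executor {
    ip: usize,
    code: Vec<Instruction>,
    regs: Vec<i32>,
}

impl Executor {
    fn run(&mut self) {
        loop {
            // Dispatch on the enum discriminant (the "token"); with many
            // variants the compiler typically lowers this to a jump table.
            match self.code[self.ip] {
                Instruction::I32Add { result, lhs, rhs } => {
                    let sum = self.regs[lhs].wrapping_add(self.regs[rhs]);
                    self.regs[result] = sum;
                    self.ip += 1;
                }
                Instruction::Return => return,
            }
        }
    }
}
```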

However, with an increasing number of instructions this indirect token threading dispatch reaches its limits concerning performance. One explanation is that the table mapping from token -> instruction-handler no longer fits neatly into the CPU's L1 cache once there are many hundreds of instructions. Maybe this is also the reason why LLVM does not generate such a branch table for Wasmi's instruction dispatch and instead produces codegen that I have genuinely never managed to fully understand.

Wasmi currently has 550 (!simd) and 800 (simd) instructions, and according to the theory described above it would benefit greatly from implementing a direct threading instruction dispatch. For this, instead of dispatching on the Instruction's enum discriminant, the Wasmi translator translates instructions without their discriminant and instead stores a function pointer to their instruction handler with each of them. This eliminates having to dispatch on the discriminant during instruction execution.
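To illustrate the idea, here is a minimal sketch of such a direct-threaded encoding; the `Op` layout, the 8-byte parameter field, and the handlers are illustrative assumptions for this example, not Wasmi's planned design. Note that real direct-threaded interpreters such as Wasm3 typically chain from one handler to the next via tail calls instead of returning into a central loop; the loop below just keeps the sketch simple:

```rust
/// Hypothetical direct-threaded encoding: every instruction stores a pointer
/// to its handler plus its (already decoded) parameters.
#[derive(Copy, Clone)]
struct Op {
    handler: fn(&mut Executor, [u8; 8]),
    params: [u8; 8],
}

struct Executor {
    ip: usize,
    code: Vec<Op>,
    regs: Vec<i32>,
    running: bool,
}

impl Executor {
    fn run(&mut self) {
        while self.running {
            let op = self.code[self.ip];
            // No discriminant to decode: call straight through the stored pointer.
            (op.handler)(self, op.params);
        }
    }
}

/// Handler for a hypothetical `i32.add result, lhs, rhs` instruction.
fn i32_add(exec: &mut Executor, params: [u8; 8]) {
    let [result, lhs, rhs, ..] = params;
    let sum = exec.regs[lhs as usize].wrapping_add(exec.regs[rhs as usize]);
    exec.regs[result as usize] = sum;
    exec.ip += 1;
}

/// Handler that stops the dispatch loop.
fn ret(exec: &mut Executor, _params: [u8; 8]) {
    exec.running = false;
}
```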

The most efficient Wasm interpreters such as makepad-stitch and Wasm3 employ this dispatch technique to great success.

Another advantage of using direct threading dispatch in Wasmi is that it would allow using 8 bytes of instruction parameters per instruction instead of just 6. This would make it possible to specialize some f32 and f64 instructions with immediate variants, similar to their i32 and i64 counterparts, and similar improvements to the instruction encoding could be made across the board. The major obvious downside is that Wasmi's IR (or bytecode) would use significantly (~1.5-2x) more memory, since the baseline instruction size grows from 8 to 16 bytes: an up to 64-bit function pointer instead of a 2-byte enum discriminant, followed by a minimum of 8 bytes of parameters instead of just 6.
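To make the size comparison concrete, here is a rough sketch of the two baselines on a 64-bit target; these are illustrative layouts for the example, not the actual Wasmi encodings:

```rust
/// Token-threaded baseline: 2-byte discriminant + up to 6 bytes of parameters.
#[repr(C)]
struct TokenOp {
    /// Enum discriminant acting as the dispatch "token".
    discriminant: u16,
    /// Up to 6 bytes of instruction parameters.
    params: [u8; 6],
}

/// Direct-threaded baseline: 8-byte handler pointer + 8 bytes of parameters.
#[repr(C)]
struct DirectOp {
    /// Handler function pointer (8 bytes on 64-bit targets).
    handler: fn(),
    /// A full 8 bytes of instruction parameters.
    params: [u8; 8],
}

// 8 bytes vs. 16 bytes per instruction, assuming a 64-bit target.
const _: () = assert!(core::mem::size_of::<TokenOp>() == 8);
const _: () = assert!(core::mem::size_of::<DirectOp>() == 16);
```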

Metadata

    Labels

    optimization: A performance optimization issue.
    research: Research-intense work item.
