
Experiment with direct threading for Wasmi's execution #1516

Open
@Robbepop

Description

Blocked by: #1433

Currently Wasmi uses wasmi_ir's Instruction enum for its simple instruction dispatch:

```rust
loop {
    match *self.ip.get() {
```

This is among the simplest solutions for implementing an interpreter's execution loop. IIRC this is also called "indirect token threading", since under the hood Wasmi dispatches on the enum discriminant of the Instruction enum, which acts as the "token".
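For illustration, here is a minimal, self-contained sketch of this indirect token threading style; the `Instruction` variants, fields, and register model below are made up for the example and are not Wasmi's actual definitions:

```rust
/// Simplified stand-in for `wasmi_ir::Instruction`; the real enum has many
/// hundreds of variants and a more compact parameter encoding.
#[derive(Copy, Clone)]
enum Instruction {
    /// `regs[result] = regs[lhs] + regs[rhs]`
    I32Add { result: usize, lhs: usize, rhs: usize },
    /// Stop execution.
    Return,
}

struct Executor {
    ip: usize,
    code: Vec<Instruction>,
    regs: Vec<i32>,
}

impl Executor {
    fn run(&mut self) {
        loop {
            // Dispatch on the enum discriminant (the "token"); with many
            // variants the compiler typically lowers this to a jump table.
            match self.code[self.ip] {
                Instruction::I32Add { result, lhs, rhs } => {
                    let sum = self.regs[lhs].wrapping_add(self.regs[rhs]);
                    self.regs[result] = sum;
                    self.ip += 1;
                }
                Instruction::Return => return,
            }
        }
    }
}
```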

However, with an increasing number of instructions this indirect token threading dispatch reaches its limits concerning performance. One explanation is that the table mapping from token -> instruction-handler no longer fits neatly into the CPU's L1 cache once there are many hundreds of instructions. Maybe this is also the reason why LLVM does not generate such a branch table for Wasmi's instruction dispatch and instead produces codegen that I have genuinely never managed to fully understand.

Wasmi currently has 550 (!simd) and 800 (simd) instructions, and according to the theory described above it would benefit greatly from implementing a direct threading instruction dispatch. For this, instead of dispatching on the Instruction's enum discriminant, the Wasmi translator translates instructions without their discriminant and instead stores a function pointer to their instruction handler with each of them. This eliminates having to dispatch on the discriminant during instruction execution.
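To illustrate the idea, here is a minimal sketch of such a direct-threaded encoding; the `Op` layout, the 8-byte parameter field, and the handlers are illustrative assumptions for this example, not Wasmi's planned design. Note that real direct-threaded interpreters such as Wasm3 typically chain from one handler to the next via tail calls instead of returning into a central loop; the loop below just keeps the sketch simple:

```rust
/// Hypothetical direct-threaded encoding: every instruction stores a pointer
/// to its handler plus its (already decoded) parameters.
#[derive(Copy, Clone)]
struct Op {
    handler: fn(&mut Executor, [u8; 8]),
    params: [u8; 8],
}

struct Executor {
    ip: usize,
    code: Vec<Op>,
    regs: Vec<i32>,
    running: bool,
}

impl Executor {
    fn run(&mut self) {
        while self.running {
            let op = self.code[self.ip];
            // No discriminant to decode: call straight through the stored pointer.
            (op.handler)(self, op.params);
        }
    }
}

/// Handler for a hypothetical `i32.add result, lhs, rhs` instruction.
fn i32_add(exec: &mut Executor, params: [u8; 8]) {
    let [result, lhs, rhs, ..] = params;
    let sum = exec.regs[lhs as usize].wrapping_add(exec.regs[rhs as usize]);
    exec.regs[result as usize] = sum;
    exec.ip += 1;
}

/// Handler that stops the dispatch loop.
fn ret(exec: &mut Executor, _params: [u8; 8]) {
    exec.running = false;
}
```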

The most efficient Wasm interpreters such as makepad-stitch and Wasm3 employ this dispatch technique to great success.

Another advantage of using direct threading dispatch in Wasmi is that it would allow using 8 bytes of instruction parameters per instruction instead of just 6. This would make it possible to specialize some f32 and f64 instructions with immediate variants, similar to their i32 and i64 counterparts, and similar improvements to the instruction encoding could be made across the board. The major obvious downside is that Wasmi's IR (or bytecode) would use significantly (~1.5-2x) more memory, since the baseline instruction size grows from 8 to 16 bytes: an up to 64-bit function pointer instead of a 2-byte enum discriminant, followed by a minimum of 8 bytes of parameters instead of just 6.
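To make the size comparison concrete, here is a rough sketch of the two baselines on a 64-bit target; these are illustrative layouts for the example, not the actual Wasmi encodings:

```rust
/// Token-threaded baseline: 2-byte discriminant + up to 6 bytes of parameters.
#[repr(C)]
struct TokenOp {
    /// Enum discriminant acting as the dispatch "token".
    discriminant: u16,
    /// Up to 6 bytes of instruction parameters.
    params: [u8; 6],
}

/// Direct-threaded baseline: 8-byte handler pointer + 8 bytes of parameters.
#[repr(C)]
struct DirectOp {
    /// Handler function pointer (8 bytes on 64-bit targets).
    handler: fn(),
    /// A full 8 bytes of instruction parameters.
    params: [u8; 8],
}

// 8 bytes vs. 16 bytes per instruction, assuming a 64-bit target.
const _: () = assert!(core::mem::size_of::<TokenOp>() == 8);
const _: () = assert!(core::mem::size_of::<DirectOp>() == 16);
```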

Metadata

    Labels

    optimization: A performance optimization issue.
    research: Research-intense work item.
