-
Notifications
You must be signed in to change notification settings - Fork 4
PJIT Organization
PJIT consists of two major tables: the opcode thread cache (simply 'the cache') and the opcode literal table (simply 'the table').
The cache is constructed as an n-way set associative cache where the n-weight is a compile-time option to be determined during performance testing later. It will use roughly 1/8 to 1/2 all of available system memory, with the final actual size to be determined later -- a large cache might be underutilized with the typical size of executables to be found on the Amiga platform and a cache that is too small may incur too many cache-misses that reduce performance.
Each cache page is separated by a block of return statements to the outer interpreter to handle the variable opcode lengths of the 68000.
The table consists of every opcode permutation as a single, algorithmically generated function. Each function can then assume to know beforehand the sum of all its registers, addressing modes and operations, save for when the so-called extended addressing modes are used or when handling co-processor operations.
PJIT consists of two major processes: the outer interpreter and the inner interpreter.
The inner interpreter is subroutine threaded where each opcode is implemented in the cache as a subroutine call and each operation is terminated with a return statement. Because of branch-prediction and the return-link stack; the overhead of subroutine threading should be negligible and should exceed any other threading model save for traditional JIT.
The original 68000 stream is a combination of opcodes and operands and when returning to the subroutine thread, it's important to not execute operands as instructions. Since modifying the link register would cause a pipeline stall, operands shall instead be replaced with up to three NOPs (for 1-3 operands) or one NOP followed by a branch (for 4 or more operands) with NOPs padding the gap.
The outer interpreter[^1] will find the entry point into the cache and if necessary, invalidate the page when a cache miss occurs. In this case, the cache is preloaded with the address of the opcode lookup function. This function will use the current real program counter to determine what the 68000 program counter ought to be, looks up the opcode in the table and substitutes the cache entry with a call to the opcode itself. It then calls the opcode immediately before returning.
All normal (straight-line) opcodes within the table simply end in a return statement; simply ld pc, lr
. The link register also performs double duty as a way of quickly determining what the original 68K program counter should be.
All non-conditional branching opcodes within the table simply jump to the outer interpreter where the entry point is rechecked against the cache tag before continuing.
All conditional branches within the table will either return (branch not-taken) or jump to the outer interpreter (the branch is taken) as required.
The ARM processor does not handle a branch-to-a-branch condition; it is always preferable to branch to an instruction. With that in mind, it may make sense to increase the size of the cache by two to interleave NOP and BL commands with the return inevitably connecting with the subsequent NOP and not another branch. Since NOP is a "perfect superscalar opcode" (it can always be executed in parallel if there's a slot free), the extra overhead of this should be much smaller than any pipeline stall.
On the 68000, nearly every instruction can set various flags in the condition register and these have to be captured in the emulator. However, if it is clear that the very next instruction resets the same flags that this instruction sets, then a version of the opcode which does not set any flags may be used. This often omits one or two instructions from each and every opcode.
If a branch or jump is guaranteed to be on the same cache page then it's possible to do a direct branch without returning to the outer interpreter. This will incur some branch misprediction performance, but should still be faster.
Some subroutines may be no bigger than the subroutine call to them and this is certainly the case with NOP. With NOP padding, this may be even more true. A simple table of each opcode's size can be crafted which is then used to determine if inlining is possible and how much NOP padding is needed. This is more work for the outer interpreter and might not actually be beneficial -- we'll see.