Component: RTL
Hi, we've identified a performance bug in the decoder/APU dispatcher. If FPU_*_LAT > 1, the current implementation stalls each FPU instruction until the previous one completes, even when there are no data hazards between them.
Steps to Reproduce
RTL Configuration
COREV_PULP=0, COREV_CLUSTER=0
FPU=1, ZFINX=0
FPU_ADDMUL_LAT=2
FPU_OTHERS_LAT=2
Software Test
A sequence of independent FMADD (or other FPU) instructions:
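For illustration, a minimal sketch of such a sequence is shown below (the register choices are ours and purely illustrative, not the exact test program used); each fmadd.s reads and writes disjoint registers, so there are no RAW/WAW hazards between them:

```asm
    # Illustrative only: independent single-precision FMADDs with disjoint
    # source and destination registers, so a pipelined FPU should be able to
    # accept one per cycle with no hazard-related stalls.
    fmadd.s fa0, ft0, ft1,  ft2
    fmadd.s fa1, ft3, ft4,  ft5
    fmadd.s fa2, ft6, ft7,  ft8
    fmadd.s fa3, ft9, ft10, ft11
    fmadd.s fa4, fs0, fs1,  fs2
    fmadd.s fa5, fs3, fs4,  fs5
    fmadd.s fa6, fs6, fs7,  fs8
    fmadd.s fa7, fs9, fs10, fs11
```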
Waves
Unnecessary pipeline bubbles:
For comparison, when FPU_ADDMUL_LAT=1, there are no stalls...
Source
The primary (but likely not the only) culprit seems to be that apu_lat (the encoding of the instruction latencies) is only 2 bits wide, so for FPU_*_LAT > 1 the encoding for normal FPU instructions is the same as the "max latency/always stall" encoding used for DIV/SQRT (2'h3).
Decoder...
Dispatcher...
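The decoder and dispatcher snippets are collapsed above; as a rough sketch only (not the actual cv32e40p source — the module, signal, and parameter names below are our own), the encoding collapse described above looks like this: with a 2-bit latency field, any configured latency of 2 or more lands on the same 2'h3 value that the dispatcher treats as "maximum/unknown latency, stall until the outstanding operation completes" for DIV/SQRT:

```systemverilog
// Illustrative sketch only, not the actual cv32e40p decoder.
module apu_lat_sketch #(
  parameter int unsigned FPU_ADDMUL_LAT = 2,
  parameter int unsigned FPU_OTHERS_LAT = 2
) (
  input  logic       is_divsqrt_i,  // hypothetical op-class flags
  input  logic       is_addmul_i,
  output logic [1:0] apu_lat_o      // 2-bit latency encoding
);

  always_comb begin
    if (is_divsqrt_i) begin
      // Variable-latency ops: encoded as "max latency, always stall".
      apu_lat_o = 2'h3;
    end else if (is_addmul_i) begin
      // Latencies 0/1 still fit, but any FPU_ADDMUL_LAT >= 2 collapses
      // onto the same 2'h3 encoding used for DIV/SQRT.
      apu_lat_o = (FPU_ADDMUL_LAT < 2) ? 2'(FPU_ADDMUL_LAT + 1) : 2'h3;
    end else begin
      apu_lat_o = (FPU_OTHERS_LAT < 2) ? 2'(FPU_OTHERS_LAT + 1) : 2'h3;
    end
  end

endmodule
```

If the dispatcher can no longer tell a fixed-latency FMADD apart from a variable-latency divide, conservatively serializing every FPU instruction would be the expected result, which matches the waves above.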
Comments
The pipeline details page of the User Manual outlines the expected data-hazard stalls, but doesn't mention this non-data-hazard stall at deeper pipeline depths. The text further down is ambiguous: "Floating-Point instructions are dispatched to the FPU. Following instructions can be executed by the Core as long as they are not FPU ones and there are no Read-After-Write or Write-After-Write data hazard between them and the destination register of the outstanding FPU instruction." It's unclear whether this says only that the integer pipeline can execute independent instructions while the FPU is active, or also implies that any further FPU instructions will be stalled.
Given that the FPU is pipelined, the core RTL should either be fixed to make use of that pipeline, or the documentation should be updated to state clearly that only one FPU instruction can be in flight at a time. The latter is critical to know for deeper pipelines, since you incur the full latency of every FPU instruction unless you can hide enough independent integer instructions in between, which is usually not feasible.
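To illustrate what "hiding" would require, here is a rough sketch under the assumption that the stall lasts on the order of the configured FPU latency (the exact number of filler instructions depends on the dispatcher handshake):

```asm
    # Illustrative only: with FPU_ADDMUL_LAT=2, each FPU instruction would need
    # roughly that many unrelated integer instructions behind it before the
    # next FPU instruction can issue without stalling.
    fmadd.s fa0, ft0, ft1, ft2
    addi    t0, t0, 1            # independent integer filler
    addi    t1, t1, 1            # independent integer filler
    fmadd.s fa1, ft3, ft4, ft5
```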
More broadly speaking:

1. Was there any awareness of, or discussion around, fully utilizing the FPU pipeline? Was this intentionally not implemented, and if so, why?
2. Was there any discussion around performance verification, i.e., verifying that the micro-architecture is not only functionally correct but also matches the spec in terms of the timing of different instruction sequences? If so, what were the factors and conclusions?