
Statically scheduled theta conversion in rhls #552

Open · wants to merge 6 commits into master

Conversation

ProgrammingLouis
Contributor

Statically scheduled theta conversion in rhls

THIS IS STILL A WORK IN PROGRESS: the goal of this PR is to present the basic mechanisms and implementation and to get feedback.

The final goal is to have a statically scheduled version of rhls to compare with the existing dynamic version. Furthermore, one could propose mixed statically/dynamically scheduled hardware that benefits from the performance of dynamic HLS and from the small circuit area of static HLS.

Theta nodes are converted into a new jlm::static_hls::loop_node. The jlm::static_hls::loop_node is a structural node that contains two subregions:

  1. The control subregion, which contains registers, muxes, and an fsm_node (Finite State Machine)
  2. The compute subregion, which contains the compute operations
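To make the split described above concrete, here is a minimal illustrative model of the loop_node layout. All type names (`Node`, `Region`, `LoopNode`) are hypothetical stand-ins for the real jlm RVSDG classes, not the actual API:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the static_hls::loop_node structure.
struct Node { std::string kind; };           // "reg", "mux", "fsm", "add", ...
struct Region { std::vector<Node> nodes; };  // stands in for an RVSDG region

struct LoopNode {
  Region control;  // registers, muxes, and the fsm node
  Region compute;  // the arithmetic/memory operations
};
```

The key property is that control logic (registers, muxes, fsm) and datapath operations live in separate subregions of a single structural node.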

The implemented scheduling algorithm is made to be as simple as possible (and is really not efficient) at this point.
Every node in the original theta subregion goes through the loop_node::add_node method, which either

  • Adds the node to the compute subregion, adds a register in the control subregion for each of its outputs, and connects each of its inputs via new muxes
  • Or, if the node's operation is already implemented in the compute subregion, adds a register in the control subregion for each of its outputs and adds each input to the corresponding mux

An fsm_state is also created for each original node.
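The two cases of add_node can be sketched with a small illustrative model. The class and member names here are hypothetical (operation names stand in for RVSDG operations, and counters stand in for the real register/mux wiring):

```cpp
#include <map>
#include <string>

// Illustrative model of the loop_node::add_node decision, not the jlm code.
struct Scheduler {
  std::map<std::string, int> compute_ops;  // op -> how many theta nodes map to it
  int registers = 0;                       // registers in the control subregion
  int muxes = 0;                           // muxes in the control subregion

  void add_node(const std::string & op, int n_inputs, int n_outputs) {
    auto it = compute_ops.find(op);
    if (it == compute_ops.end()) {
      // New operation: place it in the compute subregion, create one register
      // per output and one mux per input in the control subregion.
      compute_ops[op] = 1;
      registers += n_outputs;
      muxes += n_inputs;
    } else {
      // Operation already implemented: reuse it, add only the output
      // registers and route the inputs through the existing muxes.
      it->second++;
      registers += n_outputs;
    }
  }
};
```

Adding two nodes with the same operation therefore creates the muxes only once, which is where the area saving of the static version comes from.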

The fsm.hpp file contains 3 main classes for representing the finite state machine:

  • fsm_node_temp: When building the fsm, this structural node is created by the fsm_builder, and structural outputs are incrementally added to it to connect the registers' and muxes' control inputs
  • fsm_state: fsm_states are regions that represent a state of the fsm. They contain control constants to set the registers' store inputs and the muxes' control inputs. Their results are connected to the structural outputs of the fsm_node_temp, but they are not subregions of the fsm_node_temp because they are incrementally added. This is a workaround to be able to incrementally add states to the fsm, as well as to add newly connected registers and muxes.
  • fsm_builder: This is the main class for building the fsm. When the building is complete, the fsm_node_temp is deleted and converted to a gamma (see fsm_node_builder::generate_gamma() and loop_node::finalize())
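The builder lifecycle described above (collect states incrementally, then convert to a gamma once their number is known) can be sketched as follows. The names `FsmState`, `FsmBuilder`, `add_state`, and `finalize` are illustrative, not the actual fsm.hpp interface:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One fsm state: the control constants that drive the register store
// inputs and mux select inputs during that cycle.
struct FsmState { std::vector<int> control_signals; };

// Illustrative sketch of the fsm_builder lifecycle.
class FsmBuilder {
  std::vector<FsmState> states_;
public:
  // States are added incrementally while the theta is traversed.
  int add_state(FsmState s) {
    states_.push_back(std::move(s));
    return static_cast<int>(states_.size()) - 1;  // state index
  }
  // Stand-in for generate_gamma(): only now is the number of gamma
  // subregions (= number of fsm states) known.
  std::size_t finalize() const { return states_.size(); }
};
```

The point of the temporary node in the real code is exactly this deferral: a gamma needs its subregion count up front, but the states only become known during traversal.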

This PR does not yet contain the conversion to FIRRTL.

This PR exists to get feedback on this implementation, but it still misses a lot of things that will be added soon in new commits.
Things that I'm planning to add as soon as possible:

  • Encoding the next state of the fsm in each state
  • Renaming of classes and methods (especially backedge argument and result, which are not only used for backedges)
  • Documentation for every class and method
  • A small test case
  • Proper mechanisms to handle the predicate, initial state, and output state

[Figure: fsm]
[Figure: loop_node]


@sjalander
Collaborator

@haved I added you as a reviewer as you should get familiar with most of the jlm code base, and this is a larger contribution to it.

@haved
Collaborator

haved commented Jul 18, 2024

@sjalander I agree, I will look at it. I don't quite understand what program/circuit the figures shown in the PR represent. I assume they are the result of conversion, but what did the theta node look like before transformation? Is static HLS written about somewhere? @ProgrammingLouis

@phate
Owner

phate commented Jul 19, 2024

@sjalander @ProgrammingLouis I spent some time looking at this now, and I have quite a hard time understanding what is going on. To the best of my understanding, here are some high-level comments to start with:

  1. It would be nice to have 1 or 2 simple(!) unit tests where I can see the theta conversion in action. This would enable me to step through the conversion process to better understand what exactly is going on. It would also allow me to be able to concretely map the input to the conversion (theta) to the output of the conversion (static scheduled HLS loop) and compare them.
  2. (Restating what I got from the code): The finite state machine is implemented using a gamma. The reason why the gamma is only created at the end is that you need to have seen all the nodes in order to know all the finite state machine states. In the end, each state is represented by a subregion in the gamma, which has some control constants in it that drive the continuation of the state machine. The way you keep track of these states throughout the conversion is to have a dummy structural node (fsm_node_temp) and dummy regions (fsm_state), which are both removed again once you create the gamma at the end. This is also my biggest issue with the code right now (from a high-level view): You are currently (mis-)using structural nodes and regions to keep track of information throughout the conversion. The reason why you seem to do this is that in order to create the finite state machine gamma node, you need to know the number of subregions, i.e., the number of states, it is supposed to have, but you cannot know this before you have traversed the graph. Looking at the code, an fsm_state is added: (i) for every input of the original theta and (ii) for every node in the original theta region. Thus, could you not just precompute the number of states before conversion by simply summing up the number of inputs and the number of nodes? If it is not possible to predetermine the number of states (even with an extra pass through the RVSDG), I would like you to create your own data structures to keep track of this state throughout the conversion instead of misusing the RVSDG data structures. We do this similarly in other transformations, such as in Dead Node Elimination or Memory State Encoder.
  3. In the image above, you have a turquoise back-edge at the gamma (from a gamma output to a gamma input). I do not see this back-edge realized in the code. Is this a mistake in the drawing or does this exist in the code as well? How is it realized?
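The precomputation suggested in point 2 would be a one-liner; a hedged sketch with a hypothetical helper name (the real counts would come from the theta node's inputs and its subregion's nodes):

```cpp
#include <cstddef>

// Sketch of the suggestion above: one fsm state is created per theta input
// and per node in the theta region, so the gamma's subregion count can be
// precomputed in a single pass. Hypothetical function, not jlm API.
inline std::size_t predicted_state_count(std::size_t theta_inputs,
                                         std::size_t theta_nodes)
{
  return theta_inputs + theta_nodes;
}
```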

@sjalander
Collaborator

@phate @haved @ProgrammingLouis @davidmetz
Static scheduling can be compared with instruction scheduling for a VLIW (very long instruction word) architecture. In a VLIW, you have a given set of operations that can be scheduled in a single cycle. It is the compiler's work to select which operations are included in each cycle, to route results to a register file, and to read these back to feed the operations.

In our case, we have a bit more freedom, as the architecture is not predefined. So we can choose how many operations can be performed in a single cycle, i.e., the number of adds/subs, multiplications, memory operations, etc. We can think of this as a parallel ALU/MEM/BRANCH unit that in a conventional pipeline represents the execute stage. This part is represented by the “compute subregion” of the static_loop node, as shown in the figure.

The next step is to connect the outputs/results of our “execute stage”/“compute subregion” to registers, so that we can store temporary variables, and to multiplexers, so that we can feed the “compute subregion” with operands for the next compute cycle.

Finally, we need a finite state machine (FSM) that, for each cycle of the loop, controls which registers should be written and which registers/arguments should be fed to the “compute subregion”.

The FSM in its simplest form can be viewed as a straight chain of states, with each state containing the control signals for controlling the multiplexers and registers. This is modeled with a gamma node, with each region representing one state in the FSM.
@phate Regarding (3.) in your comment above: With a straight line of states, the gamma can be controlled by a counter that increments, moving sequentially through all the regions. David made the observation that if one instead outputs the condition for each region, then one can “branch” to any other state of the FSM, making it a generic FSM.
This functionality is still to be implemented.
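The counter-vs-generic distinction above can be sketched in a few lines. This is an illustrative model only (the `State`/`run_one_cycle` names are hypothetical); in the real design, each gamma region would yield its successor as a control output:

```cpp
#include <vector>

// Sketch of the generic-FSM idea: instead of a counter stepping
// sequentially through the gamma regions, each state carries its own
// successor, so any state can "branch" to any other state.
struct State {
  std::vector<int> controls;  // register store / mux select signals
  int next;                   // index of the successor state
};

// One loop cycle: the active gamma region determines the next state.
int run_one_cycle(const std::vector<State> & fsm, int current)
{
  return fsm[current].next;
}
```

With a plain counter, `next` would always be `current + 1`; storing it per state is what turns the chain into a general FSM.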

@haved
Collaborator

haved commented Jul 19, 2024

@sjalander sounds cool! I like the explicit register node, and I assume its second input (red) is either a reset or set control wire?

My only question about the scheme for now is what kind of input the fsm takes, beyond the current state. Does it come from the last "evaluation" of the computation region, being implicitly latched, or does it have to come from one of the register nodes?

@ProgrammingLouis
Contributor Author

#552 (comment)
@haved

In the current implementation, the fsm takes the state as input, which is connected to a region argument.
It is just connected like this for now, but I'm changing that so that it will be like in the schema.
The fsm will encode the next state in each state subregion and additionally take the predicate as a second input.

Each register has a red input which corresponds to the store input.

@ProgrammingLouis
Contributor Author

#552 (comment)
@phate

For 1., I just added a small test that runs the conversion.
For 2., it is true that at this point the number of states can be determined easily with a first pass, but when implementing more complex scheduling algorithms that will not be the case. What I can do is create the gamma with the maximum possible number of states (subregions) and leave some of them empty.

@haved
Collaborator

haved commented Jul 24, 2024

@ProgrammingLouis that makes sense. I'm trying to understand the fsm structure more generally. Does it always take the previous state + a single predicate, or is that just a side effect of scheduling a theta? (Could it ever take more inputs?)

Is the idea with this scheme eventually to automatically create a "microcode-interpreter" and corresponding "microcode-program" for the theta? Does it make sense to use this scheme for things that are not theta nodes?

(I'm misusing the word "microcode", but the scheme you present here reminds me of the way microinstructions were presented in TDT4160. Using "VLIW" like @sjalander did is more precise)
