
Statically scheduled theta conversion in rhls #552

Open · wants to merge 6 commits into master

Conversation

ProgrammingLouis
Contributor

Statically scheduled theta conversion in rhls

THIS IS STILL A WORK IN PROGRESS: the goal of this PR is to present the basic mechanisms and implementation and to get feedback.

The final goal is to have a statically scheduled version of rhls to compare with the existing dynamic version. Furthermore, one could propose mixed statically/dynamically scheduled hardware that benefits from the performance of dynamic HLS and from the small circuit area of static HLS.

Theta nodes are converted into a new jlm::static_hls::loop_node. The jlm::static_hls::loop_node is a structural node that contains two subregions:

  1. The control subregion, which contains registers, muxes, and an fsm_node (Finite State Machine)
  2. The compute subregion, which contains the compute operations
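To make the split described above concrete, here is a minimal illustrative model of the loop_node layout. All type names (`Node`, `Region`, `LoopNode`) are hypothetical stand-ins for the real jlm RVSDG classes, not the actual API:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the static_hls::loop_node structure.
struct Node { std::string kind; };           // "reg", "mux", "fsm", "add", ...
struct Region { std::vector<Node> nodes; };  // stands in for an RVSDG region

struct LoopNode {
  Region control;  // registers, muxes, and the fsm node
  Region compute;  // the arithmetic/memory operations
};
```

The key property is that control logic (registers, muxes, fsm) and datapath operations live in separate subregions of a single structural node.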

The implemented scheduling algorithm is made to be as simple as possible (and is really not efficient) at this point.
Every node in the original theta subregion goes through the loop_node::add_node method, which either

  • Adds the node to the compute subregion, adds a register in the control subregion for each of its outputs, and connects each of its inputs via new muxes
  • Or, if the node's operation is already implemented in the compute subregion, adds a register in the control subregion for each of its outputs and adds each input to the corresponding mux

An fsm_state is also created for each original node.
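The two cases of add_node can be sketched with a small illustrative model. The class and member names here are hypothetical (operation names stand in for RVSDG operations, and counters stand in for the real register/mux wiring):

```cpp
#include <map>
#include <string>

// Illustrative model of the loop_node::add_node decision, not the jlm code.
struct Scheduler {
  std::map<std::string, int> compute_ops;  // op -> how many theta nodes map to it
  int registers = 0;                       // registers in the control subregion
  int muxes = 0;                           // muxes in the control subregion

  void add_node(const std::string & op, int n_inputs, int n_outputs) {
    auto it = compute_ops.find(op);
    if (it == compute_ops.end()) {
      // New operation: place it in the compute subregion, create one register
      // per output and one mux per input in the control subregion.
      compute_ops[op] = 1;
      registers += n_outputs;
      muxes += n_inputs;
    } else {
      // Operation already implemented: reuse it, add only the output
      // registers and route the inputs through the existing muxes.
      it->second++;
      registers += n_outputs;
    }
  }
};
```

Adding two nodes with the same operation therefore creates the muxes only once, which is where the area saving of the static version comes from.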

The fsm.hpp file contains 3 main classes for representing the finite state machine:

  • fsm_node_temp: When building the fsm, this structural node is created by the fsm_builder, and structural outputs are incrementally added to it to connect the registers' and muxes' control inputs
  • fsm_state: fsm_states are regions that represent a state of the fsm. They contain control constants to set the registers' store inputs and the muxes' control inputs. Their results are connected to the structural outputs of the fsm_node_temp, but they are not subregions of the fsm_node_temp because they are incrementally added. This is a workaround to be able to incrementally add states to the fsm, as well as to add newly connected registers and muxes.
  • fsm_builder: This is the main class for building the fsm. When the building is complete, the fsm_node_temp is deleted and converted to a gamma (see fsm_node_builder::generate_gamma() and loop_node::finalize())
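The builder lifecycle described above (collect states incrementally, then convert to a gamma once their number is known) can be sketched as follows. The names `FsmState`, `FsmBuilder`, `add_state`, and `finalize` are illustrative, not the actual fsm.hpp interface:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One fsm state: the control constants that drive the register store
// inputs and mux select inputs during that cycle.
struct FsmState { std::vector<int> control_signals; };

// Illustrative sketch of the fsm_builder lifecycle.
class FsmBuilder {
  std::vector<FsmState> states_;
public:
  // States are added incrementally while the theta is traversed.
  int add_state(FsmState s) {
    states_.push_back(std::move(s));
    return static_cast<int>(states_.size()) - 1;  // state index
  }
  // Stand-in for generate_gamma(): only now is the number of gamma
  // subregions (= number of fsm states) known.
  std::size_t finalize() const { return states_.size(); }
};
```

The point of the temporary node in the real code is exactly this deferral: a gamma needs its subregion count up front, but the states only become known during traversal.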

This PR does not yet contain the conversion to FIRRTL.

This PR exists to get feedback on this implementation, but it still misses a lot of things that will be added soon in new commits.
Things that I'm planning to add as soon as possible:

  • Encoding the next state of the fsm in each state
  • Renaming of classes and methods (especially backedge argument and result, which are not only used for backedges)
  • Documentation for every class and method
  • A small test case
  • Proper mechanisms to handle the predicate, initial state, and output state

[Figure: fsm]
[Figure: loop_node]


@sjalander
Collaborator

@haved I added you as a reviewer as you should get familiar with most of the jlm code base, and this is a larger contribution to it.

@haved
Collaborator

haved commented Jul 18, 2024

@sjalander I agree, I will look at it. I don't quite understand what program/circuit the figures shown in the PR represent. I assume they are the result of conversion, but what did the theta node look like before transformation? Is static HLS written about somewhere? @ProgrammingLouis

@phate
Owner

phate commented Jul 19, 2024

@sjalander @ProgrammingLouis I spent some time looking at this now, and I have quite a hard time understanding what is going on. To the best of my understanding, here are some high-level comments to start with:

  1. It would be nice to have 1 or 2 simple(!) unit tests where I can see the theta conversion in action. This would enable me to step through the conversion process to better understand what exactly is going on. It would also allow me to be able to concretely map the input to the conversion (theta) to the output of the conversion (static scheduled HLS loop) and compare them.
  2. (Restating what I got from the code): The finite state machine is implemented using a gamma. The reason why the gamma is only created at the end is that you need to have seen all the nodes in order to know all the finite state machine states. In the end, each state is represented by a subregion in the gamma, which has some control constants in it that drive the continuation of the state machine. The way you keep track of these states throughout the conversion is to have a dummy structural node (fsm_node_temp) and dummy regions (fsm_state), which are both removed again once you create the gamma at the end. This is also my biggest issue with the code right now (from a high-level view): You are currently (mis-)using structural nodes and regions to keep track of information throughout the conversion. The reason why you seem to do this is that in order to create the finite state machine gamma node, you need to know the number of subregions, i.e., the number of states, it is supposed to have, but you cannot know this before you have traversed the graph. Looking at the code, an fsm_state is added: (i) for every input of the original theta and (ii) for every node in the original theta region. Thus, could you not just precompute the number of states before conversion by simply summing up the number of inputs and the number of nodes? If it is not possible to predetermine the number of states (even with an extra pass through the RVSDG), I would like you to create your own data structures to keep track of this state throughout the conversion instead of misusing the RVSDG data structures. We do this similarly in other transformations, such as in Dead Node Elimination or Memory State Encoder.
  3. In the image above, you have a turquoise back-edge at the gamma (from a gamma output to a gamma input). I do not see this back-edge realized in the code. Is this a mistake in the drawing or does this exist in the code as well? How is it realized?
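The precomputation suggested in point 2 would be a one-liner; a hedged sketch with a hypothetical helper name (the real counts would come from the theta node's inputs and its subregion's nodes):

```cpp
#include <cstddef>

// Sketch of the suggestion above: one fsm state is created per theta input
// and per node in the theta region, so the gamma's subregion count can be
// precomputed in a single pass. Hypothetical function, not jlm API.
inline std::size_t predicted_state_count(std::size_t theta_inputs,
                                         std::size_t theta_nodes)
{
  return theta_inputs + theta_nodes;
}
```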

@sjalander
Collaborator

@phate @haved @ProgrammingLouis @davidmetz
Static scheduling can be compared with instruction scheduling for a VLIW (very long instruction word) architecture. In a VLIW, you have a given set of operations that can be scheduled in a single cycle. It is the compiler's work to select which operations are included in each cycle, to route results to a register file, and to read these back to feed the operations.

In our case, we have a bit more freedom, as the architecture is not predefined. So we can choose how many operations can be performed in a single cycle, i.e., the number of adds/subs, multiplications, memory operations, etc. We can think of this as a parallel ALU/MEM/BRANCH unit that in a conventional pipeline represents the execute stage. This part is represented by the “compute subregion” of the static_loop node, as shown in the figure.

The next step is to connect the outputs/results of our “execute stage”/“compute subregion” to registers, so that we can store temporary variables, and to multiplexers, so that we can feed the “compute subregion” with operands for the next compute cycle.

Finally, we need a finite state machine (FSM) that, for each cycle of the loop, controls which registers should be written and which registers/arguments should be fed to the “compute subregion”.

The FSM in its simplest form can be viewed as a straight chain of states, with each state containing the control signals for controlling the multiplexers and registers. This is modeled with a gamma node, with each region representing one state in the FSM.
@phate Regarding (3.) in your comment above: With a straight line of states, the gamma can be controlled by a counter that increments, moving sequentially through all the regions. David made the observation that if one instead outputs the condition for each region, then one can “branch” to any other state of the FSM, making it a generic FSM.
This functionality is still to be implemented.
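The counter-vs-generic distinction above can be sketched in a few lines. This is an illustrative model only (the `State`/`run_one_cycle` names are hypothetical); in the real design, each gamma region would yield its successor as a control output:

```cpp
#include <vector>

// Sketch of the generic-FSM idea: instead of a counter stepping
// sequentially through the gamma regions, each state carries its own
// successor, so any state can "branch" to any other state.
struct State {
  std::vector<int> controls;  // register store / mux select signals
  int next;                   // index of the successor state
};

// One loop cycle: the active gamma region determines the next state.
int run_one_cycle(const std::vector<State> & fsm, int current)
{
  return fsm[current].next;
}
```

With a plain counter, `next` would always be `current + 1`; storing it per state is what turns the chain into a general FSM.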

@haved
Collaborator

haved commented Jul 19, 2024

@sjalander sounds cool! I like the explicit register node, and I assume its second input (red) is either a reset or set control wire?

My only question about the scheme for now is what kind of input the fsm takes, beyond the current state. Does it come from the last "evaluation" of the computation region, being implicitly latched, or does it have to come from one of the register nodes?

@ProgrammingLouis
Contributor Author

#552 (comment)
@haved

In the current implementation, the fsm takes the state as input, which is connected to a region argument.
It is just connected like this for now, but I'm changing that so that it will be like in the schema.
The fsm will encode the next state in each state subregion and additionally take the predicate as a second input.

Each register has a red input which corresponds to the store input.

@ProgrammingLouis
Contributor Author

#552 (comment)
@phate

For 1., I just added a small test that runs the conversion.
For 2., it is true that at this point the number of states can be determined easily with a first pass, but when implementing more complex scheduling algorithms that will not be the case. What I can do is create the gamma with the maximum possible number of states (subregions) and leave some of them empty.

@haved
Collaborator

haved commented Jul 24, 2024

@ProgrammingLouis that makes sense. I'm trying to understand the fsm structure more generally. Does it always take the previous state + a single predicate, or is that just a side effect of scheduling a theta? (Could it ever take more inputs?)

Is the idea with this scheme eventually to automatically create a "microcode-interpreter" and corresponding "microcode-program" for the theta? Does it make sense to use this scheme for things that are not theta nodes?

(I'm misusing the word "microcode", but the scheme you present here reminds me of the way microinstructions were presented in TDT4160. Using "VLIW" like @sjalander did is more precise)
