More optimized readback stage RTL generation #103
Comments
Many modern compilers implement common subexpression elimination, which effectively makes this a non-issue. I suspect most common simulator and synthesis tools will do this (but unfortunately it is difficult to know for sure).
Yeah I agree with @jamesrbailey. Common sub-expression elimination is an extremely common optimization that I know is used extensively in all the modern HDL compilers I have used. AMD's Vivado calls it LUT combining, Design Compiler implements this in their opt stage, and similar for other popular toolchains. Unless there is a benchmark you can share that clearly demonstrates a meaningful performance improvement, I would prefer to not modify this as it would add unnecessary complexity to the logic generator templating system.
Hi,
The "counter_q + 1" expression was scattered across the FSM in maybe 30-40 places. The only thing I did was introduce a "counter increment" combo signal as a parallel assignment and replace all the occurrences of "counter_q + 1" within the FSM with "counter_increment". Without changing anything else (constraints, Vivado version, any synthesis options / strategy), I got around 700 fewer LUTs. The design in total was around 80 K LUTs, so a very stupid change saved just under 1 % of LUTs. Therefore I always try to write the RTL in such a way that it does not duplicate any comparison / arithmetic / AND, OR, XOR ... logic unless it is intended (e.g. some sort of redundancy). I think it is a good approach to try to write the RTL "as-if" schematic, and not rely on various optimizations. Sure, the example above was most likely caused by the complexity of the FSM, where the CSE most likely could not locate the common logic due to the really enormous decoder of that FSM, but still. I will try to benchmark this in DC on some dummy generated register map that has a big read-back stage and let you know.
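The refactoring described in this comment can be sketched roughly as follows. This is not the original FSM, which was not posted: the module name, states, ports, and the 8-bit counter width are all hypothetical. Only the names "counter_q" and "counter_increment" come from the comment.

```systemverilog
// Hypothetical sketch: everything except counter_q / counter_increment
// is an illustrative assumption, not the author's actual design.
module counter_fsm_sketch (
    input  logic       clk,
    input  logic       rst_n,
    input  logic       start,
    input  logic       tick,
    output logic [7:0] counter_q
);
    typedef enum logic [1:0] {IDLE, RUN, DRAIN} state_t;
    state_t     state_q, state_d;
    logic [7:0] counter_d;

    // The incrementer is written exactly once, as a parallel assignment.
    logic [7:0] counter_increment;
    assign counter_increment = counter_q + 8'd1;

    always_comb begin
        // Defaults: hold state and counter.
        state_d   = state_q;
        counter_d = counter_q;
        case (state_q)
            IDLE: begin
                if (start) begin
                    counter_d = '0;
                    state_d   = RUN;
                end
            end
            RUN: begin
                // Previously written inline as: counter_d = counter_q + 1;
                if (tick) counter_d = counter_increment;
                if (counter_q == 8'hFF) state_d = DRAIN;
            end
            DRAIN: begin
                // Same shared signal reused instead of a second "+ 1".
                counter_d = counter_increment;
                if (!tick) state_d = IDLE;
            end
            default: state_d = IDLE;
        endcase
    end

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            state_q   <= IDLE;
            counter_q <= '0;
        end else begin
            state_q   <= state_d;
            counter_q <= counter_d;
        end
    end
endmodule
```

Whether a given tool would have merged the inline "+ 1" copies on its own depends on the tool and on the surrounding logic, so any area difference would need to be measured rather than assumed.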
Hi @amykyta3, sorry for the long time without a reply. We were chasing tape-out, so I only got to it now. We took one of our blocks and replaced the ORDT-generated register map with PeakRDL as a trial. The RDL source and the before / after versions of the generated RTL: dma_before.sv.txt The synthesis results from a commercial ASIC synthesizer differ slightly (first number is total area in um^2): Before:
After:
The diff on this block is very small, about 1% of total area, so not very effective. Still, if the effort to do I can download the Modelsim version that you were referring to, and run the regression suite too, so When I applied these "hand-fixes" and also used async reset I get:
while ORDT AHB register map gives us something like:
We use PeakRDL 0.11.0 and the regblock that is packaged with it. The IFC is APB3. The 300 gate difference is most likely OK for us, it's just curiosity about where it comes from. PS: I sent you a LinkedIn invite. Could we have a chat about PeakRDL?
I am trying to run PeakRDL regblock on a block that we are designing. Currently, we are using ORDT,
but since ORDT is not maintained anymore, we might be looking for alternatives.
I see that the readback logic is not very optimal, e.g.
I see that in each stage there is the same AND condition. The problem is that this creates the
same logic multiple times during elaboration, and then also during initial technology mapping.
The logic is then only optimized out by resource sharing algorithms during synthesis.
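The snippet that originally accompanied this description did not survive here, so the following is only an illustrative sketch of the pattern being described, not the actual generated output. It assumes a three-register block with hypothetical register names, and it models the register strobes as a packed vector for brevity. The signal names "decoded_reg_strb", "decoded_req_is_wr", and "readback_array" are assumed to resemble what the generator emits.

```systemverilog
// Hypothetical sketch of a readback stage in which every entry repeats
// the same "&& !decoded_req_is_wr" term.
module readback_duplicated_sketch (
    input  logic        decoded_req_is_wr,
    input  logic [2:0]  decoded_reg_strb,  // one-hot register select (assumed)
    input  logic [31:0] ctrl_value,        // hypothetical register values
    input  logic [31:0] status_value,
    input  logic [31:0] length_value,
    output logic [31:0] readback_data
);
    logic [31:0] readback_array [3];

    // The read qualifier is written out again in every assignment.
    assign readback_array[0] = (decoded_reg_strb[0] && !decoded_req_is_wr) ? ctrl_value   : '0;
    assign readback_array[1] = (decoded_reg_strb[1] && !decoded_req_is_wr) ? status_value : '0;
    assign readback_array[2] = (decoded_reg_strb[2] && !decoded_req_is_wr) ? length_value : '0;

    // OR-reduce the masked entries into the final read data.
    always_comb begin
        readback_data = '0;
        for (int i = 0; i < 3; i++) begin
            readback_data |= readback_array[i];
        end
    end
endmodule
```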
I think this solution is sub-optimal. In my experience, the more you rely on resource sharing or
synthesis optimizations to do their job, the longer the synthesis takes. If we choose PeakRDL for
our next ASIC project, and design 20 blocks with it (1000s of flip-flops in 100s of registers),
such a simple thing might actually cost dozens of extra minutes of synthesis run-time.
I think instead the "common" logic should be abstracted to a temporary signal, e.g.:
This way the RTL directly reflects the structure of the logic as one would draw it on paper.
It might seem a small thing, but for large designs this really results in better turn-around times.
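The proposed snippet was not preserved here, so the sketch below is one possible reading of the suggestion rather than the author's exact code. It assumes a three-register block, models the strobes as a packed vector, and introduces the hypothetical temporary signals "rd_req" and "rd_strb". The read qualifier is computed once and every readback entry reuses it.

```systemverilog
// Hypothetical sketch: the common read condition is hoisted into
// temporary signals instead of being repeated per register.
module readback_shared_sketch (
    input  logic        decoded_req_is_wr,
    input  logic [2:0]  decoded_reg_strb,  // one-hot register select (assumed)
    input  logic [31:0] ctrl_value,        // hypothetical register values
    input  logic [31:0] status_value,
    input  logic [31:0] length_value,
    output logic [31:0] readback_data
);
    logic [31:0] readback_array [3];
    logic        rd_req;   // hypothetical name: high for read requests
    logic [2:0]  rd_strb;  // hypothetical name: strobes qualified by rd_req

    // Common term written once.
    assign rd_req  = !decoded_req_is_wr;
    assign rd_strb = decoded_reg_strb & {3{rd_req}};

    // Each entry now depends only on its own pre-qualified strobe.
    assign readback_array[0] = rd_strb[0] ? ctrl_value   : '0;
    assign readback_array[1] = rd_strb[1] ? status_value : '0;
    assign readback_array[2] = rd_strb[2] ? length_value : '0;

    // OR-reduce the masked entries into the final read data.
    always_comb begin
        readback_data = '0;
        for (int i = 0; i < 3; i++) begin
            readback_data |= readback_array[i];
        end
    end
endmodule
```

Functionally this is equivalent to repeating the condition in each assignment. The difference is only in how much duplicated logic the tool sees at elaboration, so any run-time or area benefit would have to be confirmed by a benchmark.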