Power has been an important design challenge for quite some time. Leakage power started to grow in 90nm, and by 65nm it became a severe design issue.

We have built many techniques to address leakage, most notably power gating. These techniques are complex and have an impact on the design as a whole. FinFET technologies are seen as a boon to this issue of leakage. There are references that quote that leakage has been reduced by 60% compared to planar transistors. Does this mean that low power is not a focus anymore? Can we abandon all the low power techniques we have built over time?

Not so fast. To answer these questions, let’s look at an example design. Let’s assume that this is in some non-finFET process and has 1 unit of power in leakage. Assume that the dynamic power component is 9 units of power (90% dynamic and 10% leakage). Now assume that the chip is active during 10% of the time and in idle mode the remainder of the time. We can simply do the math and compute the total power as 1.8 units of power. Out of this, power is evenly split between active mode power and idle mode power (0.9 units each). This is the reason for employing leakage reduction techniques at this node.

Now, if the chip were designed in a finFET process, the leakage goes down by 60% to 0.3 units. Due to gate capacitance increase and the voltages not scaling with the process (refer), let us assume that the dynamic power goes up by 25%. Then the dynamic power is 11.25 units of power. If you assume that the activity profile of the chip has not changed, then the total power is about 1.48 units, which is 18% lower power compared to the non-finFET process. If we examine the individual components, the contribution of active power is 76% and the idle power contribution is only 24%. While the idle mode power has gone down considerably by 78% to just 0.36 units, dynamic power has increased by 52% to 1.13 power units.

What are the conclusions from this analysis? Adopting finFET processes is an effective way to reducing power. But instead of throwing away all the low power techniques built from previous processes, the focus should be on mitigating dynamic power as it has become the dominant component of total power. Dynamic power reduction requires deeper understanding of the design and eliminating unnecessary activity that does not contribute to the functionality. Now, who in the project team is best capable of driving this activity?

Originally posted on Semiconductor Engineering.

# Calypto Deep Sequential Analysis Expertise

Calypto was founded in 2002 and our first product was SLEC (Sequential Logical Equivalence Checker). The core Calypto technology is deep sequential analysis. This patented technology has been applied to the verification domain with our SLEC product family. It was then applied to the power analysis and optimization domain with our PowerPro platform.

SLEC formally verifies that two sequentially different designs are functionally equivalent. SLEC relies on deep sequential analysis for complete proof of designs with complex states. SLEC ensures that any micro-architectural changes to the RTL blocks — such as merging or splitting flops, re-encoding states, re-pipelining, retiming, and clock gating — can be exhaustively verified. Our deep sequential analysis capability enables SLEC to also verify a system model written in C++/SystemC against an RTL design despite the sequential differences between them (different throughput and latency).

Analyzing and optimizing power at the RTL can significantly reduce power, but it is challenging, especially when done manually. The widely used practice of inserting, for instance, clock gating may not be very effective without considering the sequential nature of the whole design.

PowerPro achieves superior quality of results (QoR) because it performs an exhaustive deep sequential analysis of the design, finding all the power optimization opportunities for datapath, memories, and registers. The deep sequential analysis process is purely functional and does not rely on stimulus, such as FSDBs or VCDs files. This is the reason why the PowerPro platform can be deployed early in the design cycle even when realistic stimulus are not available for the design. If the designer provides stimulus, then PowerPro will take into account the stimulus only after the deep sequential analysis process; thereby reducing the logical expression to minimize area overhead and maximize power savings.

PowerPro performs an exhaustive deep sequential analysis to identify writes that are either unobservable downstream or have the same value in consecutive cycles. Gating of the first type of redundant writes is called Observability Based Clock Gating and the second type of redundant writes is called Stability Based Clock Gating.

Our PowerAdviser flow relies on deep sequential analysis to provide information for additional power savings that the user can realize through manual changes. By using this flow we have achieved considerable power reduction. Deep sequential analysis can help explore other power saving transformations; such as, flop sharing, flop cloning, reset optimization, and memory caches.

Other existing techniques, such as structural pattern matching, enable forwarding technique, simulation driven, or limited sequential analysis (e.g., register analysis one cycle before and after) do not provide a complete view of the whole design, which is necessary to find all the potential power optimizations.

Written by Jerome Bortolami, Senior Field Application Engineer at Calypto Design Systems.

# Verifying Power Optimized Designs with SLEC and LEC

PowerPro Is an RTL-to-RTL Tool

PowerPro is Calypto’s automated power reduction tool. It reduces power in digital designs by gating the operation of registers and memories. This is done at the RTL. RTL code is read and power-optimized RTL is written.

PowerPro Low Power Platform

PowerPro Uses Deep Sequential Analysis

PowerPro achieves superior quality of results (QoR) — i. e., lower power — than other power optimization tools.  It does this by using deep sequential analysis.

All power optimization tools, including PowerPro, can perform combinational optimization. This is where there is an opportunity to gate a register clock input, based on the combinational logic that is feeding the register’s data input. This method works well, and it does not alter the logic behavior at the register. The problem is that it leaves additional power saving opportunities on the table.

Sequential optimization utilizes opportunities for power savings that span across the register boundaries.  PowerPro performs deep sequential analysis of the design. It can do an exhaustive analysis, finding all possible optimizations. It rejects some of the possible moves based on analysis of area and timing constraints as well as the power consumption of the move’s added logic.

Sequential Optimization Changes the Behavior at Internal FF Outputs

By analyzing across register boundaries, PowerPro implements optimizations that change the functionality at internal registers — but maintains the same functionality at the outputs.

Here is a simple example.  Consider two registers: a sending register (connected to din) and a receiving register (connected to dout).  The receiving register has a feedback multiplexer, such that when the multiplexer is selected, the register output is fed back to its input.  The receiving register does not always “observe” its input, depending on the selected input of the multiplexer.  So when the multiplexer is not selected, the sending register is not observed.  In this case, the multiplexer is replaced with a clock gater (blue).  The clock gater prevents the receiving register from being clocked when the data is not valid, saving power.  This is conventional combinational clock gating.

When the sending register is not observed, PowerPro can gate it to save power by adding a clock gater (red).  This analysis crosses the valid register boundary, so it is sequential clock gating.   The sending register now has different states when gated from the original design, but the output of the receiving register is unchanged.

Normal LEC Will Not Work with Sequential Clock Gating

When we modify a design’s source code, such as when creating RTL and netlists, it is standard design practice to carefully check the resulting new source code.   Many designers check their logic synthesis with a formal logic equivalency checker (LEC).  When modifying a gate-level netlist for a timing ECO, LEC is used to ensure that the design has not been changed functionally.  The LEC tool analyzes two logic designs and checks them for logic equivalency.  It matches each register between the two representations, and it matches the combinational logic at each register data input.

As we have seen, PowerPro changes the functionality at internal registers.  This poses a problem for conventional LEC tools.  PowerPro may add some registers, resulting in the LEC tool having a mismatch.  It also changes the combinational logic functions at some registers.  So a simple LEC tool cannot handle a sequentially optimized design.

SLEC Can and Does Verify Sequential Clock Gating

As stated earlier, it is good design practice to verify the logical equivalence between the original RTL and the PowerPro optimized RTL.  Although a combinational LEC tool cannot match the two designs, a sequential logical equivalence checker can.  Calypto has just such a tool, called SLEC. SLEC is a formal equivalence checking tool that has a number of special capabilities, and checking PowerPro-optimized RTL is one of its features.

What If I Need to Use a (Combinational) LEC Tool?

In general, we recommend using SLEC to check the power-optimized RTL.  Some users may have a requirement to use a specific combinational LEC tool for verification.  There is a workaround to allow this, but there is a limitation.

PowerPro has an optional capability called CG_OVERRIDE.  This command instructs PowerPro to add an override term to enable logic that disables the optimizations.  The control signal becomes a top level port in the optimized design.  When running the combinational LEC tool, assert the CG_OVERRIDE pin, and you should get a match with your combinational LEC tool.  The downside to this is that the combinational LEC tool does not check the enabled logic.  You need a sequential LEC tool — i.e., SLEC — for that.

Written by Rob Eccles, Field Application Engineer at Calypto Design Systems.

# Avoiding Ambiguity in Your C and C++ Code

In last decade we all witnessed significant growth in the use of C++ and C languages for hardware design. The main reason for this trend was to reduce the cost and time of verification. RTL verification is a lengthy process and bugs are often found too late in the design cycle. Another reason (and benefit) is that verification is extremely fast in C++ designs. As C++ is more abstract than traditional RTL languages like VHDL and Verilog, more hardware can be described in a smaller number of lines, and it is therefore easier to maintain. Also, the maturity of today’s HLS tools establishes confidence in getting production quality RTL in a much shorter amount of time compared to handwriting RTL code in Verilog or VHDL.

However, robust verification in the HLS flow will always be a necessity. Simulation of the HLS-generated RTL ensures that the hardware described in C++ code was properly translated to RTL, even though it is test vector dependent and cannot cover all possible scenarios in the limited amount of time available due to tight project schedules. As C++ and RTL exist at totally different abstraction levels, there is a high possibility of language ambiguity, which may result in the RTL behaving differently than what was intended. Hence, it is always best practice to write clean, unambiguous, platform independent C++ code to get the best outcome from HLS tool. Otherwise we cannot eliminate the chances of costly bugs that may appear in a later stage of the design cycle. Here I will briefly touch on some of those ambiguous C and C++ code situations that should be strictly avoided.

One of the most common sources of ambiguous C++ code appears during an array access. In hardware described by C++, an array can be used as temporary storage for data. HLS tools can translate these arrays to a storage element, such as a group-of registers, or map them to physical memories if the array size is large. Usually, users can control the mapping of arrays to registers or memory in the generated RTL. The number of elements in the array determines the depth of the memory or FIFO in the generated RTL. As C or C++ code become larger, it is often possible that array indexes are out-of-bounds while reading and/or writing from an array. Consider the memory read operation below inside a loop:

 for(i=0; i++; i< a*b)    xyz = my_array[i]  // There is possibility i > size of my_array

This array bound read is an undefined behavior, as per the language reference. Hence, the HLS tool will generate hardware according to how it handles the situation. This may lead to unintended hardware and simulation mismatches between the RTL and C++.

Software designers are familiar with the use of memory debugging tools, such as Purify and Valgrind, to detect such conditions; however these tools depend on the specific test vectors. If the test vectors are insufficient to exercise the out-of-bound array access situation, problems will go undetected. Hence a more formal approach is required.

I have seen several users of Calypto’s Sequential equivalence checker tool (known as SLEC) effectively use it to detect such issues before passing the code through the HLS tool. The trick is to use SLEC to verify the C++ design against itself. Normally, if you compare the source C++ design against itself you should not see any falsification. For an array bound read scenario, SLEC introduces an X (unknown) in the design database (see the figure below). When running SLEC, those Xs are propagated to the next mapped register or design output to show falsification (because SLEC shows falsification for unknowns). SLEC also generates a counter-example waveform, which helps engineers pinpoint the issue and fix the code.

The same technique can be used to detect another undesirable code, which is know as a “UMR” (or uninitialized memory read). See the example below. You cannot expect any HLS tool to meaningfully deal with such a condition. Using the same self-checking technique described above, SLEC can show the falsification of X vs X and help detect such conditions.

 int something; if(condition) { } else {    if(conditon_1) {           somthing ++;  // UMR example    } else {    } }

A few more ambiguous coding  examples are: “divide by zero,” “incomplete Switch-Case,” and “illegal shift operation.” All of these can generate unintended RTL behavior, which can introduce bugs in the RTL code. Hence it is better to detect and fix these in the source C++ code itself.

The good news is that the recent progress on formal verification in the C++ domain enables us to perform C++ property checking. The formal property checker tool by itself can find undesirable code in a fraction of the time it takes to find it in RTL code, and it can pinpoint the location of the error code for the engineer. Other than the source C or C++ design, no other inputs are necessary for C formal property checker tools. Users do not need to spend time writing testbenches to detect these scenarios. By following the standard practices and using a formal tool, engineers can establish a robust and errorless flow from C++ to RTL and beyond, which surely boosts the productivity of the chip design process.

Written by Uday Das, Senior Field application engineer at Calypto Design Systems.

# Utilizing P2P Interfaces in SystemC Designs to Verify RTL

Catapult provides designers the ability to explore many different implementations of their C++ or SystemC design in a short period of time. With this ability to rapidly create different implementations, verification becomes a key component in the process.

Catapult provides the SCVerify environment,  which allows one to verify the HLS-synthesized RTL code in the same testbench used to verify the C++ or SystemC model.  Therefore, the testbench environment is a critical part of the overall HLS design experience for generating correct-by-construction RTL.

There are a few key items to consider when developing a testbench for a SystemC design that will be synthesized to RTL by Catapult.  The testbench must be able to interact with the RTL to verify its functionality and performance.

1) The testbench should validate the “correctness of the design”:

As the source is modified you need to know that you haven’t changed the functionality of the design.

2) The testbench should be able to handle the synchronization, or handshaking, when interfacing with the resultant RTL model:

The RTL may not be able to accept a continuous stream of data; it may need to accept the data in bursts; or it may need large gaps between data samples.  The testbench and RTL will thus need some type of handshaking to properly exchange the data.

3) The testbench should not limit the performance of the resulting RTL model:

The testbench should be “elastic” enough to process, accept, and/or provide the data to and from the RTL so that the actual performance is limited by the RTL and not the testbench.

A recent project highlighted the need for having proper synchronization and elasticity in the testbench.  With Catapult, a library of Modular IO elements is provided to give the synchronization that is needed.  We refer to this as the P2P I/O.

Initial SystemC

The initial SystemC design used signals as the means to connect the testbench to the block that would be synthesized. In the SystemC simulation this was sufficient. A snippet of code below shows this. Shown below are the sc_module declaration of the tx_sys, the portion of the testbench where the tx_sys is instantiated, and the sc_signals that are used to interconnect this tx_sys with the rest of the testbench.

However, the use of sc_signals and sc_in/out does not allow for any synchronization to take place. This causes the RTL simulation in SCVerify to fail.  Upon inspection of the waveforms of the tx_data feeding into the tx_sys, we can see that a new input is being produced every clock cycle by the testbench.

The tx_sys code can process a value on tx_data only every 8 clock cycles.  A snippet of code that reads tx_data shows that once a value is read, each bit of that 8-bit value is then stored in an array (that is mapped to a memory).

Realizing there was no synchronization, P2P interfaces were introduced into the design and testbench. These interfaces are not difficult to add to the design; basically, all you do is make a change in the port declaration in the sc_module and the “interconnect” between the modules in the testbench.

In the waveforms below, synchronization between the tx_sys and the testbench can be seen through the use of ready (rdy) and valid (vld) signals that are already present.  The “rdy” and “vld” signals provide the handshaking necessary for the testbench to provide the input data stream (tx_data_i_dat) to the RTL model.  This handshaking, or synchronization, will allow the testbench to provide data only when the RTL is able to use it (or consume it). There is a new input on tx_data every nine clocks.

The testbench and design (tx_sys) now appear to be “talking” to each other properly on the input side.  We now shift to the outputs, the code that generates the tx_i and tx_q outputs is expected to generate an output every clock cycle.   The data for the outputs has been stored in an array (memory), and the code is reading these out to the output ports.

The waveforms show that these outputs are “stuttering” when they were expected to be available every clock cycle.  This stuttering is a result of the testbench not being able to consume the data as fast as tx_sys can produce it.

Adding P2P FIFOs as an Interconnect

To eliminate the stuttering, FIFOs are used to connect the tx_sys to the testbench to allow for some elasticity in the production and consumption of data.  The code snippet below shows the addition of FIFOs.  The template parameters after the type are:

• Depth: since this is in the testbench, the depth can be quite large.
• The other two parameters indicate the sense of the clock and reset.

Looking at the final waveforms after the FIFOs have been added, we see the final performance of the tx_sys RTL model; the testbench is not limiting the performance  and the tx_sys is able to output a sample every clock.

SummarySystemC designs require some effort in creating a testbench that can interface to the resulting, synthesized RTL model.  The addition of P2P interfaces and FIFOs allows the designer to utilize SCVerify and enables the existing testbench to simulate the performance of the original SystemC model as well as the synthesized RTL model.More information on the Modular IO and P2P interfaces can be found in the Catapult C Synthesis SystemC User’s Guide and the ccs_psp.h header file available in the Catapult Installation directory (\$MGC_HOME/shared/include/ccs_p2p directory).

Written by Wayne Marking, Field Applications Engineer at Calypto Design Systems