Calypto Deep Sequential Analysis Expertise

Calypto was founded in 2002 and our first product was SLEC (Sequential Logical Equivalence Checker). The core Calypto technology is deep sequential analysis. This patented technology has been applied to the verification domain with our SLEC product family. It was then applied to the power analysis and optimization domain with our PowerPro platform.

 

SLEC formally verifies that two sequentially different designs are functionally equivalent. SLEC relies on deep sequential analysis for complete proof of designs with complex states. SLEC ensures that any micro-architectural changes to the RTL blocks — such as merging or splitting flops, re-encoding states, re-pipelining, retiming, and clock gating — can be exhaustively verified. Our deep sequential analysis capability enables SLEC to also verify a system model written in C++/SystemC against an RTL design despite the sequential differences between them (different throughput and latency).

 

Analyzing and optimizing power at the RTL can significantly reduce power, but it is challenging, especially when done manually. Widely used practices, such as inserting clock gating, may not be very effective unless the sequential behavior of the whole design is taken into account.

 

PowerPro achieves superior quality of results (QoR) because it performs an exhaustive deep sequential analysis of the design, finding all the power optimization opportunities for datapaths, memories, and registers. The deep sequential analysis is purely functional and does not rely on stimulus, such as FSDB or VCD files. This is why the PowerPro platform can be deployed early in the design cycle, even before realistic stimulus is available for the design. If the designer does provide stimulus, PowerPro takes it into account only after the deep sequential analysis, using it to reduce the logical expressions and thereby minimize area overhead and maximize power savings.

 

PowerPro performs an exhaustive deep sequential analysis to identify writes that are either unobservable downstream or have the same value in consecutive cycles. Gating the first type of redundant write is called Observability Based Clock Gating; gating the second type is called Stability Based Clock Gating.
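To make the two kinds of redundant writes concrete, here is a small C++ sketch; it is purely illustrative (not PowerPro input or output), and the function and variable names are made up for the example.

#include <cstddef>
#include <vector>

// Illustrative only: the redundancy patterns that deep sequential analysis
// identifies, written out in C++ so the two cases are explicit.
int run(const std::vector<int>& in, const std::vector<bool>& read_enable)
{
    int reg = 0, out = 0;
    for (std::size_t cycle = 0; cycle < in.size(); ++cycle) {
        int next = in[cycle] * 2;   // stand-in for the datapath feeding the register
        // Stability: if next equals the value already held in reg, clocking the
        // register this cycle is redundant (same value in consecutive cycles).
        reg = next;
        // Observability: reg is consumed only when read_enable is set, so writes
        // in cycles where the value is never read downstream are unobservable.
        if (read_enable[cycle])
            out += reg;
    }
    return out;
}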

 

Our PowerAdviser flow relies on deep sequential analysis to provide information for additional power savings that the user can realize through manual changes. Using this flow, we have achieved considerable power reductions. Deep sequential analysis can also help explore other power-saving transformations, such as flop sharing, flop cloning, reset optimization, and memory caches.

 

I would recommend reading PowerAdviser: An RTL Power Platform for Interactive Sequential Optimizations, a case study available for download on our webpage.

 

Other existing techniques, such as structural pattern matching, enable forwarding, simulation-driven analysis, or limited sequential analysis (e.g., examining registers one cycle before and after), do not provide the complete view of the whole design that is needed to find all the potential power optimizations.

 

Written by Jerome Bortolami, Senior Field Application Engineer at Calypto Design Systems.

Verifying Power Optimized Designs with SLEC and LEC

PowerPro Is an RTL-to-RTL Tool

 

PowerPro is Calypto’s automated power reduction tool. It reduces power in digital designs by gating the operation of registers and memories, and it works entirely at the RTL: it reads RTL code and writes power-optimized RTL.

 

[Figures: PowerPro low power platform; PowerPro integrated design flows]

 


PowerPro Uses Deep Sequential Analysis

 

PowerPro achieves better quality of results (QoR), i.e., lower power, than other power optimization tools. It does this by using deep sequential analysis.

 

All power optimization tools, including PowerPro, can perform combinational optimization, where a register’s clock input is gated based on the combinational logic feeding the register’s data input. This method works well, and it does not alter the logic behavior at the register. The problem is that it leaves additional power-saving opportunities on the table.

 

Sequential optimization exploits power-saving opportunities that span register boundaries. PowerPro performs deep sequential analysis of the design. It can do an exhaustive analysis, finding all possible optimizations, and it rejects some of the possible moves based on analysis of area and timing constraints as well as the power consumed by the move’s added logic.

 

Sequential Optimization Changes the Behavior at Internal FF Outputs

 

By analyzing across register boundaries, PowerPro implements optimizations that change the functionality at internal registers — but maintains the same functionality at the outputs.

 

[Figure: combinational (blue) and sequential (red) clock gating example]

 

Here is a simple example.  Consider two registers: a sending register (connected to din) and a receiving register (connected to dout).  The receiving register has a feedback multiplexer: when new data is not selected, the register’s own output is fed back to its input.  The receiving register therefore does not always “observe” its data input; when the multiplexer does not select new data, the sending register is not observed.  In this case, the multiplexer is replaced with a clock gater (blue).  The clock gater prevents the receiving register from being clocked when the data is not valid, saving power.  This is conventional combinational clock gating.

 

When the sending register is not observed, PowerPro can gate it to save power by adding a clock gater (red).  This analysis crosses a register boundary, so it is sequential clock gating.  When gated, the sending register now holds different states than in the original design, but the output of the receiving register is unchanged.
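For readers who prefer code to block diagrams, the sketch below models the original, un-gated structure of this example in SystemC: a sending register feeding a receiving register whose input multiplexer either takes new data (sel high) or feeds the receiver’s own output back. Module, port, and signal names are invented for illustration; this is not PowerPro input or output.

#include <systemc.h>

// Minimal SystemC sketch of the example's original structure, before any clock
// gating is applied. Names are illustrative only.
SC_MODULE(send_recv) {
  sc_in<bool>            clk;
  sc_in<bool>            sel;     // high: receiver takes new data; low: receiver holds
  sc_in<sc_uint<8> >     din;
  sc_out<sc_uint<8> >    dout;

  sc_signal<sc_uint<8> > send_q, recv_q;

  void regs() {
    send_q.write(din.read());                                  // sending register
    recv_q.write(sel.read() ? send_q.read() : recv_q.read());  // feedback mux + receiving register
  }
  void drive_out() { dout.write(recv_q.read()); }

  SC_CTOR(send_recv) {
    SC_METHOD(regs);      sensitive << clk.pos();
    SC_METHOD(drive_out); sensitive << recv_q;
  }
};

Combinational clock gating turns the feedback multiplexer into a clock gate on the receiving register; the sequential move additionally stops clocking the sending register in the cycles where sel guarantees its value will never be observed.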

 

Normal LEC Will Not Work with Sequential Clock Gating

 

Whenever we modify a design’s source description, whether RTL or a netlist, it is standard design practice to carefully check the resulting new source.  Many designers check their logic synthesis with a formal logic equivalence checker (LEC).  When modifying a gate-level netlist for a timing ECO, a LEC is used to ensure that the design has not been changed functionally.  The LEC tool analyzes two logic designs and checks them for logic equivalence.  It matches each register between the two representations, and it matches the combinational logic at each register’s data input.

 

As we have seen, PowerPro changes the functionality at internal registers.  This poses a problem for conventional LEC tools.  PowerPro may add some registers, which causes register-matching mismatches in the LEC tool.  It also changes the combinational logic functions at some registers.  So a simple LEC tool cannot handle a sequentially optimized design.

 

SLEC Can and Does Verify Sequential Clock Gating

 

As stated earlier, it is good design practice to verify the logical equivalence between the original RTL and the PowerPro optimized RTL.  Although a combinational LEC tool cannot match the two designs, a sequential logical equivalence checker can.  Calypto has just such a tool, called SLEC. SLEC is a formal equivalence checking tool that has a number of special capabilities, and checking PowerPro-optimized RTL is one of its features.

 

What If I Need to Use a (Combinational) LEC Tool?

 

In general, we recommend using SLEC to check the power-optimized RTL.  Some users may have a requirement to use a specific combinational LEC tool for verification.  There is a workaround to allow this, but there is a limitation.

 

PowerPro has an optional capability called CG_OVERRIDE.  This command instructs PowerPro to add an override term to the generated enable logic that disables the optimizations.  The control signal becomes a top-level port in the optimized design.  When running the combinational LEC tool, assert the CG_OVERRIDE pin, and you should get a match with your combinational LEC tool.  The downside is that the combinational LEC tool does not check the enable logic.  You need a sequential LEC tool, i.e., SLEC, for that.

 

Written by Rob Eccles, Field Application Engineer at Calypto Design Systems.

Avoiding Ambiguity in Your C and C++ Code

In the last decade we have all witnessed significant growth in the use of the C++ and C languages for hardware design. The main reason for this trend is to reduce the cost and time of verification. RTL verification is a lengthy process, and bugs are often found too late in the design cycle. Another reason (and benefit) is that verification is extremely fast for C++ designs. As C++ is more abstract than traditional RTL languages like VHDL and Verilog, more hardware can be described in fewer lines, and the code is therefore easier to maintain. Also, the maturity of today’s HLS tools establishes confidence in getting production-quality RTL in a much shorter time than handwriting RTL code in Verilog or VHDL.

 

However, robust verification in the HLS flow will always be a necessity. Simulation of the HLS-generated RTL ensures that the hardware described in the C++ code was properly translated to RTL, even though it is test-vector dependent and cannot cover all possible scenarios in the limited time available under tight project schedules. Because C++ and RTL exist at totally different abstraction levels, there is a high possibility of language ambiguity, which may result in the RTL behaving differently than intended. Hence, it is always best practice to write clean, unambiguous, platform-independent C++ code to get the best outcome from the HLS tool. Otherwise we cannot eliminate the chance of costly bugs appearing at a later stage of the design cycle. Here I will briefly touch on some of the ambiguous C and C++ code situations that should be strictly avoided.

 

One of the most common sources of ambiguous C++ code is array access. In hardware described by C++, an array can be used as temporary storage for data. HLS tools can translate these arrays to storage elements, such as a group of registers, or map them to physical memories if the array size is large. Usually, users can control the mapping of arrays to registers or memory in the generated RTL. The number of elements in the array determines the depth of the memory or FIFO in the generated RTL. As C or C++ code becomes larger, it is quite possible for array indexes to go out of bounds while reading from and/or writing to an array. Consider the memory read operation below inside a loop:

 

for (i = 0; i < a*b; i++)

   xyz = my_array[i];  // i may exceed the size of my_array: out-of-bounds read

This out-of-bounds array read is undefined behavior, per the language standard. Hence, the HLS tool will generate hardware according to however it handles the situation, which may lead to unintended hardware and to simulation mismatches between the RTL and the C++.

 

Software designers are familiar with the use of memory debugging tools, such as Purify and Valgrind, to detect such conditions; however, these tools depend on the specific test vectors. If the test vectors do not exercise the out-of-bounds array access, the problem goes undetected. Hence a more formal approach is required.

 

I have seen several users of Calypto’s sequential equivalence checker, SLEC, use it effectively to detect such issues before passing the code through the HLS tool. The trick is to use SLEC to verify the C++ design against itself. Normally, comparing the source C++ design against itself should not produce any falsification. For an out-of-bounds array read, however, SLEC introduces an X (unknown) into the design database (see the figure below). When SLEC runs, those Xs are propagated to the next mapped register or design output and reported as falsifications (because SLEC treats unknowns as falsifying). SLEC also generates a counter-example waveform, which helps engineers pinpoint the issue and fix the code.

 

[Figure: an X (unknown) injected for the out-of-bounds array read and propagated to a mapped point, causing falsification]

 

The same technique can be used to detect another kind of undesirable code, known as a “UMR” (uninitialized memory read). See the example below. You cannot expect any HLS tool to deal meaningfully with such a condition. Using the same self-checking technique described above, SLEC can show the falsification of X vs. X and help detect such conditions.

 

int something;

if (condition) {

} else {

   if (condition_1) {

      something++;  // UMR: 'something' is read here before it is ever initialized

   } else {

   }

}

 

A few more examples of ambiguous coding are “divide by zero,” “incomplete switch-case,” and “illegal shift operation.” All of these can generate unintended RTL behavior, which can introduce bugs into the RTL code. Hence it is better to detect and fix these in the source C++ code itself.
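As a quick reference, here are hypothetical C++ fragments of each of those patterns; in all three cases the language leaves the behavior undefined or incompletely specified, so the generated hardware depends on how the HLS tool happens to handle it.

#include <cstdint>

// Hypothetical fragments of the ambiguous patterns named above.
int ambiguous(int a, int b, int mode, uint32_t x, int sh)
{
    int q = a / b;              // divide by zero: undefined behavior if b can be 0

    int y;                      // note: intentionally not initialized
    switch (mode) {             // incomplete switch-case: if mode is neither 0 nor 1,
        case 0: y = a; break;   // y is never assigned, and the read of y below is
        case 1: y = b; break;   // another uninitialized read
    }

    uint32_t z = x << sh;       // illegal shift: undefined if sh is negative or >= 32

    return q + y + static_cast<int>(z);
}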

 

The good news is that recent progress on formal verification in the C++ domain enables us to perform C++ property checking. A formal property checker can find undesirable code in a fraction of the time it takes to find it in RTL code, and it can pinpoint the location of the offending code for the engineer. Other than the source C or C++ design, no other inputs are necessary for C formal property checker tools; users do not need to spend time writing testbenches to detect these scenarios. By following standard practices and using a formal tool, engineers can establish a robust, error-free flow from C++ to RTL and beyond, which surely boosts the productivity of the chip design process.

 

Written by Uday Das, Senior Field Application Engineer at Calypto Design Systems.

Utilizing P2P Interfaces in SystemC Designs to Verify RTL

Catapult provides designers the ability to explore many different implementations of their C++ or SystemC design in a short period of time. With this ability to rapidly create different implementations, verification becomes a key component in the process.

 

Catapult provides the SCVerify environment,  which allows one to verify the HLS-synthesized RTL code in the same testbench used to verify the C++ or SystemC model.  Therefore, the testbench environment is a critical part of the overall HLS design experience for generating correct-by-construction RTL.

 

There are a few key items to consider when developing a testbench for a SystemC design that will be synthesized to RTL by Catapult.  The testbench must be able to interact with the RTL to verify its functionality and performance.

 

 

1) The testbench should validate the “correctness of the design”:

As the source is modified you need to know that you haven’t changed the functionality of the design.

 

2) The testbench should be able to handle the synchronization, or handshaking, when interfacing with the resultant RTL model:

The RTL may not be able to accept a continuous stream of data; it may need to accept the data in bursts; or it may need large gaps between data samples.  The testbench and RTL will thus need some type of handshaking to properly exchange the data.

 

3) The testbench should not limit the performance of the resulting RTL model:

The testbench should be “elastic” enough to process, accept, and/or provide the data to and from the RTL so that the actual performance is limited by the RTL and not the testbench.

 

A recent project highlighted the need for having proper synchronization and elasticity in the testbench.  With Catapult, a library of Modular IO elements is provided to give the synchronization that is needed.  We refer to this as the P2P I/O.

 

Initial SystemC

The initial SystemC design used signals to connect the testbench to the block that would be synthesized, which was sufficient for the SystemC simulation. Shown below are the sc_module declaration of tx_sys, the portion of the testbench where tx_sys is instantiated, and the sc_signals used to interconnect tx_sys with the rest of the testbench.

 

[Code snippet: tx_sys sc_module declaration, testbench instantiation, and sc_signal interconnect]
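The original screenshot is not reproduced here, so the following sketch gives the flavor of that signal-based connection; the port types and most names (everything except tx_sys and tx_data) are assumptions made for illustration.

#include <systemc.h>

// Hypothetical reconstruction of the signal-based version: plain sc_in/sc_out
// ports on the design and sc_signal interconnect in the testbench, with no
// handshaking of any kind.
SC_MODULE(tx_sys) {
  sc_in<bool>          clk, rst;
  sc_in<sc_uint<8> >   tx_data;            // a new value arrives whenever the testbench writes one
  sc_out<sc_int<16> >  tx_i, tx_q;
  // ... synthesizable body elided ...
  SC_CTOR(tx_sys) {}
};

SC_MODULE(testbench) {
  sc_in<bool>             clk, rst;
  sc_signal<sc_uint<8> >  tx_data_sig;     // interconnect between stimulus and tx_sys
  sc_signal<sc_int<16> >  tx_i_sig, tx_q_sig;
  tx_sys                  dut;

  SC_CTOR(testbench) : dut("dut") {
    dut.clk(clk);        dut.rst(rst);
    dut.tx_data(tx_data_sig);                  // nothing tells the stimulus when tx_sys
    dut.tx_i(tx_i_sig);  dut.tx_q(tx_q_sig);   // is actually ready for a new value
  }
};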

 

However, the use of sc_signals and sc_in/out does not allow for any synchronization to take place. This causes the RTL simulation in SCVerify to fail.  Upon inspection of the waveforms of the tx_data feeding into the tx_sys, we can see that a new input is being produced every clock cycle by the testbench.

 

[Waveform: the testbench drives a new tx_data value into tx_sys every clock cycle]

 

The tx_sys code can process a value on tx_data only every 8 clock cycles.  A snippet of code that reads tx_data shows that once a value is read, each bit of that 8-bit value is then stored in an array (that is mapped to a memory).

 

[Code snippet: reading tx_data and storing each bit of the 8-bit value into an array mapped to a memory]
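That snippet is also a screenshot; a hypothetical equivalent of the behavior it describes, unpacking one accepted 8-bit tx_data value into an array that maps to a memory, is:

#include <systemc.h>

// Hypothetical sketch: each accepted tx_data value is unpacked one bit per cycle
// into an array mapped to a memory, which is why tx_sys can only take a new
// value on tx_data every 8 clock cycles.
static void store_bits(const sc_uint<8>& value, bool bit_mem[8])
{
    unsigned v = value.to_uint();
    for (int b = 0; b < 8; b++)
        bit_mem[b] = ((v >> b) & 1u) != 0;   // one bit stored per iteration/cycle
}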

 

Adding P2P Interfaces for Synchronization

Realizing there was no synchronization, P2P interfaces were introduced into the design and testbench. These interfaces are not difficult to add; basically, all you do is change the port declarations in the sc_module and the “interconnect” between the modules in the testbench.

 

[Code snippet: P2P port declarations in tx_sys and the P2P interconnect in the testbench]
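Again, the actual snippet is a screenshot; the sketch below is schematic only. The p2p<>::in / p2p<>::out spellings, the reset_get/reset_put/get calls, and the header name are assumptions rather than verified Catapult syntax, so check the Catapult SystemC User’s Guide referenced at the end of this post for the exact API.

// Schematic sketch only: what the port-declaration change looks like when moving
// from sc_in/sc_out to the P2P interfaces. The class and method names below are
// assumptions for illustration.
// #include "ccs_p2p.h"   // Catapult Modular IO / P2P header (name per your install)

SC_MODULE(tx_sys) {
  sc_in<bool>               clk, rst;
  p2p<>::in<sc_uint<8> >    tx_data;   // was: sc_in<sc_uint<8> >   tx_data;
  p2p<>::out<sc_int<16> >   tx_i;      // was: sc_out<sc_int<16> >  tx_i;
  p2p<>::out<sc_int<16> >   tx_q;      // was: sc_out<sc_int<16> >  tx_q;

  void main_thread() {
    tx_data.reset_get();  tx_i.reset_put();  tx_q.reset_put();
    wait();
    while (true) {
      sc_uint<8> v = tx_data.get();    // blocking: waits for vld, drives rdy
      // ... process v, eventually tx_i.put(...); tx_q.put(...); ...
    }
  }
  SC_CTOR(tx_sys) {
    SC_CTHREAD(main_thread, clk.pos());
    reset_signal_is(rst, true);
  }
};

On the testbench side, the sc_signals are replaced by the matching P2P channel objects as the interconnect, per the user’s guide.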

 

In the waveforms below, synchronization between the tx_sys and the testbench can be seen through the use of ready (rdy) and valid (vld) signals that are already present.  The “rdy” and “vld” signals provide the handshaking necessary for the testbench to provide the input data stream (tx_data_i_dat) to the RTL model.  This handshaking, or synchronization, will allow the testbench to provide data only when the RTL is able to use it (or consume it). There is a new input on tx_data every nine clocks.

 

[Waveform: rdy/vld handshaking between the testbench and tx_sys; a new tx_data value is accepted every nine clocks]

 

The testbench and design (tx_sys) now appear to be “talking” to each other properly on the input side.  We now shift to the outputs.  The code that generates the tx_i and tx_q outputs is expected to produce an output every clock cycle.  The data for the outputs has been stored in an array (memory), and the code reads these values out to the output ports.

 

[Code snippet: reading the output array and driving the tx_i and tx_q output ports]

 

The waveforms show that these outputs are “stuttering” when they were expected to be available every clock cycle.  This stuttering is a result of the testbench not being able to consume the data as fast as tx_sys can produce it.

 

[Waveform: the tx_i and tx_q outputs stutter instead of producing a sample every clock cycle]

 

Adding P2P FIFOs as an Interconnect

To eliminate the stuttering, FIFOs are used to connect the tx_sys to the testbench to allow for some elasticity in the production and consumption of data.  The code snippet below shows the addition of FIFOs.  The template parameters after the type are:

 

  • Depth: since this is in the testbench, the depth can be quite large.
  • The other two parameters indicate the sense of the clock and reset.

 

[Code snippet: P2P FIFO declarations with type, depth, and clock/reset-sense template parameters]
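Since that snippet is also a screenshot, here is a schematic sketch of the kind of declaration the text describes: the channel type is parameterized by the data type, then the depth, then the clock and reset sense. The class name used here is an assumption for illustration, not verified Catapult syntax; the real declarations are in the P2P header mentioned below.

// Schematic only: testbench-side P2P FIFO channels between tx_sys and the
// testbench. The class name and parameter order are assumptions; what matters
// is <data type, depth, clock sense, reset sense> as described above.
p2p_fifo<sc_int<16>, 512, true, true>  tx_i_chan;   // depth can be generous in a testbench
p2p_fifo<sc_int<16>, 512, true, true>  tx_q_chan;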

 

Looking at the final waveforms after the FIFOs have been added, we see the final performance of the tx_sys RTL model: the testbench is no longer limiting the performance, and tx_sys is able to output a sample every clock.

 

[Waveform: final performance with FIFOs added; tx_sys outputs one sample per clock]

 

Summary

SystemC designs require some effort in creating a testbench that can interface to the resulting, synthesized RTL model.  The addition of P2P interfaces and FIFOs allows the designer to utilize SCVerify and enables the existing testbench to simulate the performance of the original SystemC model as well as the synthesized RTL model.

More information on the Modular IO and P2P interfaces can be found in the Catapult C Synthesis SystemC User’s Guide and the ccs_psp.h header file available in the Catapult installation directory ($MGC_HOME/shared/include/ccs_p2p directory).

 

Written by Wayne Marking, Field Applications Engineer at Calypto Design Systems.

Designing Large Ratio Sample Rate Converters

In my travels as a Field Application Engineer I often see very similar engineering challenges being addressed at many different customers. While the applications may be different, the need to efficiently implement high quality solutions is constant.

 

One particular sphere of signal processing I have always had an interest in is implementing efficient decimation and interpolation. Understanding how hardware blocks and their architectural implementations can be combined to change sample rates between elements is sometimes fascinating and rewarding. It also leads to nice demos with pretty waveforms that confirm you’re on the right track and that the design actually does something!

 

Decimation and interpolation, whether of a fixed or variable nature, are fairly straightforward in principle. However, one challenge I have consistently seen is where large ratios of decimation and interpolation are called for, such that the overall rate change is small, typically less than 2.

 

For example, think of converting a 48.0 kHz Digital Audio Tape sample to a 44.1 kHz Compact Disc recording. The lowest common sample rate that fits both rates is 7.056 MHz. So to get from 48 kHz to 44.1 kHz we would need to interpolate by 147 and then decimate by 160.
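The 147 and 160 factors fall straight out of the arithmetic; a few lines of C++ make the relationship explicit.

#include <cstdio>
#include <numeric>   // std::gcd (C++17)

int main()
{
    long fs_in = 48000, fs_out = 44100;
    long g      = std::gcd(fs_in, fs_out);   // 300
    long up     = fs_out / g;                // interpolate by 147
    long down   = fs_in  / g;                // decimate by 160
    long common = fs_in * up;                // 7,056,000 Hz = 7.056 MHz
    std::printf("interpolate by %ld, decimate by %ld, via %ld Hz\n", up, down, common);
    return 0;
}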

 

Theory might dictate that if we limit ourselves to decimation or interpolation by no more than 8x (in order to avoid very large filters with sharp low-pass cut-off), then in order to convert completely accurately, we need to interpolate in multiple stages by 147x and then decimate in multiple stages by 160x.

 

Perhaps we might choose to interpolate by 7x, then 7x, then 3x to get to 7.056 MHz and then decimate by 8x, 5x, and 4x to get to 44.1 kHz. We could also interleave the interpolation and decimation operations in a different order, such that we never go below the 44.1 kHz output sample rate. That might enable us to keep the core processing frequency lower, but we still have the complexity of multiple filter blocks. The combinations and choices of implementation are pretty much endless.

 

So, while those large ratios of 147/160 give us an exact frequency match, which is important in-system, that is potentially going to be a lot of hardware and room for accumulated errors.

 

Such large ratios become untenable when working with higher sample frequencies. If we wish to convert from 48 MHz to 44.1 MHz, the ability to interpolate a continuous signal by a large ratio may be completely unworkable. So, as engineers, we need to compromise and implement different architectures and algorithmic solutions that are elegant and efficient.

 

One approach is to use linear approximation between two input sample points. It’s very cheap to implement in hardware, requiring just a few counters, some arithmetic, and three multiplications (two of which are by constants) to do the scaling. It works pretty well for slow-moving signals, but as the frequency content of the sampled signal increases, the signal-to-noise ratio gets worse. Also, the lack of any filtering of the output signal will produce very poor-quality results when folding at the Nyquist frequency occurs.
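As a rough software model of that approach (my own illustration, not the appnote design), a fractional position is stepped through the input at the ratio of the two sample rates and the two neighboring samples are blended; a hardware version would use counters and fixed-point arithmetic rather than doubles.

#include <cstddef>
#include <vector>

// Rough model of linear-approximation rate conversion: blend the two input
// samples that straddle each output instant. Quality degrades as the frequency
// content of the input rises, as discussed above.
std::vector<double> resample_linear(const std::vector<double>& in,
                                    double fs_in, double fs_out)
{
    std::vector<double> out;
    const double step = fs_in / fs_out;   // input samples advanced per output sample
    double pos = 0.0;
    while (pos + 1.0 < in.size()) {
        std::size_t i = static_cast<std::size_t>(pos);
        double frac = pos - i;
        out.push_back(in[i] + frac * (in[i + 1] - in[i]));   // two-point blend
        pos += step;
    }
    return out;
}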

 

Another approach is to use a mix of “virtual” interpolation, where points are interpolated only when needed, with linear interpolation applied between two interpolated points. The amount of interpolation needed can be modest in many cases for communication signals. However, it may not be so good for lower-sample-rate, high-bit-precision audio applications. The interpolation filtering will eliminate the unneeded high-frequency components from the source and provide what looks like a slower-moving signal.
There’s some interesting work on hardware implementations of B-Spline interpolation for rate conversion that I might pursue in my admittedly limited spare time. http://www.ife.ee.ethz.ch/people/kroman/src makes an interesting read.

 

However, I scribbled up an example design and application note detailing the design of a 37/50 rate adaptor that can be applied to any large-ratio rate conversion. The appnote and source code example, with a simple testbench, can be downloaded here: Modeling and Synthesizing Large Ratio Rate Adapters. (In order to download it you must be registered on our site with “Support Access” level.) Perhaps it will help you in your next design.

 

Written by Stuart Clubb, Senior Field Applications Engineer at Calypto Design Systems.