Apr 16, 2020 Design of 4×4-Bit Multiplier VHDL Code. End entity MultiplierVHDL; architecture Behavioral of MultiplierVHDL is begin Result. 006b59bca7 Deschamps/Sutter/Canto Guide to FPGA Implementation of.VHDL Codes of Guide to FPGA Implementation of Algorithms. Both implementation uses the multiplier and squarer.16 bit serial multiplier - EmbDev.netHello Im trying to write the behavioral code for a serial 16 bit multiplier. (A,B: IN stdlogicvector (15 downto 0. Hi, I am fairly new to VHDL. I have a project for school where I need to multiply constants and send the result to an output. Before I write the in depth code I wanted to verify the VHDL multiply operator with a simple example. I assign two constants integer values and then multiply these two.
This example illustrates how to generate HDL code for a symmetrical FIR filter with fully parallel, fully serial, partly serial and cascade-serial architectures for a lowpass filter for an audio filtering application.
Design the Filter
Use an audio sampling rate of 44.1 kHz and a passband edge frequency of 8.0 kHz. Set the allowable peak-to-peak passband ripple to 1 dB and the stopband attenuation to -90 dB. Then, design the filter using fdesign.lowpass, and create the FIR filter System object using the 'equiripple' method with the 'Direct form symmetric' structure.
Quantize the Filter
Assume that the input for the audio filter comes from a 12 bit ADC and output is a 12 bit DAC.
Generate Fully Parallel HDL Code from the Quantized Filter
Starting with the correctly quantized filter, generate VHDL or Verilog code. Create a temporary work directory. After generating the HDL code (selecting VHDL in this case), open the generated VHDL file in the editor by clicking on hyperlink displayed in the command line display messages.
This is the default case and generates a fully parallel architecture. There is a dedicated multiplier for each filter tap in direct form FIR filter structure and one for every two symmetric taps in symmetric FIR structure. This results in a lot of chip area (78 multipliers, in this example). You can implement the filter in a variety of serial architectures to obtain the desired speed/area trade-off. These are illustrated in further sections of this example.
Generate a Test Bench from the Quantized Filter
Generate a VHDL test bench to make sure that the result matches the response you see in MATLAB® exactly. The generated VHDL code and VHDL testbench can be compiled and simulated using a simulator.
Generate DTMF tones to be used as test stimulus for the filter. A DTMF signal consists of the sum of two sinusoids - or tones - with frequencies taken from two mutually exclusive groups. Each pair of tones contains one frequency of the low group (697 Hz, 770 Hz, 852 Hz, 941 Hz) and one frequency of the high group (1209 Hz, 1336 Hz, 1477Hz) and represents a unique symbol. You will generate all the DTMF signals but use one of them (digit 1 here) for test stimulus. This will keep the length of test stimulus to reasonable limit.
Next, let's generate the DTMF tones
Information Regarding Serial Architectures
Serial Multiplier Vhdl Code For Windows 7
Serial architectures present a variety of ways to share the hardware resources at the expense of increasing the clock rate with respect to the sample rate. In FIR filters, we will share the multipliers between the inputs of each serial partition. This will have an effect of increasing the clock rate by a factor known as folding factor.
You can use hdlfilterserialinfo function to get information regarding various filter lengths based on the value of coefficients. This function also displays an exhaustive table of possible options to specify SerialPartition property with corresponding values of folding factor and number of multipliers.
You can use the optional properties 'Multipliers' and 'FoldingFactor' to display the specific information.
Fully Serial Architecture
In fully serial architecture, instead of having a dedicated multiplier for each tap, the input sample for each tap is selected serially and is multiplied with the corresponding coefficient. For symmetric (and antisymmetrical) structures the input samples corresponding to each set of symmetric taps are preadded (for symmetric) or pre-subtracted (for anti-symmetric) before multiplication with the corresponding coefficients. The product is accumulated sequentially using a register and the final result is stored in a register before the next set of input samples arrive. This implementation needs a clock rate that is as many times faster than input sample rate as the number of products to be computed. This results in reducing the required chip area as the implementation involves just one multiplier with a few additional logic elements like multiplexers and registers. The clock rate will be 78 times the input sample rate (foldingfactor of 78) equal to 3.4398 MHz for this example.
4 Bit Multiplier Vhdl
To implement fully serial architecture, use hdlfilterserialinfo function and set its 'Multipliers' property to 1. You can also set the 'SerialPartition' property with its value equal to the effective filter length, which in this case is 78. The function also returns the folding factor and number of multipliers used for that serial partition setting.
Generate the testbench the same way, as in the fully parallel case. It is important to generate a testbench again for each architecture implementation.
Partly Serial Architecture
Fully parallel and fully serial represent two extremes of implementations. While Fully serial is very low area, it inherently needs a faster clock rate to operate. Fully parallel takes a lot of chip area but has very good performance. Partly serial architecture covers all the cases that lie between these two extremes.
The input taps are divided into sets. Each set is processed in parallel by a serial partition consisting of multiply accumulate and a multiplexer. Here, a set of serial partitions process a given set of taps. These serial partitions operate in parallel with respect to each other but process each tap sequentially to accumulate the result corresponding to the taps served. Finally, the result of each serial partition is added together using adders.
Partly Serial Architecture for Resource Constraint
Let us assume that you want to implement this filter on an FPGA which has only 4 multipliers available for the filter. You can implement the filter using 4 serial partitions, each using one multiply accumulate circuit.
The input taps that are processed by these serial partitions will be [20 20 20 18]. You will specify SerialPartition with this vector indicating the decomposition of taps for serial partitions. The clock rate is determined by the largest element of this vector. In this case the clock rate will be 20 times the input sample rate, 0.882 MHz.
Partly Serial Architecture for Speed Constraint
Assume that you have a constraint on the clock rate for filter implementation and the maximum clock frequency is 2 MHz. This means that the clock rate can't be more than 45 times the input sample rate. For such a design constraint, the 'SerialPartition' should be specified with [45 33]. Note that this results in an additional serial partition hardware, implying additional circuitry to multiply-accumulate 33 taps. You can specify SerialPartition using hdlfilterserialinfo and its property 'Foldingfactor' as follows.
In general, you can specify any arbitrary decomposition of taps for serial partitions depending on other constraints. The only requirement is that the sum of elements of the vector should be equal the effective filter length.
Cascade-Serial Architecture
The accumulators in serial partitions can be re-used to add the result of the next serial partition. This is possible if the number of taps being processed by one serial partition must be more than that by serial partition next to it by at least 1. The advantage of this technique is that the set of adders required to add the result of all serial partitions are removed. However, this increases the clock rate by 1, as an additional clock cycle is required to complete the additional accumulation step.
Cascade-Serial architecture can be specified using the property 'ReuseAccum'. This can be done in two ways.
Add 'ReuseAccum' to generatehdl method and specify it as 'on'. Note that the value specified for 'SerialPartition' property has to be such that the accumulator reuse is feasible. The elements of the vector must be in descending order except for the last two which can be same.
If the property 'SerialPartition' is not specified and 'ReuseAccum' is specified as 'on', the decomposition of taps for serial partitions is determined internally. This is done to minimize the clock rate and to reuse the accumulator. For this audio filter, it is [12 11 10 9 8 7 6 5 4 3 3]. Note that it uses 11 serial partitions, implying 11 multiply accumulate circuits. The clock rate will be 13 times the input sample rate, 573.3 kHz.
Optimal decomposition into as many serial partitions required for minimum clock rate possible for reusing accumulator.
Conclusion
You designed a lowpass direct form symmetric FIR filter to meet the given specification. You then quantized and checked your design. You generated VHDL code for fully parallel, fully serial, partly serial and cascade-serial architectures. You generated a VHDL test bench using a DTMF tone for one of the architectures.
You can use an HDL Simulator to verify the generated HDL code for different serial architectures. You can use a synthesis tool to compare the area and speed of these architectures. You can also experiment with and generating Verilog code and test benches.
In the modern FPGA, the multiplication operation is implemented using a dedicated hardware resource. Such dedicated hardware resource generally implements 18×18 multiply and accumulate function that can be used for efficient implementation of complex DSP algorithms such as finite impulse response (FIR) filters, infinite impulse response (IIR) filters, and fast fourier transform (FFT) for filtering and image processing applications etc.
The Multiplier-Accumulator blocks has a built-in multiplier and adder, which minimizes the fabric logic required to implement multiplication, multiply-add, and multiply-accumulate (MACC) functions. Implementation of these arithmetic functions results in efficient resource usage and improved performance for DSP applications. In addition to the basic MACC function, DSP algorithms typically need small amounts of RAM for coefficients and larger RAMs for data storage.
The main features multiply-accumulate block are mainly:
- Supports 18 × 18 signed multiplication natively.
- Supports 17 × 17 unsigned multiplications.
- Built-in addition, subtraction, and accumulation units to combine multiplication results efficiently.
- Independent third input C with data width completely registered.
- Single-bit input, CARRYIN, from fabric routing.
Of course, these features depend on the technology we are using. Above are described the features that are mainly implemented on every modern FPGA such Altera Stratix, Xilinx Virtex, Microsemi Smartfusion and so on. Each of these technologies customize the implementation adding different features depending on the type of FPGA you are using.
You can, of course, instantiate multiply-accumulate macros using the core generator of the vendor as follow:
Generate the VHDL/Verilog code for the component
Instantiate the component in your VHDL/Verilog code.
Another solution to use the multiplier hardware macro is to activate the instantiation writing the proper VHDL/Verilog code that triggers the synthesizer the instantiation of hardware macro.
Here below you can find the template for NxM signed multiplier. If you want to trigger the hardware macro in an efficient way N and M shall be less than the maximum number of bit handled by the hardware multiplier macro. For instance, if the hardware multiplier can work up to 18×18, you can set N and M to be less or equal to 18 bit. Using more than 18 bit the implementation will use more than a multiplier block slowing down the timing performances.
When you multiply two number with N and M bit, the result will be of N+M bit. For instance, if the first multiplicand has 14 bit and the second multiplicand has 13 bit, the result of the multiplication will have 13+14 = 27 bit.
The example below will implement a 13 x 14 bit signed multiplier.
VHDL Code for 13×14 signed multiplier
To understand what are the hardware resources relative to the 13×14 multiplier implementation let’s layout the code on a Cyclone IV FPGA EP4CGX30. This FPGA has 80 multiplier 18×18 that can be used as 160 multiplier 9×9. The synthesizer maps the best hardware macro implementation for your design.
Figure 1 reports the hardware resources available for the different version of Cyclone IV FPGA. The multiplier has been realized using 2 embedded 9×9 multiplier, as reported on Area report of Figure 2. In the timing report, the maximum clock frequency is 286 MHz.
In Figure 3 is reported the Area and Timing report for a 16×16 multiplier. As clear the hardware resource for the multipliers are the same of the 13×14 multiplier since the VHDL code trigger the same multiplier hardware macro 18×18.
Implementation of pipelined 16×16 multiplier
Taking as reference the previous example, now we are going to implement the pipelined version of the 16×16 multiplier. Let’s refresh the multiplication algorithm: a 16 bit operand can be written as follow:
So, if we want to multiply two 16-bit operands:
The hardware structure for such implementation is reported in Figure 4:
A possible VHDL implementation of Figure 4 is reported below.
VHDL code for a 16×16 multiplier
Figure 5 reports the area and timing report of VHDL code implementation on Altera Cyclone IV used before. In this case, we are using 4 multipliers 9×9 instead of two and both logic and register. The timing report is quite half the previous and the adder and register are implemented using logic. In the previous example reported in Figure 3 the register and logic was zero since the VHDL code of the multiplier was mapped in the hardware multiplier resource present into the FPGA.
Starting from the architecture of Figure 4 the pipeline implementation is straightforward. To implement the pipeline into the multiplier architecture, we need to introduce registers to break the combinatorial path of multiplication and addition and compensate the delay added by these registers.
A possible implementation is given in Figure 6. Here the latency of the multiplier is 6 clock cycles.
Starting from the architecture of the pipelined multiplier of Figure 6 a possible VHDL implementation of a signed pipeline multiplier is:
VHDL code for 16×16 pipeline multiplier
The implementation of the VHDL code of the pipeline multiplier reports the timing characteristic like the implementation of Figure 3 while the area is greater than the implementation of Figure 3.
We can conclude that, for FPGA implementation, if we are using the hardware macro implementation for a multiplier, generally this is the best solution. In ASIC implementation, the pipeline approach can give good result since the same implementation on FPGA report timing performance similar to the optimal implementation of the hardware macro inside the FPGA. In the next section, we will discover a condition where the pipeline multiplier can be used in FPGA too.
Take advantage of pipeline multiplier
In the previous section, we learned how to pipeline a multiplier. The VHDL of the pipeline multiplier has been evaluated on a Cyclone IV FPGA. As clear from area and timing report there are no significant advantages in writing the pipelined version of a multiplier since we are using the basic multiplier macro primitive. As clear from Figure 3 and Figure 7 the timing is similar while the area is slightly greater than the version without a pipeline. Now we are going to see an example where the pipeline multiplier has better performances than the no pipelined one. Suppose you need a multiplier 35×35. The FPGA we are using supports natively 18×18, so you need to use 2 multiplier 18×18 in order to implement a multiplier 35×35.
The VHDL code for a 34×34 multiplier is:
VHDL code for 34×34 multiplier
Using a Cyclone IV FPGA the area and timing report are in Figure 8
In this case, if we want to pipeline the multiply, we can re-write the multiply as:
Following the same rules of the previous section the hardware architecture of the multiplier will be:
The VHDL code for the 34×34 pipelined multiplier can be written following the architecture of Figure 9:
VHDL code for 34×34 pipeline multiplier
Mapping this VHDL code on the Cyclone IV used in the no-pipeline example the results are:
As clear in this case, the pipeline implementation gives an important speed-up in terms of timing, doubling the performances. Conclusion In this post, we considered the VHDL implementation of a multiplier. If we are using an FPGA we should check if the silicon provides the hardware macro for the multiplier. If case the multiplier are present into FPGA as dedicated silicon, we can use them directly using the “*” operator. We must pay attention to the maximum width in
We must pay attention to the maximum width in bit for the operand, if we use operand with the number of the bit less than or equal to the maximum number provided by the hardware macro the VHDL code will be mapped directly into the hardware multiplier as seen in the previous example. In case we need more bit than the maximum number provided we could have to use a pipeline approach. The examples provided above with the Cyclone IV FPGA, but the same consideration can be extended to the others technologies, show that since the number of bit of the operand is less than or equal to 18, no benefit is present in pipeline implementation as summarized in Figure 11.
In the case of multiplier 35×35 the pipeline implementation has a great impact on performances in terms of speed as summarized. In this case is clear that despite an increment of register due to the pipeline, the timing performances increment is 2X.
If you appreciated this post, please help us to share it to your friend.
If you need to contact us, please write to: [email protected]
We appreciate any of your comment, please post below: