Thesis Phase 1 report
ABSTRACT
The paper , proposes a new architecture of multiplier-and-accumulator (MAC) for high-speed arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance can be improved. Accumulator that has the largest delay in MAC is merged into CSA for improving performance . The proposed CSA tree uses 1s-complement-based radix-2 modified Booths algorithm (MBA) and has the modified array for the sign extension in order to increase the bit density of the operands. The CSA propagates the carries to the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of the input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the type of sum and carry bits instead of the output of the final adder, which make it possible to optimize the pipeline scheme to improve the performance. expect that the proposed MAC can be adapted to various fields requiring high performance such as the signal processing areas.
Thesis Phase 1 report
INTRODUCTION
With the recent rapid advances in multimedia and communication systems, real-time signal processings like audio signal processing, video/image processing, or large-capacity data processing are increasingly being demanded. The multiplier and multiplier-and-accumulator (MAC) are the essential elements of the digital signal processing such as filtering, convolution, and inner products. Most digital signal processing methods use nonlinear functions such as discrete cosine transform (DCT) or discrete wavelet transform (DWT) . Because they are basically accomplished by repetitive application of multiplication and addition, the speed of the multiplication and addition arithmetics determines the execution speed. and performance of the entire calculation. Because the multiplier requires the longest delay among the basic operational blocks in digital system, the critical path is determined by the multiplier, in general. For high-speed multiplication, the modified radix-4 Booths algorithm (MBA) is commonly used.In general, a multiplier uses Booths algorithm and array of full adders (FAs), or Wallace tree instead of the array of FAs., i.e., this multiplier mainly consists of the three parts:Booth encoder, a tree to compress the partial products such asWallace tree, and final adder .The most effective way to increase the speed of a multiplier is to reduce the number of the partial products because multiplication proceeds a series of additions for the partial products. To reduce the number of calculation steps for the partial products, MBA algorithm has been applied mostly where Wallace tree has taken the role of increasing the speed to add the partial products. To increase the speed of the MBA algorithm, many parallel multiplication architectures have been researched . Among them, the architectures based on the BaughWooley algorithm(BWA) have been developed and they have been applied to various digital filtering calculations . One of the most advanced types of MAC for general-purpose digital signal processing has been proposed by Elguibaly . It is an architecture in which accumulation has been combined with the carry save adder (CSA) tree that compresses partial products. In the architecture proposed by Elguibaly , the critical path was reduced by eliminating the adder for accumulation and decreasing the number of input bits in the final adder. While it has a better performance because of the reduced critical path compared to the previous MAC architectures, there is a need to improve the output rate due to the use of the final adder results for accumulation. An architecture to merge the adder block to the accumulator register in the MAC operator was proposed in to provide the possibility of using two separate 2-bit adders instead of one -bit adder to accumulate the bit MAC results
Thesis Phase 1 report
.GENERAL MAC STRUCTURE In this section, we discuss basic MAC operation. Basically, multiplier operation can be divided into three operational steps. The first one is booth encoding to generate the partial products. The second one is adder array or partial product compression and the last one is final addition in which final multiplication result is produced . If the multiplication process is extended to accumulate the multiplied result, then MAC consists of four steps. General hardware architecture for MAC is shown in Figure 1. It executes the multiplication operation by multiplying input multiplier X and input multiplicand Y. After that current multiplication result is added to the previous multiplication result Z as accumulation step
Thesis Phase 1 report
Derivation of MAC Arithmetic
If an operation to multiply two bit numbers and accumulate into a 2 -bit number is considered, the critical path is determined by the 2 -bit accumulation operation. If a pipeline scheme is applied for each step in the standard design of Fig. 1, the delay of the last accumulator must be reduced in order to improve the performance of the MAC. The overall performance of the proposed MAC is improved by eliminating the accumulator itself by combining it with the CSA function. If the accumulator has been eliminated, the critical path is then determined by the final adder in the multiplier. The basic method to improve the performance of the final adder is to decrease the number of input bits. In order to reduce this number of input bits, the multiple partial products are compressed into a sum and a carry by CSA. The number of bits of sums and carries to be transferred to the final adder is reduced by adding the lower bits of sums and carries in advance within the range in which the overall performance will not be degraded. A 2-bit CLA is used to add the lower bits in the CSA. In addition, to increase the output rate when pipelining is applied, the sums and carrys from the CSA are accumulated instead of the outputs from the final adder in the manner that the sum and carry from the CSA in the previous cycle are inputted to CSA. Due to this feedback of both sum and carry, the number of inputs to CSA increases, compared to the standard design and . In order to efficiently solve the increase in the amount of data, a CSA architecture is modified to treat the sign bit
Thesis Phase 1 report
The hardware architecture of the MAC to satisfy the process in Fig. 3 is shown in Fig. 4. The -bitMAC inputs, and , are converted into an -bit partial product by passing through the Booth encoder. In the CSA and accumulator, accumulation is carried out along with the addition of the partial products. As a result, -bit , and (the result from adding the lower bits of the sum and carry) are generated. These three values are fed back and used for the next accumulation. If the final result for the MAC is needed, is generated by adding and in the final adder and combined with that was already generated.
Booths Algorithm
In unsigned multiplication there is no need to take the sign of the number into consideration. However in signed multiplication the same process cannot be applied because the signed number is in a 2s compliment form which would yield an incorrect result if multiplied in a similar fashion to unsigned multiplication. Thats where Booths algorithm comes in. Booths algorithm preserves the sign of the result. The modified Booths algorithm based on a radix-4, generally called Booth-2 is the most popular approachfor implementing fast multipliers using parallel encoding . It uses a digit set {0, 1, 2} to reduce the number of the partial products to n= [(n+1)/ 2]. Radix- 4 encoding start by appending a zero to the right of multiplier LSB. Triplets are taken beginning at position x 1 and continuing to the MSB with one bit overlapping between adjacent tripletsThis recoding scheme applied to a parallel multiplier halves the number of partial products so the multiplication time and the hardware requirements decrease. Radix-8 recoding applies the same algorithm as radix-4, but now in this we take quartets of bits instead of triplets. The Booth-3 scheme is based on a radix-8 encoding to reduce this number to n = [(n+1)/3].
Radix 2 based booth algoritm
For each multiplier bit, also examine bit to its right 00: middle of a run of 0s, do nothing 10: beginning of a run of 1s, subtract multiplicand 11: middle of a run of 1s, do nothing 01: end of a run of 1s, add multiplicand
Thesis Phase 1 report
Example based on radix 2 algorithm:
43 = 00000101011 * 12 = 00000001100 0 = 00000000000 // multiplier bits 00 + 0 = 00000000000 // multiplier bits 00 - 172 = 11101010100 // multiplier bits 10 + 0 = 00000000000 // multiplier bits 11 + 688 = 01010110000 // multiplier bits 01 516 = 01000000100
Radix 4 based booth algorithm
000: middle of run of 0s, do nothing 100: beginning of run of 1s, subtract multiplicand twice 010: singleton 1, add multiplicand once 110: beginning of run of 1s, subtract multiplicand once 001: end of run of 1s, add multiplicand once 101: end of run of 1s, beginning of another, subtract multiplicand once 011: end of a run of 1s, add multiplicand twice 111: middle of run of 1s, do nothing Example based on radix 4 booth algorithm: 43 = 00000101011 * 12 = 00000001100 0 = 00000000000 // multiplier bits 000 - 172 = 11101010100 // multiplier bits 110 + 688 = 01010110000 // multiplier bits 001 516 = 01000000100
Thesis Phase 1 report
WORK DONE SO FAR
Coding for the first section consisting of booth encoder was performed inverilog.Both radix2 and radix - 4 based booth algorithm were programed.Partial products were generated by considering two 8 bit signed numbers.