# Introduction

The discrete wavelet transform (DWT) decomposes an image into multiple subbands of low- and high-frequency components. Encoding these subband components compresses the image: the DWT combined with an encoding technique represents the image information with fewer bits. Image compression finds application in many disciplines, including the entertainment, medical, defense, commercial and industrial domains. The core of an image compression unit is the DWT. Other image processing techniques such as image enhancement, image restoration and image filtering also require the DWT and the inverse DWT (IDWT) for transformations. DWT-IDWT is one of the prominent transformation techniques widely used in signal processing and communication applications. It transforms a signal into multiple-resolution subbands [1][2][3][4][5].

The DWT is computationally intensive and consumes considerable power because of the large number of mathematical operations. Latency and throughput are other major limitations, as there are multiple levels of hierarchy [6][7][8]. The DWT has traditionally been implemented by convolution. Digit-serial or parallel representation of the input data further determines the architecture complexity. Such an implementation demands a large number of computations and a large amount of storage, neither of which is desirable for high-speed or low-power applications. More recently, a lifting-based scheme that often requires far fewer computations has been proposed for the DWT. The main feature of the lifting-based DWT scheme is to break up the high-pass and low-pass filters into a sequence of upper and lower triangular matrices and convert the filter implementation into banded matrix multiplications.

Since the DWT requires intensive computations, several architectural solutions using special-purpose parallel processors have been proposed to meet the real-time requirements of many applications. The solutions include parallel filter architecture, SIMD linear array architecture, SIMD multigrid architecture, 2-D block-based architecture, and AWARE's wavelet transform processor (WTP) [9][10][11]. Several versions of lifting scheme architecture have been compared and reported in the literature. In terms of hardware complexity, the folded architecture in [12] is the simplest and the DSP-based architecture in [13] is the most complex. All other architectures have comparable hardware complexity and differ primarily in the number of registers and the multiplexer circuitry. The control complexity of the architecture in [14] is very simple. In contrast, the number of switches, multiplexers and control signals used in the architecture of [15] is quite large. The control complexity of the remaining architectures is moderate. In terms of timing performance, the architectures in [12], [14], [16][17][18] are all pipelined, with the architecture in [17] having the highest throughput (1/Tm). The architecture in [19] needs fewer cycles since it is RPA based, but its clock period is longer. The architecture in [17] has the lowest computation delay.

In this paper, we propose, design, model, implement and compare the performance of three different DWT architectures. Section II briefly discusses the lifting scheme DWT algorithm for image processing, Section III discusses the modified lifting-based DWT, and Section IV presents the FPGA implementation and compares the results of the modified lifting algorithm. The conclusion is presented in Section VI.


# II. DWT

The influx of sophisticated technologies in the field of image processing is affiliated with that of the computer arena. The problem statement in the present section deals with the design of the modified two-level DWT architecture for decomposition. The Discrete Wavelet Transform (DWT), which is based on sub-band coding, is found to yield a fast computation of the wavelet transform. It is easy to implement and reduces the computation time and resources required. In the DWT, a time-scale representation of the digital signal is obtained using digital filtering techniques [6]. The signal to be analyzed is passed through filters with different cut-off frequencies at different scales, as shown in Figure 2.

Figure 2 : Two-level DWT decomposition [6]
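The two-level decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: simple Haar averaging and differencing stands in for the actual analysis filters, and the function names `haar_step` and `two_level_dwt` are hypothetical, not part of the paper's design. The point it shows is structural: only the low-pass branch is decomposed again at the second level.

```python
def haar_step(signal):
    """One analysis level: return (approximation, detail) at half the rate.

    Haar averaging/differencing is an assumed stand-in for the real filters.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx = [(signal[k] + signal[k + 1]) / 2 for k in range(0, len(signal), 2)]
    detail = [(signal[k] - signal[k + 1]) / 2 for k in range(0, len(signal), 2)]
    return approx, detail


def two_level_dwt(signal):
    """Return (level-2 approximation, level-2 detail, level-1 detail)."""
    a1, d1 = haar_step(signal)
    a2, d2 = haar_step(a1)  # only the low-pass branch is split a second time
    return a2, d2, d1


a2, d2, d1 = two_level_dwt([4, 6, 10, 12, 8, 6, 5, 5])
print(a2, d2, d1)
```

The input length must be a multiple of four so that both levels can halve the data.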


# Lifting Scheme:

The Lifting Scheme is a well-known method for constructing bi-orthogonal wavelets. The main difference from the classical construction is that it does not rely on the Fourier transform. The lifting scheme is an efficient implementation of a wavelet transform algorithm. It was primarily developed as a method to improve the wavelet transform, and was then extended to a generic method for creating so-called second-generation wavelets, which are much more flexible and powerful than first-generation wavelets. The lifting scheme is an implementation of the filtering operations at each level [6]. Figure 3 represents the classical and lifting-based implementations of the DWT.

* SPLIT: In this step, the data is divided into ODD and EVEN elements.
* PREDICT: The PREDICT step uses a function that approximates the data set. The differences between the approximation and the actual data replace the odd elements of the data set. The even elements are left unchanged and become the input for the next step in the transform. The PREDICT step, where the odd value is "predicted" from the even value, is described by the equation [6]:

$$\mathrm{odd}_{j+1,i} = \mathrm{odd}_{j,i} - P(\mathrm{even}_{j,i})$$

* UPDATE: The UPDATE step replaces the even elements with an average. This results in a smoother input for the next step of the wavelet transform. The odd elements also represent an approximation of the original data set, which allows filters to be constructed. The UPDATE phase follows the PREDICT phase. The original values of the odd elements have been overwritten by the difference between the odd element and its even "predictor", so in calculating an average the UPDATE phase must operate on the differences that are stored in the odd elements [6]:

$$\mathrm{even}_{j+1,i} = \mathrm{even}_{j,i} + U(\mathrm{odd}_{j+1,i})$$

The equations for the lifting-based implementation of the bi-orthogonal wavelet are:

$$
\begin{aligned}
\text{Predict P1:}\quad d_i^{1} &= \alpha\,(x_{2i} + x_{2i+2}) + x_{2i+1} \\
\text{Update U1:}\quad a_i^{1} &= \beta\,(d_i^{1} + d_{i-1}^{1}) + x_{2i} \\
\text{Predict P2:}\quad d_i^{2} &= \gamma\,(a_i^{1} + a_{i+1}^{1}) + d_i^{1} \\
\text{Update U2:}\quad a_i^{2} &= \delta\,(d_i^{2} + d_{i-1}^{2}) + a_i^{1}
\end{aligned}
$$
Figure 4 shows the lifting scheme architecture that realises the equations above. The input data x is first split into even and odd samples, and each sample is taken through the predict and update stages of the architecture. As the data moves from the first stage to the last, data switching occurs at the input and output of every stage. Every stage consists of multipliers and adders. For the given set of predict and update stages, the equations can be finalized by assuming i = 0. The resulting integers are the values of the underlined coefficients in the equations. From the equations it is observed that common lifting coefficients, together with the input terms, are used to compute the ai and di coefficients. The architecture realised from these equations with constant coefficients is shown in Figure 5.
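The predict and update stages can be sketched in software as follows. This is a minimal sketch under assumptions that the paper does not state: the coefficient magnitudes are the ones listed for the JPEG 2000 filter, the negative signs on `ALPHA` and `BETA` follow the common 9/7 lifting convention, the final scaling (approximation multiplied by `ZETA`, detail divided by it) is inferred from the modified equations, and the signal edges use symmetric extension. The function name `lifting_97` is hypothetical.

```python
# Lifting coefficients; the signs of ALPHA and BETA are an assumption
# (common 9/7 convention), the magnitudes are those quoted in the paper.
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def lifting_97(x):
    """Forward lifting on an even-length sequence; returns (approx, detail)."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("input length must be even and at least 2")
    half = n // 2
    even = [float(v) for v in x[0::2]]  # SPLIT: even-indexed samples
    odd = [float(v) for v in x[1::2]]   # SPLIT: odd-indexed samples

    # Out-of-range neighbours are mirrored (assumed symmetric extension).
    nxt = lambda i: min(i + 1, half - 1)
    prv = lambda i: max(i - 1, 0)

    d1 = [ALPHA * (even[i] + even[nxt(i)]) + odd[i] for i in range(half)]  # P1
    a1 = [BETA * (d1[i] + d1[prv(i)]) + even[i] for i in range(half)]      # U1
    d2 = [GAMMA * (a1[i] + a1[nxt(i)]) + d1[i] for i in range(half)]       # P2
    a2 = [DELTA * (d2[i] + d2[prv(i)]) + a1[i] for i in range(half)]       # U2

    return [ZETA * v for v in a2], [v / ZETA for v in d2]


approx, detail = lifting_97([10, 10, 10, 10, 10, 10, 10, 10])
print(approx, detail)
```

With these assumed signs, a constant input should give detail coefficients that are numerically close to zero, which is a quick sanity check that the high-pass branch rejects the DC component.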


# III. Modified Lifting Scheme
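Substituting i = 0 into the predict and update equations and collecting terms gives the modified equations:

$$
\begin{aligned}
a_i &= (3\zeta\beta\gamma\delta + \zeta\beta + \zeta\delta)\,\bigl[\alpha(x_0 + x_2) + x_1 + \alpha(x_0 + x_{-2}) + x_{-1}\bigr] \\
&\quad + \zeta\beta\gamma\delta\,\bigl[\alpha(x_2 + x_4) + x_3 + \alpha(x_{-2} + x_{-4}) + x_{-3}\bigr] \\
&\quad + \zeta\gamma\delta\,(x_0 + x_2 + x_0 + x_{-2}) + \zeta x_0 \\
d_i &= \frac{1}{\zeta}\Bigl[(2\beta\gamma + 1)\bigl\{\alpha(x_0 + x_2) + x_1\bigr\} + \beta\gamma\bigl\{\alpha(x_0 + x_{-2}) + x_{-1} + \alpha(x_2 + x_4) + x_3\bigr\} + \gamma(x_0 + x_2)\Bigr]
\end{aligned}
$$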
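As a cross-check of the modified equations, the sketch below evaluates the four cascaded lifting steps at i = 0 on nine samples x(-4) to x(4) and compares the result with the single-step closed form. It is an illustration only: the function names `cascade` and `closed_form` are hypothetical, the signs of `ALPHA` and `BETA` are an assumption (the algebraic identity itself holds for any coefficient values), and the scaling by `ZETA` follows the closed-form expressions.

```python
ALPHA, BETA, GAMMA, DELTA, ZETA = (
    -1.586134342, -0.05298011854, 0.8829110762, 0.4435068522, 1.149604398
)


def cascade(x):
    """x maps sample index -4..4 to a value; returns (a0, d0) via P1, U1, P2, U2."""
    d1 = {i: ALPHA * (x[2 * i] + x[2 * i + 2]) + x[2 * i + 1] for i in (-2, -1, 0, 1)}
    a1 = {i: BETA * (d1[i] + d1[i - 1]) + x[2 * i] for i in (-1, 0, 1)}
    d2 = {i: GAMMA * (a1[i] + a1[i + 1]) + d1[i] for i in (-1, 0)}
    a2 = DELTA * (d2[0] + d2[-1]) + a1[0]
    return ZETA * a2, d2[0] / ZETA


def closed_form(x):
    """Evaluate the collected single-step expressions for a0 and d0."""
    def predict(centre, left, right):
        return ALPHA * (x[left] + x[right]) + x[centre]

    d0 = predict(1, 0, 2)
    dm1 = predict(-1, -2, 0)
    dp1 = predict(3, 2, 4)
    dm2 = predict(-3, -4, -2)
    a = (ZETA * (3 * BETA * GAMMA * DELTA + BETA + DELTA) * (d0 + dm1)
         + ZETA * BETA * GAMMA * DELTA * (dp1 + dm2)
         + ZETA * GAMMA * DELTA * (x[0] + x[2] + x[0] + x[-2])
         + ZETA * x[0])
    d = ((2 * BETA * GAMMA + 1) * d0
         + BETA * GAMMA * (dm1 + dp1)
         + GAMMA * (x[0] + x[2])) / ZETA
    return a, d


samples = dict(zip(range(-4, 5), [3, -7, 12, 5, 9, -1, 4, 8, -6]))
print(cascade(samples), closed_form(samples))
```

Both functions should agree to within floating-point rounding, which is what allows a single combinational stage to replace the four cascaded stages.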
The FPGA implementation of the modified lifting-based DWT is designed based on the following:

* The input data X should be 8-bit signed data.
* The blocks from X -4 to X +4 represent the nine input samples, each in 8-bit signed representation, implemented as SIPO (serial in, parallel out) registers.
* The input stream is given through a single input line to the nine SIPOs. Their outputs are taken in parallel to perform the addition and multiplication operations.
* The addition and multiplication operations use 8-bit signed operators.
* The intermediate results of these additions and multiplications are stored in registers rather than in memory, because data can be stored in registers easily and in any order, whereas in memory the storage (the write operation in particular) must be done in an orderly fashion.
* These intermediate registers are of PIPO structure and of 8-bit signed representation.
* Although the final output ports ai and di are single-bit, the results are stored in registers of PISO structure, since each output has to be taken out as 8 bits.

Considering the above design specifications, the architecture shown in Figure 5 is designed as per the requirements.
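The serial loading of the nine samples can be modelled behaviourally as one long shift chain. The sketch below is not the HDL of the design; it assumes MSB-first bit order and two's-complement samples, and the helper names `to_bits`, `to_signed` and `load_samples` are hypothetical. It shows why nine 8-bit samples on a single input line need 72 clock cycles.

```python
def to_bits(value, width=8):
    """Two's-complement bits of a signed value, MSB first (assumed order)."""
    return [(value >> k) & 1 for k in range(width - 1, -1, -1)]


def to_signed(value, width=8):
    """Interpret an unsigned field of `width` bits as a signed number."""
    return value - (1 << width) if value & (1 << (width - 1)) else value


def load_samples(serial_bits, count=9, width=8):
    """Shift bits through a chain of `count` SIPO registers, one bit per clock.

    Returns the parallel outputs as signed values and the clocks consumed.
    """
    chain = 0
    chain_mask = (1 << (count * width)) - 1
    clocks = 0
    for bit in serial_bits:
        chain = ((chain << 1) | (bit & 1)) & chain_mask  # one clock edge
        clocks += 1
    field_mask = (1 << width) - 1
    samples = []
    for idx in range(count):
        shift = width * (count - 1 - idx)  # the first sample sits at the top
        samples.append(to_signed((chain >> shift) & field_mask, width))
    return samples, clocks


inputs = [-4, 17, -128, 127, 0, 5, -1, 64, -90]
stream = [b for s in inputs for b in to_bits(s)]
parallel, clocks = load_samples(stream)
print(parallel, clocks)
```

A PISO register at the output performs the reverse operation, shifting one stored 8-bit result out over eight clocks.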

The inputs are 8-bit signed values at all times. The outputs, however, may not remain signed numbers and can be wider than 8 bits, because every adder creates an extra bit and every multiplier creates more than one extra bit. This bit growth can be a major cause of hardware failure, and the architecture might not work properly. In order to minimize the error, suitable modifications are carried out.


# Modifications to minimize the Errors:

The possible modifications to the calculations that can minimize the errors are:

1. An adder adds two 8-bit numbers and gives the result as a 9-bit number. Instead of keeping the 9-bit number, the LSB is discarded. Because only the least significant bit is discarded, the value does not change drastically, and the output is still 8-bit data that is used for further operations.
2. The multiplier multiplies an 8-bit number by a coefficient. The hardware is different for each multiplication, so the final architecture requires many multipliers of different widths, which in turn give different output values.
3. The lifting coefficients, which should be 8-bit signed numbers, take decimal values such as 0.458, which makes the computation very difficult. Multiplying by such a number takes the multiplier more time, and the final output would be a decimal such as 57.35.

The lifting coefficient is therefore multiplied by an integer so that it becomes a value such as 57, which is much easier to represent. The coefficients should be multiplied such that the final values are 8-bit signed numbers without any decimal part (57.000, for example). This can be achieved by taking only the positive integer part and discarding the digits after the decimal point (XY.xy becomes XY). All the lifting coefficients then have integer values without any decimal part, so the calculations become much easier. From the architectural calculations, the values of ai and di are 65 and 39 respectively, and they match the theoretical calculations.
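The two simplifications above, dropping the adder's LSB and scaling the coefficients to integers, can be illustrated with a short Python sketch. It is a behavioural illustration rather than the hardware itself, and the function names `add8` and `scale_coefficient` are hypothetical. The scale factor 32 is one of the power-of-two constants mentioned in the paper.

```python
def add8(a, b):
    """Add two signed 8-bit values and drop the LSB of the 9-bit sum.

    The arithmetic right shift keeps the result within the 8-bit signed
    range (-128..127) at the cost of halving the value.
    """
    return (a + b) >> 1


def scale_coefficient(value, factor):
    """Scale a fractional coefficient and keep only its integer part."""
    return int(value * factor)  # int() discards the digits after the point


print(add8(127, 127))                      # largest positive operands
print(add8(-128, -128))                    # most negative operands
print(scale_coefficient(1.586134342, 32))  # 50.756... is truncated
```

Because every LSB-discarding addition halves its result, the architectural values come out as scaled-down versions of the theoretical ones, while scaling 1.586134342 by 32 truncates to 50, in agreement with one of the integer coefficients listed in the paper.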

By comparing those values we can come to know that:


# IV. HDL Modeling and FPGA Implementation

The top-level module of the DWT architecture is shown in Figure 8, which identifies the input and output ports. The input ports are clk, en, piso_load, rst and ser_in, and the output ports are ai and di. The nine input samples, each 8-bit signed data, enter the design through the ser_in input. The rst signal resets the design when it is high. When the en signal is high, the input data is loaded into all nine 8-bit registers for 280 clock cycles. The piso_load signal is used to take the outputs at ai and di; it is kept high for 8 clock cycles because the 8 bits are taken out through a single line.

The HDL models of the sub-blocks can be understood from the internal hardware of the RTL schematic in Figure 9, where all the sub-blocks can be viewed. The sub-blocks are designed from the DWT equation. The multipliers are constant coefficient multipliers, chosen because they are faster than other types for this application, and are taken as IP cores from the Xilinx library, one for each coefficient. The adder is an 8-bit signed operator modelled in HDL and instantiated wherever necessary. The simulation of the top-level module is shown in Figure 10, where the intermediate signals show the behaviour of the sub-blocks within the overall simulation and the integration of the sub-blocks in the top-level architecture.

The registers used in the design cover all three types: SIPO at the initial stage for the inputs, PIPO for the intermediate operations, and PISO at the output stage to take the outputs serially, one bit per clock for 8 clocks, since two 8-bit outputs must be taken serially. Figure 9 shows the top-level ports; the serial input data is given in a random way. It is loaded into the SIPO registers while the enable signal is high. After 72 clock cycles the enable is made low, the operation is performed over four clock cycles, and the output is taken while the piso_load signal is high for 8 clock cycles, as the output is 8 bits wide.


# Conclusion

In this work a modified lifting-based DWT architecture is proposed, designed, modeled and verified. The design is modeled using HDL and implemented on an FPGA. The interfaces required for data processing are also designed and are used to synchronize the data transfer operation. The HDL models and simulation of the sub-blocks have been completed to model the top-level design architecture. A test bench has been developed to verify the functionality and performance of the sub-modules and the top-level architecture. The design has been implemented on the FPGA and verified and debugged through ChipScope. Pre- and post-synthesis simulations have been carried out and compared. The design can be further optimized for video signal processing.
Image compression plays an important role among all the image processing techniques. Compression techniques are of two types: lossless and lossy. The most common image format that uses a lossy compression scheme is the JPEG (Joint Photographic Experts Group) format. The JPEG 2000 structure is a wavelet-based compression methodology that provides a number of benefits over the Discrete Cosine Transformation (DCT) compression method, which was used in the JPEG format. Wavelet compression converts the image into a series of wavelets that can be stored more efficiently than pixel blocks. The wavelet compression is accomplished through the use of the JPEG 2000 encoder, as shown in Figure 1.
![Figure 1 : JPEG 2000 Block Diagram](image-3.png "Figure 1 :")
![Figure 3 : a) Classical Implementation, b) Lifting scheme based DWT [6]](image-4.png "Figure 3 :")

The lifting scheme consists of three steps: SPLIT, PREDICT and UPDATE, as shown in Figure 3 (b).
![Figure 4 : Lifting Scheme Architecture](image-5.png "Figure 4 :")
By re-arranging all the values and the constant coefficients, the final equation can be derived. Being a dedicated DWT core for JPEG 2000, the filter coefficients are fixed. The filter coefficients are: α = 1.586134342, β = 0.05298011854, γ = 0.8829110762, δ = 0.4435068522, ζ = 1.149604398. Substituting these values into the modified equation gives coefficient values that are also decimals; multiplying them by constants turns them into integers as: 1 * 32 = 57, 2 * 256 = 6, 3 * 64 = 30, 4 * 32 = 35, 5 * 256 = 12, 6 * 32 = 26, 7 * 32 = 50.
![Figure 5 : Modified Lifting Scheme Architecture for DWT](image-7.png "Figure 5 :")
The multipliers are of constant coefficient type; to reduce the complexity of multiplication, the IP cores in Xilinx are used. Of all the sub-blocks, the adder has the highest delay and the highest utilisation of resources. By instantiating these sub-blocks, the area utilised by the DWT architecture is 12% and the delay is 3.313 ns. The delays of the individual blocks are given in Table 1. Almost all of them work at different clock frequencies, as the delay mentioned in the table is the minimum period of the design clock. The top-level design, however, should work at one clock frequency, so the need to synchronise the clocks arises. The clock frequency of the top-level architecture should synchronise with the sub-modules. In general, the problem of synchronisation is addressed by any of the following:

* Increase the system clock period (usually not feasible).
* Decrease tcomb (use no combinational logic).
* Decrease tsu (use fast flip-flops).
* Increase the synchroniser clock period.

Figure 6 represents the clock synchroniser.
![Figure 6 : Clock Synchroniser](image-9.png "Figure 6 :")
![Figure 7 : Configuring DCM [2]](image-10.png "Figure 7 :")

By configuring the DCM in frequency mode, the tool generates an instantiation template, which is used in the design so that the whole design runs on the same clock. The present design runs at an operating frequency of 280 MHz.

Observations:

* The equation of the lifting scheme for the two-level DWT is simplified based on the basic equations mentioned.
* The simplified equation is turned into an architecture such that both ai and di are implemented using the same architecture.
* The mathematical and the architectural computations of the equation are carried out and compared. The architectural computations are observed to be a modified version of the mathematical ones, where discarding the LSBs results in a scaling down of the original values.
* The power, area and delay of the sub-blocks are observed. The adder has the maximum delay, 3.39 ns, and the maximum utilisation of resources, 13%, while the SIPO and PIPO registers have the least delay and the least utilisation of resources, 1%.
![Figure 8 : Top-level DWT architecture](image-11.png "Figure 8 :Figure 9 :")
Thus the same procedure follows for 8 (load) + 4 (operation) + 8 (output) = 22 clocks. To program a single device using iMPACT, all that is needed is a bitstream file. To program several devices in a daisy chain configuration, or to program devices using a PROM, iMPACT is used to create a PROM file. iMPACT accepts any number of bitstreams and creates one or more PROM files containing one or more daisy chain configurations.
![Figure 10 : Simulation results for the DWT architecture](image-13.png "Figure 10 :")
![Figure 11 : Program downloaded into FPGA](image-14.png "Figure 11 :")
![Figure 12 : Post-Synthesis simulation](image-15.png "Figure 12 :")


1. The architecturally calculated values are in 8-bit signed representation, while the theoretically calculated values are unsigned.
2. The architectural values do not have any decimal values.
3. The architectural values do not exceed 8 bits.
4. The intermediate calculations are always 8-bit and signed instead of 9 or more bits.
5. The outputs of the adder in the architectural calculations are 8-bit, obtained by discarding the LSB, rather than 9-bit values that would continue to grow at the next level of addition.

Estimation of Power, Area and Delay of Sub-Blocks of Architecture:

The main sub-blocks of the modified lifting scheme architecture are:

* Adders
* Multipliers (Constant Coefficient IP Cores)
* Registers

Table 1 presents the estimation of power, delay and area of these sub-blocks.
© 2012 Global Journals Inc. (US), Global Journal of Computer Science and Technology
		
		
1. N. Jayant, P. Noll. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Englewood Cliffs, NJ: Prentice-Hall, 1984.
2. C. Diou, L. Torres, M. Robert. A wavelet core for video processing. Presented at the IEEE Int. Conf. Image Process., Sept. 2000.
3. N. Jayant, J. Johnston, R. Safranek. Signal compression based on models of human perception. Proc. IEEE, vol. 81, Oct. 1993.
4. B. Zovko-Cihlar, S. Grgic, D. Modric. Coding techniques in multimedia communications. Proc. 2nd Int. Workshop Image and Signal Processing, IWISP'95, Budapest, Hungary, 1995.
5. Digital Compression and Coding of Continuous Tone Still Images, ISO/IEC IS 10918, 1991.
6. I. Daubechies. Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.
7. S. Mallat. A theory of multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Machine Intell., vol. 11, July 1989.
8. M. Nagabushanam, Cyril Prasanna Raj P, S. Ramachandran. Design and Implementation of Parallel and Pipelined Distributive Arithmetic Based Discrete Wavelet Transform IP Core. European Journal of Scientific Research, ISSN 1450-216X, vol. 35, no. 3, 2009.
9. M. Vishwanath, R. Owens, M. J. Irwin. VLSI architectures for the discrete wavelet transform. IEEE Trans. Circuits Syst. II, vol. 42, May 1995.
10. J. S. Fridman, E. S. Manolakos. Discrete wavelet transform: Data dependence analysis and synthesis of distributed memory and control array architectures. IEEE Trans. Signal Processing, vol. 45, May 1997.
11. T. Acharya. A high speed systolic architecture for discrete wavelet transforms. Proc. IEEE Global Telecommun. Conf., vol. 2, 1997.
12. C. Lian, K. F. Chen, H. H. Chen, L. G. Chen. Lifting Based Discrete Wavelet Transform Architecture for JPEG2000. IEEE International Symposium on Circuits and Systems, Sydney, Australia, 2001.
13. M. Martina, G. Masera, G. Piccinini, M. Zamboni. Novel JPEG 2000 Compliant DWT and IWT VLSI Implementations. Journal of VLSI Signal Processing, vol. 34, 2003.
14. C. C. Liu, Y. H. Shiau, J. M. Jou. Design and Implementation of a Progressive Image Coding Chip Based on the Lifted Wavelet Transform. Proc. of the 11th VLSI Design/CAD Symposium, Taiwan, 2000.
15. H. Liao, M. K. Mandal, B. F. Cockburn. Efficient Architectures for 1-D and 2-D Lifting-Based Wavelet Transform. IEEE Transactions on Signal Processing, vol. 52, no. 5, 2004.
16. W. H. Chang, Y. S. Lee, W. S. Peng, C. Y. Lee. A Line-Based, Memory Efficient and Programmable Architecture for 2D DWT Using Lifting Scheme. IEEE International Symposium on Circuits and Systems, Sydney, Australia, 2001.
17. C. T. Huang, P. C. Tseng, L. G. Chen. Flipping Structure: An Efficient VLSI Architecture for Lifting-Based Discrete Wavelet Transform. IEEE Transactions on Signal Processing, 2004.
18. K. Andra, C. Chakrabarti, T. Acharya. A VLSI Architecture for Lifting-Based Forward and Inverse Wavelet Transform. IEEE Trans. of Signal Processing, vol. 50, no. 4, 2002.
19. H. Liao, M. K. Mandal, B. F. Cockburn. Novel Architectures for Lifting-Based Discrete Wavelet Transform. Electronics Letters, vol. 38, no. 18, 2002.