Entropy of Data Compression Using Object Oriented Data Warehousing

1. INTRODUCTION

A data warehouse is a mechanism for data storage and retrieval. Data can be stored and retrieved using a multidimensional (hypercube) structure, a relational star schema, or several other storage techniques. The task of transitioning from a procedural mindset to an object-oriented paradigm can seem overwhelming; however, the transition does not require developers to step into another dimension or go to Mars in order to grasp a new way of doing things. In many ways, the object-oriented approach to development more closely mirrors the world we have been living in all along: we each know quite a bit about objects already, and it is that knowledge we must discover and leverage in moving to object-oriented tools and methodologies [8]. Our research takes a different point of view: our primary motivation is to show how existing applications can be enhanced using object-oriented technology. Like many new ideas, object-oriented programming does not have a universally accepted definition [1,2]. Ideas on the subject do, however, seem to be converging; the best definition we have seen to date is "object-oriented = objects + classes + inheritance" [3]. OOP can also be defined as an extension of the idea of the abstract data type.

2. ENTROPY IN DATA COMPRESSION

Data compression is of interest in business data warehousing, both because of the cost savings it offers and because of the large volume of data manipulated in many business applications. The types of local redundancy present in business data files include runs of zeros in numeric fields, sequences of blanks in alphanumeric fields, and fields which are present in some records and null in others. Run length encoding can be used to compress sequences of zeros or blanks. Null suppression may be accomplished through the use of presence bits. Another class of methods exploits cases in which only a limited set of attribute values exist. Dictionary substitution entails replacing alphanumeric representations of information such as bank account type, insurance policy type, sex, month, etc. by the few bits necessary to represent the limited number of possible attribute values. The problem of compressing digital data can be decoupled into two subproblems: modeling and entropy coding. Whatever the given data may represent in the real world, in digital form it exists as a sequence of symbols, such as bits. The modeling problem is to choose a suitable symbolic representation for the data and to predict for each symbol of the representation the probability that it takes each of the allowable values for that symbol. The entropy-coding problem is to code each symbol as compactly as possible, given this knowledge of probabilities. (In the realm of lossy compression, there is a third subproblem: evaluating the relative importance of various kinds of errors.)
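As an illustration of the run-length idea described above, the following C++ sketch compresses runs of zero bytes within a field. The function name and the simple marker/length output format are assumptions made for this example, not details taken from the original work.

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal run-length encoder for zero bytes: a run of zeros is emitted as a
// 0x00 marker followed by the run length; all other bytes pass through
// unchanged. The marker convention here is an illustrative choice.
std::vector<std::uint8_t> rle_zero_encode(const std::vector<std::uint8_t>& field) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < field.size(); ) {
        if (field[i] == 0) {
            std::size_t run = 0;
            while (i + run < field.size() && field[i + run] == 0 && run < 255) {
                ++run;
            }
            out.push_back(0x00);                            // zero-run marker
            out.push_back(static_cast<std::uint8_t>(run));  // run length, 1..255
            i += run;
        } else {
            out.push_back(field[i]);
            ++i;
        }
    }
    return out;
}

Null suppression with presence bits and dictionary substitution for low-cardinality attributes follow the same pattern: a small amount of bookkeeping replaces a verbose but highly redundant representation.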

For example, suppose it is required to transmit messages composed of the four letters a, b, c, and d. A straightforward scheme for coding these messages in bits would be to represent a by "00", b by "01", c by "10", and d by "11". Suppose, however, it is known that for any letter of the message (independent of all other letters), a occurs with probability .5, b with probability .25, and c or d with probability .125 each. Then a shorter representation might be chosen for a, at the necessary cost of accepting longer representations for the other letters: a could be represented by "0", b by "10", c by "110", and d by "111". This representation is more compact on average than the first one; indeed, it is the most compact representation possible (though not uniquely so). In this simple example, the modeling part of the problem is determining the probabilities for each symbol; the entropy-coding part is determining the representations in bits from those probabilities, and the probabilities associated with the symbols play a fundamental role in entropy coding. One well-known method of entropy coding is Huffman coding, which yields an optimal coding provided all symbol probabilities are integer powers of .5. Another method, yielding optimal compression performance for any set of probabilities, is arithmetic coding. In spite of the superior compression given by arithmetic coding, so far it has not been a dominant presence in real data-compression applications. This is most likely due to concerns over speed and complexity, as well as patent issues; a rapid, simple algorithm for arithmetic coding is therefore potentially very useful. An algorithm which allows rapid encoding and decoding in a fashion akin to arithmetic coding is the Q-coder; the QM-coder is a subsequent variant. However, because these algorithms are protected by patents, new algorithms with competitive performance continue to be of interest. The ELS algorithm is one such algorithm.
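To make the arithmetic concrete, the following small C++ program computes the Shannon entropy of the four-letter source above and the average length of the variable-length code {0, 10, 110, 111}. Both come out to 1.75 bits per symbol, which is why that code is optimal; the program is purely an illustration of the calculation.

#include <cmath>
#include <cstdio>

int main() {
    // Probabilities of a, b, c, d and the lengths of their codewords
    // 0, 10, 110 and 111 from the example above.
    const double p[4]   = {0.5, 0.25, 0.125, 0.125};
    const int    len[4] = {1, 2, 3, 3};

    double entropy = 0.0, avg_len = 0.0;
    for (int i = 0; i < 4; ++i) {
        entropy += -p[i] * std::log2(p[i]);  // H = -sum p * log2(p)
        avg_len += p[i] * len[i];            // expected codeword length
    }
    // Both values print as 1.75: the code meets the entropy bound exactly
    // because every probability is an integer power of 0.5.
    std::printf("entropy = %.2f bits, average length = %.2f bits\n",
                entropy, avg_len);
    return 0;
}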

The ELS-coder works only with an alphabet of two symbols (0 and 1). One can certainly encode symbols from larger alphabets, but they must be converted to a two-symbol format first. The necessity for this conversion is a disadvantage, but the restriction to a two-symbol alphabet facilitates rapid coding and rapid probability estimation.
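A minimal sketch of that conversion, reusing the four-letter alphabet and the prefix code from the previous section; the particular mapping chosen here is an assumption for illustration only.

#include <string>
#include <vector>

// Convert symbols from a four-letter alphabet into the two-symbol (bit)
// alphabet the ELS-coder expects, using the prefix code a -> 0, b -> 10,
// c -> 110, d -> 111 from the earlier example.
std::vector<int> binarize(const std::string& message) {
    std::vector<int> bits;
    for (char ch : message) {
        switch (ch) {
            case 'a': bits.push_back(0); break;
            case 'b': bits.insert(bits.end(), {1, 0}); break;
            case 'c': bits.insert(bits.end(), {1, 1, 0}); break;
            case 'd': bits.insert(bits.end(), {1, 1, 1}); break;
            default:  break;  // ignore symbols outside the alphabet
        }
    }
    return bits;  // each bit is then fed to the binary entropy coder
}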

The ELS-coder decoding algorithm has already been described. The encoder must use its knowledge of the decoder's inner workings to create a data stream which will manipulate the decoder into producing the desired sequence of decoded symbols. As a practical matter, the encoder need not actually consider the entire coded data stream at one time. The coded data stream can be partitioned at any time into three portions; from end to beginning of the stream they are: preactive bytes, which as yet exert no influence over the current state of the decoder; active bytes, which affect the current state of the decoder and have more than one consistent value; and postactive bytes, which affect the current state of the decoder and have converged to a single consistent value. Each byte of the coded data stream goes from preactive to active to postactive; the earlier a byte's position in the stream, the earlier these transitions occur. A byte is not actually written to the external file until it becomes postactive. Only the active portion of the data stream need be considered at any time. Since the internal buffer of the decoder contains two bytes, there are always at least two active bytes. The variable backlog counts the number of active bytes in excess of two. In theory backlog can take arbitrarily high values, but higher values become exponentially less likely.

3. METHODOLOGY

The following steps will be taken in the future work (a rough measurement sketch follows the list):
1. Creation of databases of different sizes in Oracle.
2. Employment of object-oriented programming for compression using data warehousing.
3. Further compression of database .csv files using C++.
4. Comparison of time taken and compression efficiency for the different database sizes.
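A rough sketch of how steps 3 and 4 could be measured; the file name, the use of the illustrative rle_zero_encode() routine from Section 2, and the reported metrics are all assumptions rather than the finished implementation.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// Defined in the Section 2 sketch; any other compression routine could be
// substituted here.
std::vector<std::uint8_t> rle_zero_encode(const std::vector<std::uint8_t>& field);

int main() {
    // Read a .csv file exported from the Oracle database (file name is hypothetical).
    std::ifstream in("export_100mb.csv", std::ios::binary);
    std::vector<std::uint8_t> raw((std::istreambuf_iterator<char>(in)),
                                  std::istreambuf_iterator<char>());

    auto start  = std::chrono::steady_clock::now();
    auto packed = rle_zero_encode(raw);
    auto stop   = std::chrono::steady_clock::now();

    double ms    = std::chrono::duration<double, std::milli>(stop - start).count();
    double ratio = raw.empty() ? 0.0
                               : static_cast<double>(packed.size()) / raw.size();
    std::printf("time: %.1f ms, compressed/original size ratio: %.3f\n", ms, ratio);
    return 0;
}

Running the same driver over exports of different sizes gives the time and compression-efficiency comparison called for in step 4.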

4. CONCLUSION

A data warehouse is an essential component of a decision support system. The traditional data warehouse provides only numeric and character data analysis, but as information technologies progress, complex data such as semi-structured and unstructured data become widely used [2], [3]. Data compression is of interest in business data warehousing, both because of the cost savings it offers and because of the large volume of data manipulated in many business applications. Entropy is used in many areas, such as image processing and document images; in our research we apply it to object-oriented data warehousing. The planned work comprises the creation of databases of different sizes in Oracle, the employment of object-oriented programming for compression using data warehousing, the further compression of database .csv files using C++, and a comparison of time taken and compression efficiency for the different database sizes.

Figure 1. Comparison of time taken and compression efficiency for different sizes of databases.

Appendix A

  1. A. Ortega. Entropy- and complexity-constrained classified quantizer design for distributed image classification. IEEE Workshop on Multimedia Signal Processing, Dec. 2002.
  2. A. Scales, W. Roark, F. Kossentini, M. J. T. Smith. Lossless Compression Using Conditional Entropy-Constrained Subband Quantization. Proceedings of the Data Compression Conference (DCC '95), Mar. 1995.
  3. Boqiang Huang, Yuanyuan Wang, Jianhua Chen. 2-D Compression of ECG Signals Using ROI Mask and Conditional Entropy Coding. IEEE Transactions on Biomedical Engineering, 56(4), April 2009.
  4. Wen-Yang Lin, Wei-Chou Chen, Tzung-Pei Hong. Three maintenance algorithms for compressed object-oriented data warehousing.
  5. C. Tu, T. D. Tran. Context-based entropy coding of block transform coefficients for image compression. IEEE Transactions on Image Processing, 11(11), Nov. 2002.
  6. H. Jegou, C. Guillemot. Entropy coding with variable length re-writing systems. Proceedings of the International Symposium on Information Theory (ISIT 2005), Sept. 2005.
  7. Hua Xie.
  8. I. De, J. Sil. Wavelet entropy based no-reference quality prediction of distorted/decompressed images. 2nd International Conference, April 2010.
  9. I. De, J. Sil. ANFIS tuned no-reference quality prediction of distorted/decompressed images featuring wavelet entropy. International Conference on Computer Information Systems and Industrial Management Applications (CISIM), Oct. 2010.
  10. J. C. Shieh, H. W. Lin. The Novel Model of Object-Oriented Data Warehouses. Workshop on Databases and Software Engineering, 2006.
  11. Kim Sang Hyun.
  12. L. Liu, Y. Dong, X. Song, G. Fan. An entropy based segmentation algorithm for computer-generated document images. Proceedings of the 2003 International Conference, 2003.
  13. Rae-Hong Park. A novel approach to scene change detection using a cross entropy. Proceedings of the 2000 International Conference, 2000.
  14. S. Chen, J. H. Reif. Using difficulty of prediction to decrease computation: fast sort, priority queue and convex hull on entropy bounded inputs. Proceedings of the 34th Annual Symposium, Nov. 1993.
  15. Wei-Chou Chen, Tzung-Pei Hong, Wen-Yang Lin. Using the compressed data model in object-oriented data warehousing. Proceedings of the 1999 IEEE International Conference (IEEE SMC '99), 1999.
  16. Wei-Chou Chen, Tzung-Pei Hong.
  17. Wei-Yang Lin. A composite data model in object-oriented data warehousing. Proceedings of Technology of Object-Oriented Languages and Systems (TOOLS 31), 1999.
  18. Y. Gong, M. K. H. Fan, C.-M. Huang. On entropy-constrained residual vector quantization design. Proceedings of the Data Compression Conference (DCC '99), Mar. 1999.
Date: 2011-09-17