Incremental Maintenance of a Materialized View in Data Warehousing: An Effective Approach

Table of contents

1. Introduction

t has been observed that in most typical data analysis and data mining applications, timeliness and interactivity are more important considerations than accuracy; thus, data analysts are often willing to overlook small inaccuracies in the answer, provided that the answer can be obtained fast enough. This observation has been the primary driving force behind the recent development of approximate query processing techniques for aggregation queries in traditional databases and decision support systems [4], [5]. Numerous approximate query processing techniques have been developed: The most popular ones are based on random sampling, where a small random sample of the rows of the database is drawn, the query is executed on this small sample, and the results are extrapolated to the whole database. In addition to simplicity of implementation, random sampling has the compelling advantage that, in addition to an estimate of the aggregate, one can also provide confidence intervals of the error, with high probability.

Broadly, two types of sampling-based approaches have been investigated: 1) pre-computed samples, where a random sample is pre-computed by scanning the database and the same sample is reused for several queries and 2) online samples, where the sample is drawn "on the fly" upon encountering a query. So the selection of these random samples in distributed environments for query processing is addressed in [6]. Data warehouses (DW) [6] are built by gathering information from data sources and integrating it into one virtual repository customized to users' needs. One important task of a Data Warehouse Management System (DWMS) is to maintain the materialized view upon changes of the data sources, since frequent updates are common for most data sources. In addition, the requirements of a data source are likely to change during its life-cycle, which may force schema changes for the data source. A schema change could occur for numerous other reasons, including design errors, the addition of new functionalities and even new developments in the modeled application domain. Even in fairly standard business applications, rapid schema changes have been observed. In [10], significant changes (about 59% of attributes on the average) were reported for seven different applications over relational databases. A similar report can also be found in [15]. These applications ranged from project tracking, sales management, to government administration.

In situations that real-time refreshment of the data ware-house content is not critical; changes to the sources are usually buffered and propagated periodically such as once a day to refresh the view extent. Two benefits are possible. One is to gain better maintenance performance. The other is that there are less conflicts with DW read sessions. In a data update only environment, most view maintenance (VM) algorithms proposed in the literature [17,1,14] group the updates from the same relation and maintain such a large delta change in a batch fashion. However, these algorithms would fail whenever source schema changes occur, which are also common as stated above. One obvious reason is that the data updates in this group may be schema inconsistent with each other if there are some schema changes in between. On the other hand, work has begun on incorporating source schema changes into the data warehouse, namely, view synchronization (VS) [8] aims at rewriting the DW view definition when the source schema has been changed.

To handle the delete of any schema information of a data source, VS tries to locate an alternative source for replacement to keep the new view semantically as close to the original view as possible. Thereafter, view adaptation (VA) [12] incrementally adapts the view extent to keep the new view consistent. Such algorithms are also not sufficient to batch a group of mixed data updates and schema changes, since there could be a number of schema changes interleaved with some data updates. In this paper, we propose a solution strategy that is capable of batching a mixture of both source data updates II.

2. Definition of Terms

View evaluation can be represented by a tree, called an expression tree [5,9]. An expression tree is a tree, where the leaf nodes represent base relations and non-leaf nodes represent binary expressions in the relational algebra. The unary relational algebraic expressions are associated along the edges. A view or a query is optimized by the query optimizer before executing it. A query optimizer takes an expression tree as input and produces an output, called an optimized expression tree, which determines the internal sequence of operations for executing a query. Thus, an optimized expression tree defines a partial order in which operations must be performed in order to produce the result of the view.

3. Depth:

The depth of leaf nodes, that is data base relations is 0. The depth d of a node is defined as max(depth of descendents)+1.

4. Height:

The height of the optimized expression tree is defined as the maximum depth of any node in the tree.

Given a node i in the expression tree, its parent is denoted by i, and op(i) and op( i) are the expressions associated with i and i, respectively. The children of node i are denoted by i' and i'' where i' is a sibling of i'' and vice versa. IR i denotes the intermediate result of node i. The auxiliary relation associated with node I is denoted AR i in the case where only one relation is needed, and by AR 1 i and AR 2 i when two are needed. The key of IR i is denoted by K i , and the keys of IR i' and IR i'' are denoted by K i' and K i'' , respectively. Insertion and deletion of tuples are denoted by and respectively. The symbol ? either an inserted set or a deleted set of tuples. The instance of a relation, say Ri, before and after an update is denoted by Ri old an d Ri n ew respectively, similiary for an auxiliary relation AR and a materialized view V.

5. III.

6. Example & Simplification

Consider a data warehouse for a large research organization which has got many departments and each department has many research groups. Suppose this data warehouse is collecting data from four base relations whose schemas are as follows:

7. R1:

emp_rschr(rschr_id,rname,deptno,major) This relation gives the researchers id, name, department and major.

8. R2:emp_paperpublish(rschrid,paper_id,paper_title,sour ce_of_publiscation, year_of_publish)

This gives researchers id,paper id, paper title, source of publication and year of publish.

9. R3: emp_manager(rschr_id,deptno)

This relation contain one record for each manager and his department. Assume that each department has one manager. Since a manager is also a researcher, relation emp_rschr has a tuple for each manager.

10. R4: emp_groupleader(rschr_id,deptno)

This relation contains information about th research group name and who is leading this group. Since a group leader is also a researcher, relation emp_rshcr has a tuple for each group leader.

Suppose a user of the organization is interested in materializing and maintaining the following view:

'Researchers other than managers and group leaders along with their departments who have published more than 10 papers in the year 2010.'

In

11. C

to the auxiliary relations incrementally. Deferring the changes allows analysts that query the warehouse to see a consistent snapshot of the data throughout the day, and can make the maintenance more efficient. Figure 1 shows the optimized expression tree for the above view. Here, the nodes at leaf level are base relations and non-leaf nodes are expressions. Each nonleaf node in the tree corresponds to a relational algebraic expression given above.

12. Figure 1: Expression tree

Suppose Researchers or Paper_Public relations are updated. In this case we materialize the two auxiliary relations View2 and View3. The contents of these views are derived while computing the view first time. By materializing these two auxiliary relations in the warehouse, the view is self-maintainable along with these auxiliary relations. Suppose new researchers joined the organization, therefore, one tuple for each new researcher in emp_rschr relation has to be inserted. These insertions will led to generate tuples that to be inserted in rschr_ex_manager_groupleader. Since these new researchers have not published any paper at the time of joining, these tuples cannot join with any tuples of emp_paper_publish, thus there will no change in the materialized views. Therefore, all auxiliary relations and materialized views are self_maintainable. Now consider another case where a set of tuples is inserted in emp_paper_publish relation, say R. Then, we first compute the research paper those are published in year 2010 and then it is join with rschr_ex_managergroupleader view. Lastly the intermediate result is grouped in the final auxiliary relation by performing count operation. In this case also, the view and auxiliary relations are self-maintainable.

13. IV.

14. Procedure of Materialize Views Maintenance

The materialize view maintenance process can be divided into two functions: 1. Propagate and 2. Refresh. The work of computing the auxiliary relations happens within the propagate function, which can take place without locking materialize views so that the warehouse can continue to be made available for querying by analysts. Materialize views are not locked until the refresh function, during which time the materialize views are updated from the auxiliary relations.

The propagate function involves updating the auxiliary views incrementally from deferred set of changes. The final auxiliary view represents the net changes to the materialize views due to the changes in the underlying data sources.

The refresh function applies the net changes represented in the final auxiliary relation to the materialize views. This process carried out after a specific time interval or when the system has free cycles. So none of the data warehouse users or operations are affected by the view maintenance process. None of the query has to pay for view maintenance. The materialize view maintenance process totally hidden by users and running transactions. Whenever an interested change happens in the underlying data source, simply this desire change is stored in the auxiliary relations by comparing and joining it with others relations if required. This change is passed to the higher level auxiliary relations. Again the change is integrated and circulated to final auxiliary relation. Lastly the change is refreshed into the data warehouse when the refresh trigger is occur.

15. a) Analytical Cost Model

In this section we show the performance results of our materialize view maintenance method. The results are based on the following cost model.

16. Global Journal of Computer Science and Technology

Volume XVIII Issue III Version I

17. i. Cost Model

The overall view maintenance cost of materialized views includes the cost of propagate the changes and the cost of refresh operations. Let V 1 ,V 2 ,?.,V m be the m materialized views. Let B 1 , B 2 ,?,B n be the n base relations and A 1 ,A2,?A i be the i auxiliary relations. Let f u1 B1 ,?..,f un Bn be the update frequency to the base relations. Let C ij B->A be the cost of propagating an update on base relation B i to auxiliary relation A j and C jk A->V be the cost of refresh of auxiliary A j to materialized view V k . The overall cost of maintaining the views when keeping both the materialized views and the auxiliary relations is:

C MV +AR=? ?f ???? ???? ? * (? C?> ??? ?? ?? ??? ?? ?? ????? ?? =1 ?? =1 ?? ?? ?? ??? ??=?? ??=1

The total view maintenance cost with no auxiliary relations is:

C MV =? ?f ???? ???? ? * ??=?? ??=1 (? C ??=?? ??=1

It is obvious that the cost of maintaining the materialized views directly from base relations is much more than the cost of maintaining materialized views through auxiliary relations.

V.

18. Evaluation

To verify the feasibility and effectiveness of our view maintenance strategies and corresponding optimization framework, we have implemented the proposed techniques using Oracle 9i. All experiments were performed on a workstation with Pentium D 3.2 GHz processor, 1 GB of memory and 160 GB disks, running Windows XP.

Relation R1 contain 500000 records, R2 contains 25000 records, where as in R3 there are records of individual manager of a department and in R4 holds the records of group leaders. We considered two types of changes:

Update-Generating changes: Insertions and deletions of an equal number of tuples over existing researchers and paper publishers. These changes mostly cause updates amongst the existing tuples in materialized view.

Insertion-Generating changes: Insertions over new researchers those who published certain number of research papers. These changes cause only insert into paper publish table.

The insertion-generating changes are very meaningful since in many data warehousing applications the only changes to the fact tables are insertions of tuples for new dates, which leads to insertions into materialized views.

Figure 2 shows four graphs illustrating the performance advantage of using incremental materialized view maintenance method which uses auxiliary views to store intermediate results. The view maintenance time is split into two functions propogate and refresh. While computing the intermediate result the data warehouse is remain free to the user.

Figure 2 (a) and (b) plot the variation in elapsed time as the size of the change set changes(delta relation), for a fixed size 500000 records in emp_rschr relation and 250000 records in emp_paperpublish relation.

We found that the incremental materialize view maintenance using auxiliary relations wins for both types of changes, but it wins with a greater margin for the update generating changes. The refresh time is going down by 20% in figure 2(b).

19. Conclusions

We have investigated one of the significant problems of a data warehouse, that is, materialized view maintenance and how to make warehouse materialized views self maintainable without accessing the data from underlying data sources. The study shows that it is possible to make warehouse views self maintainable by materializing additional auxiliary relations, which contain intermediate results, at a data warehouse site. Using efficient incremental materialize view maintenance technique it is possible to reduce the cost of view maintenance. Proposed materialize view maintenance technique using auxiliary relation and dividing the maintenance process into two steps: propagate and refresh require less maintenance time as compared to counting algorithm. Here the propagate function works implicitly and whenever the data warehouse is ideal the refresh function integrate the data into data warehouse views. The entire maintenance process is hidden from the data warehouse users.

Figure 1.
Year 2018
12
Volume XVIII Issue III Version I Create view mngr_or_groupleader (rschr_id, deptno) as select rschr_id, deptno from emp_rschr
UNION // This view is for finding manager and group leader Create view rschr_ex_ manager_or_groupleader (rschr_id, deptno) as select reshr_id, deptno from emp_rschr where NOT EXISTS (select *from mngr_or_grouple ader where emp_rschr.id=mngr_ or_groupleader.id) //This view is for finding researcher, those are not manager or group leader. Create view rschrpaperview2010 (rschr_id, paper_id, deptno) as select emp_paperpublish.rschr_id, paper_id, deptno from rschr_ex_manager_or_group leader, emp_paperpublish where rschr_ex_ manager_or_ group leader.rschr_id=emp_paperpublish.rschr_id and year= '2010'. //This view gives the researcher those who have published paper in the year 2010. Create view rschrpaperview(rschr_id,deptno) as Select rschr_id, deptno from rschrpaper view2010 group by rschr_id having count(*)>10; // Global Journal of Computer Science and Technology (select rschr_id, deptno from emp_groupleader) ( )
1

Appendix A

Appendix A.1

Appendix B

  1. Maintaining Materialised Views in Distributed databases. A Segev , J Park . Proceedings of the IEEE International Conference on Data Engineering, (the IEEE International Conference on Data Engineering) 1989.
  2. Optimizing Cyclic Join View Maintenance over Distributed Data Sources. Bin Liu & Elke , A Rundensteiner . IEEE Transactions on Knowledge and Data Engineering March 2006. 18 (3) .
  3. Multi agent Immediate Incremental View Maintenance for Data 10. Warehouses. C H Gray , William A Yeung , Gruver . IEEE Transaction on Systems, Man & Cybermetics-Part A: Systems & Human, March 2005. 35.
  4. Efficient View Maintenance at Data Warehouses. D Agarwal , A E Abbadi , A Singh , T Yurek . Proc. ACM SIGMOD, (ACM SIGMOD) 1997. p. .
  5. Special Issue on Materialized Views and Data Warehousing. D Lomet , J Widom . IEEE Data Engineering Bulletin June 1995. 18 (2) .
  6. Database System Concepts, E Heney , Abraham Korth , Silberschatz . 1986. McGraw Hill.
  7. A performance analysis of view materialization strategies. E N Hanson . SIGMOD pages, 1987. p. .
  8. Incremental View Maintenance on Multi-Source, Gianluca Moro , Claudio Sartori . 2001. (In proceedings of IEEE)
  9. Efficient Updating Materialized Views. J A Blakeley , P A Larson , F W Tompa . Proc. ACM SIGMOD, (ACM SIGMOD) May 1986. p. .
  10. DyDa: Data Warehouse Maintenance under Fully Concurrent Environments. J Chen , X Zhang , S Chen , K Andreas , E A Rundensteiner . Proc. ACM SIGMOD Demo Session, (ACM SIGMOD Demo Session) 2001. p. 619.
  11. The Stanford Data Ware housing Project. J Hammer , H Garcia-Molina , J Widom , W Labio & Zhuge . IEEE Data Engineering Bulletin June1995.
  12. Lazy Maintenance of Materialized Views. Jingren Zhou , Perake Larson , Hicham G Elmongui . Proceddings of 33 rd International conference on VLDB, (eddings of 33 rd International conference on VLDBVienna, Austria
    ) 2007.
  13. Asymmetric Batch Incremental View Maintenance. Junyi Hao He , Jun Xie , Hai Yang , Yu . the Proceedings of the 21 st International Conference on Data Engineering, 1084-4627/05, 2005.
  14. Supporting Multiple View Maintenance Policies. L S Colby , A Kawaguchi , D F Lieuwen , I S Mumick , K A Ross . Proceeding ACM SIGMOD International Conference on Management of Data, (eeding ACM SIGMOD International Conference on Management of Data) 1977.
  15. Database Snapshots. M Adiba , & B Line\dsay . Proceedings of the sixth International Conference on Very Large Databases, (the sixth International Conference on Very Large DatabasesMontreal, Canada
    ) October 1980. p. .
  16. Avoiding re-computation: View Adaptation in Data Warehouses. M Mohnia . Proc. Of 8 th International Database Workshop, (Of 8 th International Database WorkshopHong Kong
    ) 1997. p. .
  17. Efficient View Self-Maintenance. N Hyun . Proceeding of ACM workshop on Materialized views: Techniques & Applications, (eeding of ACM workshop on Materialized views: Techniques & Applications) June 7, 1996.
  18. An Incremental Access Method for Viewcache: Concept, Algorithms and Cost Analysis. N Roussopoulos . ACM Trans. On Database Systems 1991. 16 (3) p. .
  19. Incremental Evaluation of Rules & Its Relationship to Parallesim. O Wolfson , H M Dewan , S J Stolfo , Yemini . Proceedings ACM IGMOD, International Conference on Management of Data, (ACM IGMOD, International Conference on Management of Data) 1991. p. .
  20. A framework for supporting data integration using the materialized & virtual approaches. R Hull , & G Zhou . SIGMOD Int'l Conference, June 4 -6,1996.
  21. Efficient Incremental Evaluation of Queries with Aggregation. R Ramakrishan , K A Ross , D Srivastava , S Sudarshan . International Logic Programming Symposium, 1994.
  22. Incremental View Maintenance: An Algorithmic Approach. S Abdulaziz , Almazyad & Mohammad Khubeb , Siddiqui . Internatioinal Journal of Electrical & Computer Sciences IJECS-IJENS 2009. 10 (03) .
  23. An Overview of Data Warehousing and OLAP Technology. S Chaudhuri , U Dayal . ACM SIGMOD Record, volume26, 1974. p. .
  24. Multiversion Based View Maintenance over Distributed Data Sources. S Chen , B Liu , E A Rundensteiner . ACM Trans. Database Systems (TODS) 2004. 29 (4) p. .
  25. Algorithms for Defered View Maintenance. S Latha , Timothy Colby , Leonid Griffin , Singh Libkin , Howard Mumick , Tricky . proceedings of ACM SIGMOD, (ACM SIGMOD) 1996.
  26. A Comparative Study of Materialized View Maintenance Techniques in Data Warehousing. S Solanki , Dr , Ajay Kumar . IJRIME August 2011. 1 (2) .
  27. A Comprehensive Study of Data Warehousing, S Solanki , Dr , Ajay Kumar . January 2012. (IJMR, Vol1, Issue 1)
  28. Currency based updates to distributed materialized Views. W Segev , Fang . proceedings of the IEEE International Conference on Data Engineering, (the IEEE International Conference on Data Engineering) 1990.
Notes
1
© 2018 Global JournalsIncremental Maintenance of a Materialized View in Data Warehousing: An Effective Approach
Date: 2018-01-15