# Introduction nce any company decided to use the data mining system for daily operations, management will be concerned with the system performance for their environment. If the mining take place on the historic data then the result could be used for the future strategic decision making [1]. But the amount of data collected over time is increased in daily basis in the database then that can reduce the accuracy of the result [2] of the mining process. This is where online data mining can play a vital role to improve the mining result and its accuracy [3]. # a) Frequent Itemset Mining Approaches The tentative nature of frequent itemset mining normally results in a large number of frequent itemset generations. The increase in the number of frequent itemset generated will result in the degradation of mining efficiency. The frequent closed itemset [9] mining is the solution for the above said problem. The FCI is a non redundant representation of the set of frequent itemsets [10]. The commendable reduction in the size of the result set leads to improved performance in the speed and memory usage. Different efficient FCI algorithms [8,11,12] are proposed by different authors. We noticed that the FCIs approach could not be applied over land mark window since the number of FCIs approaches that of frequent itemsets when the window becomes very large. Author ? : Computer Science and Information Technology Department at Ravishankar Shukla University, Raipur, C.G. and working as Associate Professor in Information Technology Department in Christian College of Engineering and Technology, Bhilai, C.G, India. E-mail : pramodsnair@yahoo.com Author ? : Professor in Indian Institute of Information Tecnology-Allahabad, U.P., India. E-mail : dropvyas@gmail.com There is one another frequent item set mining technique called Frequent Maximal Item set [7]. Compare with the Frequent Closed Item set mining technique it will generate comparatively less number of item sets, due to this reason it is significantly more efficient [13] in terms of both CPU and memory. But the disadvantage of FMI mining is that it lose the frequency information of the subset of FMIs so the error bound will also increased. There are many concise representations of frequent item sets are proposed [14,15,16,17,18,19], these are significantly saving memory space, CPU and shows better accuracy. This technique could be applied in stream mining with the efficient incremental technique and batch processing. # II. # WINDOWING APPROACH TO DATA STREAM MINING One of the main issues in the stream data mining is to find out a model which will suit the extraction process of the frequent item set from the streaming in data. There are three stream data processing model [20] that are Landmark window, Damped window and Sliding window model. A transaction data stream is a sequence of incoming transactions and an excerpt of the stream is called a window. A window, W, can be either time-based or count-based, and either a landmark window or a sliding window. W is time-based if W consists of a sequence of fixed-length time units, where a variable number of transactions may arrive within each time unit. W is count-based if W is composed of a sequence of batches, where each batch consists of an equal number of transactions. W is a landmark window if W = (T 1 , T 2 , . . . , T); W is a sliding window if W = (T T?w+1 , . . . , T T ), where each T i is a time unit or a batch, T 1 and T T are the oldest and the current time unit or batch, and w is the number of time units or batches in the sliding window, depending on whether W is time-based or count-based. Note that a count-based window can also be captured by a time-based window by assuming that a uniform number of transactions arrive within each time unit. The frequency of an item set, X, in W, denoted as freq(X), is the number of transactions in W support X. The support of X in W, denoted as sup(X), is defined as freq(X)/N, where N is the total number of transactions received in W. X is a Frequent Item set (FI) in W, if sup(X) ? ?, where ? (0 ? ? ? 1) is a user-specified ( D D D D ) C minimum support threshold. X is a Frequent Maximal Item set (FMI) in W, if X is an FI in W and there exists no item set Y in W such that X ? Y . X is a Frequent Closed Item set (FCI) in W, if X is an FI in W and there exists no item set Y in W such that X ? Y and freq(X) = freq(Y). # a) Landmark Window Concept In this section we will discuss some of important land mark window algorithms. One of the algorithm proposed by Manku and Motwani [23] is a lossy counting approximation algorithm. It will compute the approximate set of frequent item sets over the entire stream so far. In this algorithm the stream is divided into sequence of buckets. The lossy counting algorithm processes a batch of transactions arriving at a particular time. In this paper they are maintaining the item set, the frequency of item set and the error as the upper bound of the frequency of the item set. This algorithm uses three different modules, Buffer, Trie and Set Gen. The Buffer module keeps filling the available memory with the incoming transactions. This module frequently computes the frequency of every item in the current transactions and prune if it is less than N. The Trie module maintains set D, as a forest of prefix trees. The Trie forest as an array of tuples (X, freq(X), err (X), level ) that correspond to the pre-order traversal of the forest, where the level of a node is the distance of the node from the root. The Trie array is maintained as a set of chunks. On updating the Trie array, a new Trie array is created and chunks from the old Trie are freed as soon as they are not required. All the item sets in the current batch having the support will be generated by the Set Gen module. The Apriori-like pruning [21] will help to avoid the generation of superset of an item set if the frequency less than ? in the current batch. The Set Gen implemented with the help of Heap queue. Set Gen repeatedly processes the smallest item in Heap to generate a 1-itemset. If this 1itemset is in Trie after the Add Entry or the Update Entry operation is utilized, Set Gen is recursively invoked with a new Heap created out of the items that follow the smallest items in the same transactions. During each call of Set Gen, qualified old item sets are copied to the new Trie array according to their orders in the old Trie array, while at the same time new item sets are added to the new Trie array in lexicographic order. When the recursive call returns, the smallest entry in Heap is removed and the recursive process continues with the next smallest item in Heap. The quality of the approximation mining results by using the relaxed minimum support threshold ? leads to the extra usage of memory and the processing power. That is, the smaller relaxed minimum support leads to increase of number of sub-FIs generated, so the increase of memory and the extra usage of processing power. , if ? approaches ?, more falsepositive answers will be included in the result, since all sub-FIs whose computed frequency is at least (? ? ? )N ? 0 are displayed while the computed frequency of the sub-FIs can be less than their actual frequency by as much as ?N. The same problem is in other mining algorithms [21,22,23,24,13,4] that use a relaxed minimum support threshold to control the accuracy of the mining result. One of the algorithm called DSM-FI developed by Li [13], is to mine an approximate set of FIs over the entire history of the stream. This algorithm is used a prefix-tree based in memory data structure. DSM-FI is also using the relaxed minimum support threshold and all the generated FIs are stored in the IsFI-forest. The DSM-FI consists of Header Table(HT) and Sub-Frequent Itemsets tree(SFI-tree). For every unique item in the set of sub-FIs it inserts an entry with frequency, batch id and head link, it increments otherwise. The DSM-FI frequently prunes the items that are not satisfied the minimum support. One of the approximation algorithm developed by Lee [4] used the compressed prefix tree structure called CP-tree. The structure of the CP-tree is described as follows. Let D be the prefix tree used in estDec. Given a merging gap threshold ?, where 0 ? ? ? 1, if all the itemsets stored in a subtree S of D satisfy the following equation, then S is compressed into a node in the CPtree. # T ( ) ( ) N T T freq X freq Y ? ? ? Where X is the root of S and Y is an item set in S. Assume S is compressed into a node v in the CPtree. The node v consists of the following four fields: item-list, parent-list, freqTmax and freqTmin where v.item-list is a list of items which are the labels of the nodes in S, v. parent-list is a list of locations (in the CPtree) of the parents of each node in S, v. freqTmax is the frequency of the root of S and freqTmin is the frequency of the right-most leaf of S. The use of the CP-tree results in the reduction of memory consumption, which is important in mining data streams. The CP-tree can also be used to mine the FIs, however, the error rate of the computed frequency of the FIs, which is estimated from freqTmin and freqTmax, will be further increased. Thus, the CP-tree is more suitable for mining FMIs. # b) Sliding Window Concept The sliding window model processes only the items in the window and maintains only the frequent item sets. The size of the sliding window can be decided according to the applications and the system resources. The recently generated transactions in the window will influence the mining result of the sliding windowing, otherwise all the items in the window to be maintained. The size of the sliding window may vary depends up on ( D D D D ) C 2012 Year the applications it may use. In this section we will discuss some of the important windowing approaches for stream mining. An in memory prefix tree based algorithm proposed by Chi [26,22] following the windowing approach to incrementally update the set of frequent closed item sets over the sliding window . The data structure used for the algorithm is called as Closed Enumeration Tree (CET) to maintain the dynamically selected set of item set over the sliding window. This algorithm will compute the exact set of frequent closed item sets over the sliding window. The updation will be for each incoming transaction but not enough to handle the handle the high speed streams. One another notable algorithm in the windowing concept is estWin [3]. This algorithm maintains the frequent item sets over a sliding window. The data structure used to maintain the item sets is prefix tree. The prefix tree holds three parameters for each items set in the tree, that are frequency of x in current window before x is inserting in the tree, that is freq(x). The second is an upper bound for the frequency of x in the current window before x is inserted in the tree, err(x). The third is the ID of the transaction being processed, tid(x).b. The item set in the tree will be pruned along with all supersets of the item set, we prune the item set X and the supersets if tid(X) ? tid 1 and freq(X) < ? ? N?, or (2) tid(X) > tid 1 and freq(X) < ? ? (N ? (tid(X) ? tid 1 ))?. The expression tid(X) > tid 1 means that X is inserted into D at some transaction that arrived within the current sliding window and hence the expression (N?(tid(X)?tid 1 ))returns the number of transactions that arrived within the current window since the arrival of the transaction having the ID tid(X). We note that X itself is not pruned if it is a 1-itemset, since estWin estimates the maximum frequency error of an itemset based on the computed frequency of its subsets [84] and thus the frequency of a 1-itemset cannot be estimated again if it is deleted. # c) Damped Window Concept In this section we will discuss some of the notable Damped window algorithms. The estDec [5] algorithm proposed to reduce the effect of the old transactions on the stream mining result. They have used a decay rate to reduce the effect of the old transactions and the resulted frequent item sets are called recent frequent Item sets. The algorithm, for maintaining recent FIs is an approximate algorithm that adopts the mechanism to estimate the frequency of the item sets. The use of a decay rate diminishes the effect of the old and obsolete information of a data stream on the mining result. However, estimating the frequency of an item set from the frequency of its subsets can produce a large error and the error may propagate all the way from the 2-subsets to the n-supersets, while the upper bound is too loose. Thus, it is difficult to formulate an error bound on the computed frequency of the resulting item sets and a large number of false-positive results will be returned, since the computed frequency of an item set may be much larger than its actual frequency. Moreover, the update for each incoming transaction (instead of a batch) may not be able to handle high-speed streams. Another approximation algorithm [6] uses a tilted time window model . In this frequency FIs are kept in different time granularities such as last one hour, last two hours, last four hours and so on. The data structure used in this algorithm is called FP-Stream. There are two components in the FP-Stream which are pattern tree based prefix tree and tilted time window which is at the end node of the path. The pattern tree can be constructed using the FP-tree algorithm [25]. The tilted time window guarantees that the granularity error is at most T/2, where T is the time units. The updation of the frequency record will be done by shifting the recent records to merge with the older records. To reduce the number of frequency records in the tilted-time windows, the old frequency records of an item set, X, are pruned as follows. Let freqj(X) be the computed frequency of X over a time unit Tj and Nj be the number of transactions received within Tj , where 1 ? j ? ? . For some m, where 1 ? m ? ?, the frequency records freq1(X), . . . ,freqm(X) are pruned if the following condition holds: ?n ? ?, ?i, 1 ? i ? n, freqi(X) < ?Ni and ?l, 1 ? l ? m ? n, ? freq j l j?1 (x)< ? N J j j=1 The FP-stream mining algorithm computes a set of sub-FIs at the relaxed minimum support threshold, ? , over each batch of incoming transactions by using the FI mining algorithm, FP-growth [25]. For each sub-FI X obtained, FP-streaming inserts X into the FP-stream if X is not in the FP-stream. If X is already in the FP-stream, then the computed frequency of X over the current batch is added to its tilted-time window. Next, pruning is performed on the tilted-time window of X and if the window becomes empty, FP-growth stops mining supersets of X by the Apriori property [2]. After all sub-FIs mined by FP-growth are updated in the FP-stream, the FP-streaming scans the FP-stream and, for each item set X visited, if X is not updated by the current batch of transactions, the most recent frequency in X's tilted-time window is recorded as 0. Pruning is then performed on X. If the tilted-time window of some item set visited is empty (as a result of pruning), the item set is also pruned from the FP-stream. The tilted-time window model allows us to answer more expressive time-sensitive queries, at the expense of some frequency record kept for each item set. The tilted-time window also places greater importance on recent data than on old data as does the # Conclusion In this paper we have discussed some of the issues of the windowing concept for the online stream mining to develop an effective, performance oriented algorithm. We also discussed some of the important windowing algorithms in the different windowing concept and reviewed, for some extend, how the existing important algorithms could handle these different issues. The further study can be done on this field to develop an effective algorithm in the data stream mining. We have discussed the way the different algorithms handle the data stream mining so that the researchers can analyze and study further for the research work. ![model; however, it does not lose the information in the historical data completely. A drawback of the approach is that the FP-stream can become very large over time and updating and scanning such a large structure may degrade the mining throughput.III.](image-2.png "") * A methodology for dynamic data mining based on fuzzy clustering FernandoCrespoa RichardWeberb Fuzzy Sets and Systems 150 2005 * Principles of Data Mining DavidHand HeikkiMannila PadhraicSmyth 2001 MIT Press Cambridge, MA * MariaHalkidi Quality assessment and Uncertainty Handling in Data Mining Process * Integrating Classification and Association Rule Mining BLiu WHsu YMa Proc. of KDD of KDD 1998 * Finding Recent Frequent Itemsets Adaptively over Online Data Streams JHChang WSLee Proc. of KDD of KDD 2003 * Mining Frequent Patterns in Data Streams at Multiple Time Granularities CGiannella JHan JPei XYan PSYu Kargupta et al.: Data Mining: Next Generation Challenges and Future Directions MIT/AAAI Press 2004 * Finding Maximal Frequent Itemsets over Online Data Streams Adaptively DLee WLee Proc. of ICDM of ICDM 2005 * Catch the Moment: Maintaining Closed Frequent Itemsets over a Data Stream Sliding Window YChi HWang PSYu RRMuntz KAIS 2006 10 * Discovering Frequent Closed Itemsets for Association Rules NPasquier YBastide RTaouil LLakhal Proc. of ICDT of ICDT 1999 * Generating Non-Redundant Association Rules MZaki Proc. of KDD of KDD 2000 * CHARM: An Efficient Algorithm for Closed Itemset Mining MZaki CJHsiao Proc. of SDM of SDM 2002 * CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets JWang JHan JPei Proc. of KDD of KDD 2003 * Efficiently Mining Maximal Frequent Itemsets KGouda MZaki Proc. of ICDM of ICDM 2001 * Mining All Nonderivable Frequent Itemsets TCalders BGoethals Proc. Of PKDD Of PKDD 2002 * Free-Sets: a Condensed Representation of Boolean Data for the Approximation of Frequency Queries JFBoulicaut ABykowski CRigotti DMKD 2003 7 * Mining Condensed Frequent-Pattern Bases JPei GDong WZou JHan In KAIS 6 5 2004 * Mining Compressed Frequent-Pattern Sets DXin JHan XYan HCheng Proc. of VLDB of VLDB 2005 * On Condensed Representations of Constrained Frequent Patterns FBonchi CLucchese KAIS 2005 9 * ?-Tolerance Closed Frequent Itemsets. To appear in JCheng YKe WNg Proc. of ICDM of ICDM 2006 * An Algorithm for In-Core Frequent Itemset Mining on Streaming Data RJin G Proc. of ICDM of ICDM 2005 * A Dynamic Data Mining Technique for Intrusion Detection Systems DLtc Bruce JCaulkins MLee Wang ACM 1-59593-059-0/05/0003 43rd ACM Southeast Conference Kennesaw, GA, USA. Copyright March 18-20, 2005. 2005 * Catch the Moment: Maintaining Closed Frequent Itemsets over a Data Stream Sliding Window YChi HWang PSYu RRMuntz KAIS 2006 10 * Discovery of Frequent Episodes in Event Sequences HMannila HToivonen AIVerkamo DMKD 1997 1 * What's Hot and What's Not: Tracking Most Frequent Items Dynamically SGraham Cormode Muthukrishnan ACM Transactions on Database Systems March 2005 * CheqingJin WeiningQian ChaofengSha JeffreyXYu Aoying Zhou; Dynamically Maintaining Frequent Items over a Data Stream; Int'l Conf. on Information and Knowledge Management 2003 * An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams; Int'l Workshop on Knowledge Discovery in Data Streams Hua-FuLi Suh-YinLee Man-KwanShan Sept. 2004