OSSM: Ordered Sequence Set Mining for Maximal Length Frequent Sequences: A Hybrid Bottom-Up-Down Approach


1. Introduction

Researchers have shown great enthusiasm for the sequential pattern mining problem and its wide range of applications, such as envisaging customer buying patterns and making scientific discoveries [1,2,3,4,5], as discussed by Agrawal and Srikant [1]. Consider, for example, the time-stamped sequences of purchases made by a customer. The objective is to find ordered lists of time-stamped itemsets that recur across the data, so a sequential pattern mining algorithm must concentrate on finding repeated sequences, called frequent sequences; such sequences capture commonly occurring orderings and their frequencies. Several algorithms such as GSP [1], SPADE [3], PrefixSpan [2] and SPIRIT [4] attempt to find frequent patterns efficiently by striving to shorten candidate series and hence reduce the search space. The GSP algorithm [1] utilizes the anti-monotone property (all subsequences of a frequent sequence are also frequent).

SPADE [3] finds frequent sequences using a lattice search and an intersection-based approach. The sequence database is first converted into a vertical format, and the candidate sequences are split into groups; SPADE enumerates frequent sequences using either a breadth-first or a depth-first traversal and computes the support of each generated sequence. The approaches of these algorithms can be grouped as candidate generation followed by support evaluation. The PrefixSpan [2] algorithm instead adopts a pattern-growth method: it works on projected databases, examining prefix subsequences and projecting the corresponding postfix subsequences into the projected databases.
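To make the vertical format concrete, here is a small illustrative sketch (the toy dataset and the helper name `to_vertical` are ours, not part of SPADE's published implementation) that converts a horizontal sequence database into per-item id-lists:

```python
from collections import defaultdict

# Horizontal format: (sequence_id, [(timestamp, itemset), ...]).
horizontal_db = [
    (1, [(10, {"a", "b"}), (20, {"c"})]),
    (2, [(15, {"a"}), (25, {"b", "c"})]),
]

def to_vertical(db):
    """SPADE-style vertical id-list: item -> list of (sequence_id, timestamp) occurrences."""
    id_lists = defaultdict(list)
    for sid, events in db:
        for ts, items in events:
            for item in sorted(items):
                id_lists[item].append((sid, ts))
    return dict(id_lists)

print(to_vertical(horizontal_db))
# e.g. {'a': [(1, 10), (2, 15)], 'b': [(1, 10), (2, 25)], 'c': [(1, 20), (2, 25)]}
```

SPADE then intersects such id-lists to count longer candidate sequences without rescanning the original database.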


3. Related work

The sequential itemset mining problem was initiated by Agrawal and Srikant, who also developed a filtered algorithm, GSP [1], based on the Apriori property [1]. Since then, many sequential itemset mining algorithms have been developed to improve efficiency, among them SPADE [3], PrefixSpan [2], and SPAM [11]. SPADE [3] is based on a vertical id-list format and uses a lattice-theoretic method to decompose the search space into many small spaces, whereas PrefixSpan [2] uses a horizontal dataset representation and mines sequential itemsets with the pattern-growth paradigm: grow a prefix itemset to attain longer sequential itemsets by building and scanning its projected database. Both SPADE [3] and PrefixSpan [2] considerably outperform GSP [1]. SPAM [11] is a more recent algorithm for mining long sequential itemsets that uses a vertical bitmap representation. Reported results show that SPAM [11] is more efficient than SPADE [3] and PrefixSpan [2] in mining long itemsets, but it still consumes more space than SPADE [3] and PrefixSpan [2]. Since the introduction of frequent closed itemset mining [12], many capable frequent closed itemset mining algorithms have been proposed, such as A-Close [12], CLOSET [13], CHARM [14], and CLOSET+ [15]. Most of these algorithms maintain the already mined frequent closed itemsets to perform itemset closure checking. To decrease the memory usage and search space of this check, two algorithms, TFP [17] and CLOSET+, use a compact two-level hash-indexed result-tree structure to keep the already mined frequent closed itemset candidates. Some of the pruning and itemset closure verification methods they introduced can be extended to optimize the mining of closed sequential itemsets as well. CloSpan [16] is a newer algorithm for mining frequent closed sequences. It follows the candidate maintenance-and-test method: it first creates a set of closed sequence candidates stored in a hash-indexed result-tree structure and then performs post-pruning on it. It requires pruning techniques such as Common Prefix and Backward Sub-Itemset pruning to reduce the search space; because CloSpan [16] must maintain the set of closed sequence candidates, it consumes much memory and incurs a heavy search space for itemset closure checking when there are many frequent closed sequences. Because of this, it does not scale well with the number of frequent closed sequences.

4. Sequence, Subsequence and Frequent Sequences

A sequence is an ordered list of events [1]. A sequence S_1 is said to be a subsequence of a sequence S_2 if every event of S_1 is contained in a distinct event of S_2 and the order of events is preserved; S_2 is then a super-sequence of S_1. The sequence database S is a set of tuples (tid, s), where tid is the transaction-id and s is the sequence generated from that transaction.

The minimum support is a user-defined threshold indicating the minimum number of occurrences required for a sequence to be considered frequent.
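As a concrete illustration of the subsequence relation and the minimum-support test, the following sketch checks containment event by event; the helper names and the toy database are illustrative, not taken from the paper.

```python
def is_subsequence(sub, seq):
    """True if every event (itemset) of `sub` is contained, in order, in some event of `seq`."""
    pos = 0
    for event in sub:
        while pos < len(seq) and not set(event) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1  # later events of `sub` must match strictly later events of `seq`
    return True

def support(candidate, sequence_db):
    """Number of (tid, s) entries whose sequence s contains `candidate`."""
    return sum(1 for _tid, s in sequence_db if is_subsequence(candidate, s))

db = [(1, [{"a"}, {"b", "c"}, {"d"}]), (2, [{"a"}, {"c"}]), (3, [{"b"}, {"d"}])]
min_support = 2
cand = [{"a"}, {"c"}]
print(support(cand, db) >= min_support)  # True: sequences 1 and 2 contain <{a}{c}>
```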

A sequence s_j is maximal (lengthiest) if there is no sequence s_i in S, i ∈ {1, ..., m}, such that s_j is a proper subsequence of s_i.

We define the maximal support MS(e) of an event e over the items it consists of, and the Maximal Support Factor of a sequence s with m events as

MSF(s) = Σ_{i=1}^{m} MS(e_i)

The Maximal Support Factor is the threshold used in our proposal to minimize the subsequence search in the top-down approach.

We apply an ordered search on the sequence database: the database is sorted in descending order of the sequences' MSF. The search for super-sequences of a given sequence is then limited to sequences whose MSF is greater than or equal to that of the given sequence.
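The sketch below illustrates this ordered search under our reading of the definitions above: we assume MSF(s) is the sum of the per-event maximal supports MS(e), and we plug in a toy MS function, since the exact definition of MS is not fully specified in the recovered text.

```python
def msf(sequence, ms):
    """Maximal Support Factor: assumed here to be the sum of the maximal
    supports MS(e) of the sequence's events."""
    return sum(ms(event) for event in sequence)

def super_sequence_candidates(target, ordered_db, ms):
    """Hypothesis-1 style restriction: only sequences with MSF >= MSF(target)
    can be super-sequences, so a descending-MSF ordering lets us stop early."""
    bound = msf(target, ms)
    for tid, seq in ordered_db:       # ordered_db is sorted by MSF, descending
        if msf(seq, ms) < bound:
            break                     # every later sequence has an even smaller MSF
        yield tid, seq

# Illustrative MS: pretend each event's maximal support equals its item count.
toy_ms = lambda event: len(event)
db = [(1, [{"a", "b"}, {"c"}]), (2, [{"a"}, {"c"}]), (3, [{"a"}])]
ordered = sorted(db, key=lambda t: msf(t[1], toy_ms), reverse=True)
print(list(super_sequence_candidates([{"a"}, {"c"}], ordered, toy_ms)))
# [(1, [{'a', 'b'}, {'c'}]), (2, [{'a'}, {'c'}])]
```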

The preprocessed dataset with the sorted transaction list has several properties, hypothesized below, that can be used to cut short the search space.

OSSM Overview

T_tid is a transaction represented by transaction id tid, and s_i is an itemset that belongs to the set I of frequent itemsets.

• Build sequences by grouping items as events in a given transaction. Here we follow the top-to-bottom approach to build events: first we build events based on maximal length itemsets and continue the process in descending order of the itemset lengths.

maxOccurrence(s_c) = Σ_{i=1}^{|SM|} (1 if s_c ⊆ s_i, else 0)

Here SM is the set of sequences over which occurrences are counted (a small code sketch of this count follows the overview list below).

• Measure the maximal support MS of each event e of the given transaction T_tid (refer to Section 3 for measuring the maximal support MS of an event e of transaction T_tid).

• Measure the Maximal Support Factor MSF of each sequence S_tid (refer to Section 3 for measuring the Maximal Support Factor MSF of a sequence S_tid).

• Order the sequences in the sequence dataset S in descending order of their MSF.
• Build a weighted acyclic directed graph from the frequent itemsets of length two:
  o Elements of the itemset are considered as vertices.
  o Support of that itemset is considered as edge weight.
• Apply the WFI algorithm to find the critical path between any two items, which represents the maximal sequence between the two chosen elements.
• Apply the four properties hypothesized in Section 3 to discard or prune sequences, or to select maximal length sequences.
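The following minimal sketch shows the maxOccurrence count used by hypothesis 4, under the assumption that SM is the sequence set being scanned; the containment helper and the toy data are illustrative.

```python
def contains(sub, seq):
    """Order-preserving, itemset-wise containment of `sub` in `seq`."""
    pos = 0
    for event in sub:
        while pos < len(seq) and not event <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def max_occurrence(candidate, sm):
    """Number of sequences in SM that contain `candidate`."""
    return sum(1 for s in sm if contains(candidate, s))

sm = [[{"a"}, {"b"}], [{"a"}, {"c"}], [{"b"}]]
ms = 2                                  # minimum support threshold
cand = [{"a"}, {"b"}]
# Hypothesis 4: if the count is below ms, all supersets of `cand` are discarded too.
print(max_occurrence(cand, sm) < ms)    # True: only the first sequence contains <{a}{b}>
```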

OSSM: Top-Down Ordered Sequence Set Mining for Maximal Length Sequences

a) Concurrent Edge Prevision and Rear Edge Pruning (CEG&REP) [7] in OSSM

5. Preprocess

Dataset preprocessing and itemset database initialization are performed as the first stage of the proposal. As we find single-element itemsets, we prune in parallel those whose support is less than the required support.
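A minimal sketch of this first stage, assuming a transaction dataset of (tid, itemset) pairs; the function name and the data are illustrative.

```python
from collections import Counter

def frequent_single_itemsets(transactions, min_support):
    """Count each item's support over the transaction dataset and prune
    single-element itemsets whose support is below min_support."""
    counts = Counter(item for _tid, items in transactions for item in set(items))
    return {item: sup for item, sup in counts.items() if sup >= min_support}

transactions = [(1, {"a", "b", "c"}), (2, {"a", "c"}), (3, {"b"}), (4, {"a"})]
print(frequent_single_itemsets(transactions, min_support=2))
# supports kept: a=3, b=2, c=2 (any item below the threshold would be dropped)
```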


6. Concurrent Edge Prevision

In this phase, we take all itemsets from the given itemset database as input in parallel. Then we start projecting edges from each selected itemset to all possible elements. The first iteration includes the pruning process in parallel; from the second iteration onwards this pruning is not required, which we claim makes the process efficient compared to similar techniques such as BIDE [8]. In the first iteration, we project an itemset s_p spawned from a selected itemset s_i of DB_S and an element e_i taken from I.
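The sketch below shows our reading of the projection step: each selected itemset s_i is extended by every element e_i of I to form a projected itemset s_p. The exact bookkeeping (and the parallel pruning of the first iteration) is defined in CEG&REP [7]; the names and data here are illustrative.

```python
def project_edges(itemsets, elements):
    """Project an edge from every selected itemset s_i to every element e_i of I,
    producing a candidate extension s_p = s_i | {e_i}."""
    projections = []
    for s_i in itemsets:
        for e_i in elements:
            if e_i not in s_i:                    # no self-extension
                projections.append((s_i, e_i, s_i | {e_i}))
    return projections

frequent_1 = [frozenset({"a"}), frozenset({"b"})]
items = {"a", "b", "c"}
for s_i, e_i, s_p in project_edges(frequent_1, items):
    print(sorted(s_i), "+", e_i, "->", sorted(s_p))
```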

7. Rear Edge Pruning

If f_ts(s'_p) = f_ts(s_p) for any projected itemset, that edge will be pruned; all disjoint graphs except s_p are considered closed sequences and moved into DB_S, and those disjoint graphs are removed from memory. The process does not terminate until the graph becomes empty, i.e., until no elements connected through transitive edges and no projecting itemsets remain in memory.

8. b) Building Sequence Set from Transaction Dataset

In this section we explore the process of building the sequence dataset.

TD is the given transaction dataset and I is the set of closed frequent itemsets of length 1 to m, where m is the maximal itemset length.

• Initially, the set of closed frequent itemsets is ordered in descending order of itemset length.
• For each transaction T_tid in the given transaction dataset TD:
  o Build events based on the closed frequent itemsets of the set I such that the event lengths are decided in the descending order of frequent itemset length (see the sketch after this list).

  o Events of length m, the maximal length of the frequent itemsets, are determined first, followed by shorter ones.

As part of OSSM, we then order the sequence dataset in descending order of the Maximal Support Factor (refer to Section 3 for details).
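A hedged sketch of this construction: we greedily cover each transaction with closed frequent itemsets, longest first, so that every matched itemset becomes one event. The greedy covering strategy and the helper names are our assumptions, not spelled out in the text.

```python
def build_events(transaction_items, closed_itemsets):
    """Greedily cover a transaction's items with closed frequent itemsets,
    longest itemsets first; each matched itemset becomes one event."""
    remaining = set(transaction_items)
    events = []
    for itemset in sorted(closed_itemsets, key=len, reverse=True):
        if set(itemset) <= remaining:
            events.append(set(itemset))
            remaining -= set(itemset)
    return events

closed = [{"a", "b", "c"}, {"a", "b"}, {"d", "e"}, {"a"}, {"d"}]
print(build_events({"a", "b", "d", "e"}, closed))
# [{'a', 'b'}, {'d', 'e'}] -- the longest applicable itemsets become the events
```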

Let S be the sequence dataset determined from the given transaction dataset TD and the set of frequent itemsets I. In this phase of OSSM, we explore building a weighted acyclic directed graph. Let I_2 be the set of frequent itemsets of length 2.

Let G be a graph, initially with vertex count |V| = 0 and edge count |E| = 0, where V is the vertex set and E is the edge set. In this phase we apply the WFI algorithm [9,10]. In the first pass of the algorithm we try to identify and evaluate potentially long and rich candidates; the rich sequences are those whose constituent 2-sequences have high support. In the directed graph, the 2-sequence frequencies are represented by the edge weights, so we can easily compute the path with the highest weight between all pairs of nodes. Here we use the WFI algorithm [9,10] to find the critical path (a path with maximal vertex count) with maximal weights.
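Below is a small sketch of this pass: a Floyd-Warshall (WFI) style all-pairs recurrence with max/+ in place of min/+, which, on an acyclic graph whose edge weights are the supports of frequent 2-itemsets, yields the heaviest path between any two items. The toy vertices, edge supports, and helper names are illustrative.

```python
def heaviest_paths(vertices, edges):
    """Floyd-Warshall-style all-pairs pass with max/+ instead of min/+.
    `edges[(u, v)]` is the support of the frequent 2-itemset (u, v); the graph
    is assumed acyclic, so the recurrence yields the heaviest path weights."""
    NEG = float("-inf")
    dist = {(u, v): edges.get((u, v), NEG) for u in vertices for v in vertices}
    nxt = {(u, v): v if (u, v) in edges else None for u in vertices for v in vertices}
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[(i, k)] + dist[(k, j)] > dist[(i, j)]:
                    dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
                    nxt[(i, j)] = nxt[(i, k)]
    return dist, nxt

def path(nxt, u, v):
    """Reconstruct the item order along the heaviest path from u to v."""
    if nxt[(u, v)] is None:
        return []
    p = [u]
    while u != v:
        u = nxt[(u, v)]
        p.append(u)
    return p

V = ["a", "b", "c", "d"]
E = {("a", "b"): 4, ("b", "c"): 3, ("a", "c"): 2, ("c", "d"): 5}
dist, nxt = heaviest_paths(V, E)
print(path(nxt, "a", "d"), dist[("a", "d")])  # ['a', 'b', 'c', 'd'] 12
```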

9. f) Sequence Evolution

Each critical path generated from the graph G will be considered as a candidate sequence and stored in the candidate sequence set css.

g) Verifying Frequency of Candidate Sequence

Figure 1: Hypotheses used to prune the search space.
Hypothesis 1: Order the sequences in descending order of MSF; for each sequence, only sequences whose MSF is greater than or equal to its own need be searched.
Hypothesis 4: Let s_c be a sequence such that maxOccurrence(s_c) < ms; then discard all supersets of s_c. Since s_c is infrequent, its super-sequences are also infrequent.
Figure 2: Sequence, subsequence and event notation.
A sequence S_1 is said to be a subsequence of another sequence S_2 if and only if every event s_i of S_1 is contained in some event e_j of S_2, for i ∈ {1, 2, 3, ..., m}, with e_1 < e_2 < e_3 < ... < e_n so that the event order is preserved; here e_j is an event of S_2. S_2 is then said to be a super-sequence of S_1.
A sequence S is an ordered list of events S = <s_1 s_2 s_3 ... s_j ... s_n>, and every event s_i is an itemset: a non-empty, unordered, finite set of items, which can be represented as s_i = {e_1, e_2, e_3, ..., e_m}.
Figure 3: Notation for transactions T_tid and itemsets s_i drawn from the frequent itemset set I.
Figure 4: Building the weighted acyclic directed graph from frequent 2-itemsets.
• For each itemset i of I_2 with items s and d, build an edge ed in graph G:
  o Consider item s as the source vertex and add item s to the vertex set V; increment |V| by 1.
  o Consider item d as the destination vertex and add item d to the vertex set V; increment |V| by 1.
  o Add the edge ed to the edge set E; increment |E| by 1.
  o Add the support of the itemset as the weight of edge ed.

e) Finding Critical Paths between two Items as Maximal Length Sequences
Figure 5: Verifying the frequency of candidate sequences and finding maximal length sequences.
For each candidate sequence cs of the candidate sequence set css:
  o Find MSF(cs).
  o For each sequence s in {s ∈ S | MSF(s) ≥ MSF(cs)} (refer to hypothesis 1 in Section 3):
    • If cs ⊆ s then increment sup(cs) by 1.
  o If sup(cs) ≥ st (refer to hypothesis 2 in Section 3) then:
    • For each candidate sequence cs' of the candidate sequence set css such that cs' ≠ cs:
      • If cs' ⊆ cs then consider cs' as frequent (refer to hypothesis 3 in Section 3) and move cs' from css to fss.
  o Else:
    • For each candidate sequence cs' of the candidate sequence set css such that cs' ≠ cs:
      • If cs ⊆ cs' then consider cs' as not frequent and prune it from css (refer to hypothesis 4 in Section 3).

h) Finding Maximal Length Sequences
Let fss be the frequent sequence set generated in the previous phase.
For each frequent sequence fs of fss, with a status flag sts set to true:
  o For each frequent sequence fs' such that {fs' ∈ fss | MSF(fs') ≥ MSF(fs)} and fs ≠ fs', while sts is true:
    • If fs ⊆ fs' then set sts to false.
  o If sts remains true, add fs to the maximal length frequent sequence set mlfss.

Note: Finally, the maximal length frequent sequence set mlfss contains sequences that are not a subsequence of any other frequent sequence.
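To summarize the last step reconstructed in Figure 5, here is a compact sketch that keeps only frequent sequences that are not subsequences of any other frequent sequence; the helper names and toy data are ours.

```python
def find_maximal(frequent):
    """Return the mlfss set of Figure 5: frequent sequences that are not a
    subsequence of any other frequent sequence (itemset-wise containment)."""
    def contains(sub, seq):
        pos = 0
        for event in sub:
            while pos < len(seq) and not event <= seq[pos]:
                pos += 1
            if pos == len(seq):
                return False
            pos += 1
        return True

    maximal = []
    for fs in frequent:
        if not any(fs is not other and contains(fs, other) for other in frequent):
            maximal.append(fs)
    return maximal

fss = [[{"a"}, {"b"}], [{"a"}, {"b"}, {"c"}], [{"d"}]]
print(find_maximal(fss))  # [[{'a'}, {'b'}, {'c'}], [{'d'}]]
```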

10. Conclusion

The proposed ordered sequence set mining (OSSM) approach is scalable and optimal because of its hybrid bottom-up-down approach. OSSM is supported by our earlier Concurrent Edge Prevision and Rear Edge Pruning (CEG&REP) [7] for frequent closed itemset mining, which was shown to be memory-efficient and scalable on dense datasets. A novel mechanism for generating the sequence dataset from the transaction dataset is introduced in this paper. The proposed OSSM is capable of generating the longest candidate sequences through weighted acyclic directed graph construction, and it is efficient and scalable in finding the frequent sequence set and the maximal length frequent sequence set thanks to its top-down approach. To compute the support of a candidate sequence, it uses the maximal support factor of sequences, and ordering the sequence dataset in descending order of MSF ensures that the whole dataset is not scanned. Also, if the dataset contains long frequent sequences, they are identified early, so all their subsequences are also marked frequent and need not be evaluated. The longest possible sequences are built up by the bottom-up part of the algorithm starting from 2-sequences.

References

  1. C. Antunes, A. Oliveira. "Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints," Int'l Conf. on Machine Learning and Data Mining, 2003.
  2. Anurag Choubey, Ravindra Patel, J. L. Rana. "Concurrent Edge Prevision and Rear Edge Pruning Approach for Frequent Closed Itemset Mining," International Journal of Advanced Computer Science and Applications (IJACSA), 2(11), 2011.
  3. D. I. Lin, Z. Kedem. "Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set," Int'l Conf. on Extending Database Technology, 1998.
  4. J. Han, J. Wang, Y. Lu, P. Tzvetkov. "Mining Top-K Frequent Closed Patterns without Minimum Support," ICDM'02, Maebashi, Japan, Dec. 2002.
  5. Jianyong Wang, Jiawei Han. "BIDE: Efficient Mining of Frequent Closed Sequences," 2004.
  6. J. Pei, J. Han, W. Wang. "Constraint-Based Sequential Pattern Mining in Large Databases," CIKM'02, McLean, VA, Nov. 2002.
  7. M. Garofalakis, R. Rastogi, K. Shim. "Mining Sequential Patterns with Regular Expression Constraints," IEEE Transactions on Knowledge and Data Engineering, 14(3), 2002.
  8. M. J. Zaki. "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, 42(1-2), January-February 2001.
  9. M. Zaki. "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, 42, Kluwer Academic Publishers, 2001.
  10. M. Zaki, C. Hsiao. "CHARM: An Efficient Algorithm for Closed Itemset Mining," SDM'02, Arlington, VA, April 2002.
  11. N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal. "Discovering Frequent Closed Itemsets for Association Rules," ICDT'99, Jerusalem, Israel, Jan. 1999.
  12. P. Aloy, E. Querol, F. X. Aviles, M. J. E. Sternberg. "Automated Structure-based Prediction of Functional Sites in Proteins: Applications to Assessing the Validity of Inheriting Protein Function From Homology in Genome Annotation and to Protein Docking," Journal of Molecular Biology, 2002, p. 311.
  13. J. Pei, J. Han. "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Int'l Conf. on Data Engineering, 2001.
  14. R. Agrawal, R. Srikant. "Mining Sequential Patterns: Generalizations and Performance Improvements," Proceedings of the 5th International Conference on Extending Database Technology, 1996.
  15. R. Floyd. "Algorithm 97: Shortest Path," Communications of the ACM, 5, 1962.
  16. R. Kohavi, C. Brodley, B. Frasca, L. Mason, Z. Zheng. "KDD-Cup 2000 Organizers' Report: Peeling the Onion," SIGKDD Explorations, 2, 2000.
  17. S. Warshall. "A Theorem on Boolean Matrices," Journal of the ACM, 9, 1962.
Date: 2012-05-15