# Introduction

Researchers have shown strong interest in sequential pattern mining problems and in the wide range of possible applications, such as envisaging customer buying patterns and scientific discoveries [1,2,3,4,5], as discussed by Agrawal and Srikant [1]. Consider, for example, the time-stamped sequences of purchases made by a customer. The objective is to find sequences of time-stamped itemsets that recur across customers, so a sequential pattern mining algorithm should concentrate on finding the repeated sequences, which are called frequent sequences. Such sequences record how often common orderings occur. Several heuristics, such as GSP [1], SPADE [3], PrefixSpan [2] and SPIRIT [4], attempt to find the frequent patterns efficiently by striving to cut short the candidate series and hence decrease the search space. The GSP algorithm [1] utilizes the anti-monotone property (all subsequences of a frequent sequence are also frequent). SPADE [3] finds frequent sequences using a lattice search and an intersection-based approach: the sequence database is converted into a vertical format and the candidate sequences are divided into groups. SPADE enumerates the frequent sequences by either a breadth-first or a depth-first method, and the support is computed for the produced sequences. The approach of these three algorithms can be characterized as candidate generation with support evaluation. The PrefixSpan [2] algorithm adopts a pattern-growth method and uses projected databases to accomplish it. PrefixSpan (Prefix-projected Sequential pattern mining) examines the prefix subsequences and projects only the corresponding postfix subsequences into the projected databases.

# II. Related Work

The sequential itemset mining problem was introduced by Agrawal and Srikant, who also developed a filtered algorithm, GSP [1], based on the Apriori property [1]. Since then, many sequential itemset mining algorithms have been developed to improve efficiency, among them SPADE [3], PrefixSpan [2] and SPAM [11]. SPADE [3] is based on a vertical id-list format and uses a lattice-theoretic method to decompose the search space into many small spaces, whereas PrefixSpan [2] adopts a horizontal-format dataset representation and mines the sequential itemsets with the pattern-growth paradigm: grow a prefix itemset to attain longer sequential itemsets by building and scanning its projected database. Both SPADE [3] and PrefixSpan [2] clearly outperform GSP [1]. SPAM [11] is a more recent algorithm for mining long sequential itemsets and implements a vertical bitmap representation. Its experiments reveal that SPAM [11] is more efficient in mining long itemsets than SPADE [3] and PrefixSpan [2], but it still consumes more space than SPADE [3] and PrefixSpan [2]. Since the introduction of frequent closed itemset mining [12], many capable frequent closed itemset mining algorithms have been proposed, such as A-Close [12], CLOSET [13], CHARM [14] and CLOSET+ [15]. Most of these algorithms maintain the already mined frequent closed itemsets in order to perform itemset closure checking. To decrease the memory usage and search space for itemset closure checking, two algorithms, TFP [17] and CLOSET+ [15], adopt a compact two-level hash-indexed result-tree structure to keep the already mined frequent closed itemset candidates.
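As a concrete reference point for the itemset closure checking that these algorithms optimize, the brute-force check below marks a frequent itemset as closed when no proper superset has the same support; this naive scan is the kind of cost that the compact hash-indexed result-tree structures of TFP [17] and CLOSET+ [15] aim to reduce. The function names and toy data are illustrative only, not taken from any cited implementation.

```python
# Minimal sketch of itemset closure checking: a frequent itemset is closed
# if no proper superset has the same support. Identifiers are illustrative;
# this is not the result-tree structure used by TFP or CLOSET+.

def support(itemset, transactions):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def closed_itemsets(frequent, transactions):
    """Keep only itemsets with no proper superset of equal support."""
    closed = []
    for x in frequent:
        sup_x = support(x, transactions)
        if not any(x < y and support(y, transactions) == sup_x for y in frequent):
            closed.append(x)
    return closed

transactions = [frozenset("abc"), frozenset("ab"), frozenset("bc")]
frequent = [frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("bc")]
print(closed_itemsets(frequent, transactions))
# frozenset('a') is dropped: it always occurs together with 'b',
# so its closure is the itemset 'ab' with the same support.
```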
Some of the pruning methods and itemset closure checking methods introduced there can also be extended to optimize the mining of closed sequential itemsets. CloSpan is a newer algorithm for mining frequent closed sequences [16]. It follows the candidate maintenance-and-test approach: it first creates a set of closed sequence candidates stored in a hash-indexed result-tree structure and then performs post-pruning on it. It requires pruning techniques such as Common Prefix and Backward Sub-Itemset pruning to prune the search space. Because CloSpan [16] must maintain the set of closed sequence candidates, it consumes much memory and incurs a large search space for itemset closure checking when there are many frequent closed sequences; consequently, it does not scale well with the number of frequent closed sequences.

# III. Sequence, Subsequence and Frequent Sequences

A sequence is an ordered list of events [1]. A sequence $S_1 = \langle s_1, s_2, \ldots, s_m \rangle$ is said to be a subsequence of another sequence $S_2 = \langle e_1, e_2, \ldots, e_n \rangle$ if and only if there exist events $e_{j_1}, e_{j_2}, \ldots, e_{j_m}$ of $S_2$ with $j_1 < j_2 < \cdots < j_m$ such that $s_i \subseteq e_{j_i}$ for $i \in \{1, 2, \ldots, m\}$. The sequence database $S$ is a set of tuples $(tid, s)$, where $tid$ is the transaction-id and $s$ is the sequence generated from that transaction. The minimum support is a user-defined threshold indicating the minimum number of occurrences required for a sequence to be claimed frequent. A sequence $s_j$ is maximal (lengthiest) if there is no sequence $s_i \in \{s_i \in S \mid i = 1, \ldots, m\}$ of which $s_j$ is a proper subsequence. For an event $e$ consisting of items we measure its maximal support $MS(e)$, and the Maximal Support Factor of a sequence $s = \langle e_1, \ldots, e_m \rangle$ is $MSF(s) = \sum_{i=1}^{m} MS(e_i)$. The Maximal Support Factor is the threshold used in our proposal to minimize the subsequence search in the top-to-bottom approach. We apply an ordered search on the sequence database: the sequence database is ordered in descending order of the MSF of its sequences, and the search for a supersequence of a given sequence is then limited to the sequences whose MSF is greater than or equal to that of the given sequence. The preprocessed dataset with the sorted transaction list has several properties that can be used to cut short the search space, which are hypothesized below.

# IV. OSSM Overview

$T_{tid}$ is a transaction represented by its transaction id $tid$, and $s_I$ is an itemset that belongs to the set $I$ of frequent itemsets. OSSM proceeds as follows:

- Build sequences by grouping items as events in a given transaction. Here we follow the top-to-bottom approach to build events: first we build events based on the maximal-length itemsets and continue the process in descending order of the itemset lengths. The occurrence count of a candidate sequence $s_c$ is measured as $\text{maxOccurrence}(s_c) = \sum_{i=1}^{|SM|} \mathbb{1}(s_c \sqsubseteq s_i)$, where each term is 1 if $s_c$ is a subsequence of $s_i$ and 0 otherwise; here $SM$ is the set of sequences.
- Measure the maximal support of each event $e$ of the given transaction $T_{tid}$ (refer to Section III for measuring the maximal support $MS$ of an event $e$ of transaction $T_{tid}$).
- Measure the Maximal Support Factor $MSF$ of each sequence $S_{tid}$ (refer to Section III for measuring the Maximal Support Factor $MSF$ of sequence $S_{tid}$).
- Order the sequences in the sequence dataset $S$ in descending order of their MSF.
- Build a weighted acyclic directed graph from the frequent itemsets of length two (see the sketch after this list):
  - the elements of the itemset are considered as vertices;
  - the support of that itemset is considered as the edge weight.
- Apply the WFI algorithm to find the critical path between any two items, which represents the maximal sequence between the two chosen elements.
- Apply all four properties hypothesized in Section III to discard or prune sequences, or to select maximal-length sequences.
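As a rough illustration of the ordering and graph-construction steps listed above, the sketch below assumes the reconstructed definition of MSF (the sum of the maximal supports of a sequence's events), takes the per-event maximal support values as already computed, and treats the direction of each weighted edge as given; every identifier here is an assumption for illustration rather than the paper's code.

```python
# Sketch of two OSSM overview steps: ordering the sequence set by MSF and
# building the weighted digraph from frequent 2-itemsets. Assumes
# MSF(s) = sum of MS(e) over the events of s, with MS values supplied.
from collections import defaultdict

def msf(sequence, ms):
    """sequence: list of events (frozensets); ms: dict event -> maximal support."""
    return sum(ms[event] for event in sequence)

def order_by_msf(sequences, ms):
    """Return the sequences in descending order of Maximal Support Factor."""
    return sorted(sequences, key=lambda s: msf(s, ms), reverse=True)

def build_graph(frequent_2_itemsets):
    """frequent_2_itemsets: dict {(a, b): support}, direction assumed given.
    Vertices are items; the edge weight is the support of the 2-itemset."""
    graph = defaultdict(dict)
    for (a, b), sup in frequent_2_itemsets.items():
        graph[a][b] = sup                    # directed edge a -> b
    return graph

# Toy usage with assumed maximal supports and 2-itemset supports.
supports = {frozenset("a"): 2, frozenset("bc"): 3}
seqs = [[frozenset("bc")], [frozenset("a"), frozenset("bc")]]
print([msf(s, supports) for s in order_by_msf(seqs, supports)])   # [5, 3]
print(dict(build_graph({("a", "b"): 4, ("b", "c"): 3})))
```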
# V. OSSM: Top-Down Ordered Sequence Set Mining for Maximal Length Sequences

# a) Concurrent Edge Prevision and Rear Edge Pruning (CEG&REP) [7] in OSSM

# i. Preprocess

Dataset preprocessing and itemset database initialization are performed as the first stage of the proposal. As we find the itemsets with a single element, we prune in parallel those single-element itemsets whose support is less than the required support.

# ii. Concurrent Edge Prevision

In this phase, we select all itemsets from the given itemset database as input in parallel. We then start projecting edges from each selected itemset to all possible elements. The first iteration includes the pruning process in parallel; from the second iteration onwards this pruning is not required, which we claim makes the process more efficient than similar techniques such as BIDE [8]. In the first iteration, we project an itemset $s_p$ spawned from a selected itemset $s_i$ of $S_{DB}$ and an element $e_i$ taken from $I$.

# iii. Rear Edge Pruning

If, for any projected itemset, $f_{ts}(s'_p) \ge f_{ts}(s_p)$, that edge is pruned; all disjoint graphs except $s_p$ are considered closed sequences and moved into $S_{DB}$, and all disjoint graphs are removed from memory. The above process does not terminate until the graph becomes empty, i.e., as long as elements connected through transitive edges and projected itemsets remain in memory.

# b) Building Sequence Set from Transaction Dataset

In this section we explore the process of building the sequence dataset. $TD$ is the given transaction dataset and $I$ is the set of closed frequent itemsets of length 1 to $m$, where $m$ is the maximal length of an itemset.

- Initially, the set of closed frequent itemsets is ordered in descending order of itemset length.
- For each transaction $T_{tid}$ in the given transaction dataset $TD$:
  - build events based on the closed frequent itemsets of the set $I$, such that the event lengths are decided in descending order of the frequent itemset lengths;
  - initially, events of length $m$, the maximal length of the frequent itemsets, are determined, and the process then continues with the shorter itemsets.

As a part of OSSM, we order the sequence dataset in descending order of the Maximal Support Factor (refer to Section III for details). Let $S$ be the sequence dataset determined from the given transaction dataset $TD$ and the set of frequent itemsets $I$.

In this phase of OSSM, we explore the building of a weighted acyclic directed graph. Let $I_2$ be the set of frequent itemsets of length 2, and let $G$ be the graph, initially with vertex count $|V| = 0$ and edge count $|E| = 0$, where $V$ is the vertex set and $E$ is the edge set.

In this phase we apply the WFI algorithm [9,10]. In the first pass of the algorithm we try to identify and evaluate potentially long and rich candidates; the rich sequences are the ones whose constituent 2-sequences have high support. In the directed graph, the 2-sequence frequencies are represented by the edge weights, so we can easily compute the path with the highest weight between all pairs of nodes. Here we use the WFI algorithm [9,10] to find the critical path (a path with maximal vertex count) with maximal weight, as sketched below.
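The WFI algorithm cited here is commonly identified with the Floyd–Warshall all-pairs procedure, so the critical-path step can be sketched as an all-pairs maximum-weight path computation over the acyclic 2-itemset graph, with next-hop tracking so that the heaviest path between two chosen items can be read back as a candidate sequence. This is only an illustration under that assumption; the adjacency-dict input format and the function names are not taken from the paper.

```python
# Sketch of the WFI (Floyd-Warshall-style) pass on the weighted acyclic
# digraph: computes maximum-weight paths between all pairs of items and
# reads back the critical path between two chosen items.
NEG_INF = float("-inf")

def longest_paths(graph):
    """graph: dict {u: {v: weight}} of an acyclic digraph.
    Returns (dist, nxt): dist[u][v] is the maximum path weight from u to v,
    nxt[u][v] is the next hop on that path."""
    nodes = sorted(set(graph) | {v for adj in graph.values() for v in adj})
    dist = {u: {v: (0 if u == v else NEG_INF) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}
    for u, adj in graph.items():
        for v, w in adj.items():
            dist[u][v], nxt[u][v] = w, v
    for k in nodes:                        # relax through intermediate node k
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] > dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def extract_path(nxt, src, dst):
    """Read back the critical path src -> ... -> dst as a list of items."""
    if nxt[src][dst] is None:
        return []
    path, node = [src], src
    while node != dst:
        node = nxt[node][dst]
        path.append(node)
    return path

# Toy graph built from frequent 2-itemsets (edge weight = support).
g = {"a": {"b": 4, "c": 1}, "b": {"c": 3}}
dist, nxt = longest_paths(g)
print(extract_path(nxt, "a", "c"), dist["a"]["c"])   # ['a', 'b', 'c'] 7
```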
# f) Sequence Evolution

Each critical path generated from the graph $G$ is considered a candidate sequence and stored in the candidate sequence set $css$.

# g) Verifying Frequency of Candidate Sequence

[Figure: search-space pruning hypotheses. The sequences are ordered in descending manner of MSF and, for each sequence, supersequences are searched only among sequences whose MSF is greater than or equal to its own. Hypothesis 4: let $s_c$ be a sequence such that $\text{maxOccurrence}(s_c) < ms$; then discard all supersequences of $s_c$, since if $s_c$ is infrequent its supersequences are also infrequent.]
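Read together, Section III and this verification step suggest the following procedure: count the maxOccurrence of a candidate only against data sequences whose MSF is at least the candidate's own (the ordered-search property of Section III) and, following Hypothesis 4, skip every supersequence of a candidate that has already proved infrequent. The sketch below illustrates this reading; the function names, the shortest-first candidate order and the `msf_of` callback are assumptions rather than the authors' implementation.

```python
# Sketch of candidate frequency verification combining the MSF-ordered
# search of Section III with Hypothesis 4 pruning; ms is the minimum support.

def is_subsequence(cand, seq):
    """cand, seq: lists of frozensets (events); ordered containment check."""
    i = 0
    for event in seq:
        if i < len(cand) and cand[i] <= event:
            i += 1
    return i == len(cand)

def max_occurrence(cand, ordered_seqs, cand_msf, msf_of):
    """ordered_seqs is sorted by MSF descending; per the ordered-search
    property, only sequences with MSF >= the candidate's MSF are checked."""
    count = 0
    for seq in ordered_seqs:
        if msf_of(seq) < cand_msf:
            break                          # ordered-search cut-off
        if is_subsequence(cand, seq):
            count += 1
    return count

def verify_candidates(candidates, ordered_seqs, ms, msf_of):
    """Return the frequent candidates; by Hypothesis 4, supersequences of an
    infrequent candidate are skipped (candidates assumed shortest-first)."""
    frequent, infrequent = [], []
    for cand in candidates:
        if any(is_subsequence(bad, cand) for bad in infrequent):
            continue                       # pruned by Hypothesis 4
        if max_occurrence(cand, ordered_seqs, msf_of(cand), msf_of) >= ms:
            frequent.append(cand)
        else:
            infrequent.append(cand)
    return frequent
```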