# I. Introduction

Data mining is one of the most dynamic emerging research areas in today's database technology and Artificial Intelligence research; its main aim is to discover valuable patterns for users from large collections of data. Within the data mining field, mining association rules from transaction databases is one of the important research techniques. The original problem addressed by association rule mining was to find correlations among sales of different items from the analysis of a large set of supermarket data. Today, association rule mining research is motivated by an extensive range of application areas, such as banking, manufacturing, health care, medicine, and telecommunications. Two key issues need to be addressed when applying association analysis. The first is that discovering patterns from a large dataset can be computationally expensive, so efficient algorithms are needed. The second is that some of the discovered patterns are potentially spurious because they may arise simply by chance; hence, some evaluation criteria are required.

Agrawal and Srikant (1994) proposed the Apriori algorithm to solve the problem of mining frequent itemsets. Apriori uses a candidate generation method, in which the frequent k-itemsets found in one iteration are used to construct the candidate (k+1)-itemsets for the next iteration. Apriori terminates when no new candidate itemsets can be generated; it is a multipass algorithm. Unlike Apriori, the FP-growth method proposed by Han et al. (2000) uses an FP-tree to store the frequency information of the transaction database. Without candidate generation, FP-growth uses a recursive divide-and-conquer method and a database projection approach to find the frequent itemsets. However, the recursive mining process may decrease mining performance and raise the memory requirement.
Many algorithms have been proposed to optimize the performance of Apriori-like algorithms; most of them are reviewed in Section 2. This paper presents an efficient and improved frequent pattern algorithm for mining association rules in large datasets. It is a two-pass algorithm. The remainder of the paper is organized as follows: Section 2 briefly describes the Apriori algorithm and related research on association rules. Section 3 provides definitions for the mining method and the detailed steps of the proposed algorithm for mining frequent itemsets. An illustration is demonstrated in Section 4. Section 5 discusses the design of the experiment and the performance analysis; finally, Section 6 offers conclusions.

# II. Background

The earliest data mining technique for association rule mining is the support-confidence framework established by Agrawal et al. [AIS 93], which focused on the discovery of large itemsets. The most time-consuming part of an association rule algorithm is discovering the large itemsets, while the generation of association rules from the given large itemsets is straightforward. Some well-known methods and notions based on this framework are used throughout this paper. This section presents the formal statement of association rule mining, a description of the Apriori algorithm, and a review of related research.

# a) Formal statement of the problem

The following is a formal statement of association rule mining for transactional databases. Let I = {i_1, i_2, i_3, ..., i_n} represent a set of n distinct data items. A set of items is called an itemset, and an itemset with k items is denoted a k-itemset. Database D is a set of transactions, where the i-th transaction T_i denotes a set of items such that T_i ⊆ I. |D| is the total number of transactions in D, and |T_i| is the number of distinct items in transaction T_i. Each transaction is associated with a unique identifier, termed the TID. An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅.
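To make the definitions above concrete, here is a minimal Python sketch; the item universe, the three transactions, and the helper name `is_valid_rule` are illustrative assumptions, not data or code from the paper:

```python
# Illustrative item universe I and transaction database D (TID -> T_i).
I = {"a", "b", "c", "d"}
D = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"b", "d"},
}

def is_valid_rule(X, Y):
    """An association rule X => Y requires X, Y subsets of I and X, Y disjoint."""
    return X <= I and Y <= I and len(X & Y) == 0

print(is_valid_rule({"a"}, {"c"}))   # True
print(is_valid_rule({"a"}, {"a"}))   # False: X and Y must be disjoint
```

Python's built-in `set` type maps directly onto the itemset notation: `<=` is the subset test and `&` is intersection.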
There are two measures of quality for each rule: the support of itemset X ∪ Y and the confidence of rule X ⇒ Y. First, the support of itemset X ∪ Y is the ratio (denoted s%) of the number of transactions that contain X ∪ Y to |D|. Next, the confidence of rule X ⇒ Y is the ratio (denoted c%) of the number of transactions containing X ∪ Y to the number of transactions that contain X in database D. The problem of mining association rules from database D can be processed in two important steps: (1) locate all frequent itemsets whose supports are not less than the user-defined minimum support threshold s, where s ∈ (0, 1); and (2) obtain association rules directly from these frequent itemsets with confidences not less than the user-defined minimum confidence threshold c. The most time-consuming part of mining association rules is discovering the frequent itemsets.

# b) Review of the Apriori algorithm

Conventional Apriori-like methods identify the set of all frequent itemsets level-wise, building them from combinations of smaller frequent itemsets. At the k-th level, the Apriori algorithm identifies all frequent k-itemsets, denoted L_k. C_k is the set of candidate k-itemsets obtained from L_{k-1}, i.e., the suspected frequent k-itemsets. For each transaction in D, the candidate k-itemsets in C_k contained within the transaction are determined, and their support is increased by 1/|D|. After scanning and contrasting against the entire database D, the candidate k-itemsets whose supports are greater than or equal to the user-defined minimum support threshold s become frequent k-itemsets. At the end of level k, all frequent itemsets of length k or less have been discovered. During execution, numerous candidate itemsets are generated from single itemsets, level by level, and each candidate itemset must be contrasted against the entire database while searching for frequent itemsets.
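The level-wise process just described can be sketched compactly in Python. This is a generic illustration of classical Apriori, not the paper's implementation; the toy database and the minimum-support value 0.5 are assumptions for the example:

```python
# Toy database D (TID -> itemset) and threshold; both invented for illustration.
D = {1: {"a", "b", "c"}, 2: {"a", "c"}, 3: {"a", "d"}, 4: {"b", "c"}}
min_sup = 0.5  # user-defined minimum support threshold s

def support(itemset):
    """Support: fraction of transactions in D that contain the itemset."""
    return sum(1 for T in D.values() if itemset <= T) / len(D)

def apriori():
    # Level 1: frequent 1-itemsets L_1.
    items = {i for T in D.values() for i in T}
    L = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    frequent = set(L)
    k = 2
    while L:
        # Join step: candidate k-itemsets C_k from pairs of (k-1)-itemsets.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Count step: keep candidates meeting the minimum support.
        L = {c for c in C if support(c) >= min_sup}
        frequent |= L
        k += 1
    return frequent

# Frequent itemsets here: {a}, {b}, {c}, {a,c}, {b,c}.
print(sorted(sorted(s) for s in apriori()))
```

Note how every candidate at every level triggers a full pass over `D` inside `support` — exactly the repeated-scan cost the next paragraph criticizes.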
However, performance is significantly affected because the database is repeatedly read to contrast each candidate itemset with all transaction records of the database.

# c) Related research on association rules

In 1995, Savasere et al. proposed the Partition algorithm to improve the efficiency of the Apriori algorithm; it does so by reducing the number of database scans, although considerable time is still wasted scanning infrequent candidate itemsets [3]. In 1996, Park et al. proposed an efficient and fast algorithm called DHP (direct hashing and pruning) for the initial candidate set generation. This method efficiently controls the number of candidate 2-itemsets and prunes the size of the database [4]. In 1999, Han et al.

# III. Proposed algorithm

The improvements in the proposed algorithm concentrate mainly on (1) reducing the number of frequent itemsets examined and (2) reducing storage space as well as computing time. For large datasets, such as Wal-Mart datasets, the proposed algorithm is very useful for reducing the number of frequent patterns and for reducing the database size in every subsequent pass. For example, in the improved algorithm, the number of occurrences of frequent k-itemsets is computed when the k-itemsets are generated from the (k-1)-itemsets. If k is greater than the size of the database D, there is no need to scan the database generated from the (k-1)-itemsets; according to the Apriori property, it can be removed automatically.

Transposition of the database: The relation between the original and transposed representations of a database is defined in Table 1. First, the given transaction database D is transposed into database D_T; the number of items and the number of transaction strings for each item are counted, and the item numbers are sorted. An Apriori-like algorithm is then applied, which first calculates the frequent transactions from the candidate set CT_1. This step removes infrequent transactions and their item details.
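The transposition step and the first pass over D_T can be sketched as follows. The helper names (`transpose`, `candidate_1_transactions`), the toy data, and the minimum count of 2 are assumptions for illustration, not the paper's code:

```python
from collections import defaultdict

def transpose(D):
    """Transpose D (TID -> itemset) into D_T (item -> set of TIDs)."""
    D_T = defaultdict(set)
    for tid, items in D.items():
        for item in items:
            D_T[item].add(tid)
    return dict(D_T)

def candidate_1_transactions(D_T):
    """CT_1: for each TID, count how many items' TID-lists contain it."""
    counts = defaultdict(int)
    for tids in D_T.values():
        for tid in tids:
            counts[tid] += 1
    return dict(counts)

# Toy database: three transactions over items a-d (invented data).
D = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"b", "c", "d"}}
D_T = transpose(D)
CT_1 = candidate_1_transactions(D_T)
# With a minimum count of 2, every TID here qualifies as frequent (LT_1).
LT_1 = {tid: c for tid, c in CT_1.items() if c >= 2}
print(LT_1)
```

After this pass, any TID falling below the minimum count can be dropped from every item's TID-list, shrinking D_T for the subsequent passes.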
For the subsequent passes, Apriori-gen is applied to find the subsequent frequent transactions.

Lemma 1: All subsets of a frequent transaction set must also be frequent. Equivalently, all supersets of an infrequent transaction set must also be infrequent.

The improved algorithmic steps are described below:

1. First, the function apriori-gen(LT_{k-1}) is called to generate the candidate k-transaction set from the frequent (k-1)-transaction sets.
3. The number of occurrences of frequent k-transaction sets is computed when the k-transaction sets are generated from the (k-1)-transaction sets. If k is greater than the size of database D_T, there is no need to scan the database D_T generated from the (k-1)-transaction sets; based on Lemma 1, it can be deleted.
4. If the size of database D_T is greater than or equal to k, the function subset(CT_k, d_t) is called, which computes frequent patterns using a subsequent iterative level-wise approach based on candidate generation (i.e., all members of CT_1 whose counts are greater than or equal to the minimum support).

The main advantages of the proposed algorithm for frequent pattern discovery are that it reduces the size of the database after the second pass and saves storage space and computing time. In outline:

    LT_1 = {large 1-transaction sets};
    for (k = 2; LT_{k-1} ≠ ∅; k++) do
    begin
        CT_k = apriori-gen(LT_{k-1});
        ...
    end

# IV. Performance evaluation

The following example shows the processing steps of the proposed algorithm.

1. In the first iteration of the improved algorithm, every transaction ID is a member of the set of candidate 1-transaction sets, CT_1. The proposed method scans all the itemsets in D_T and counts the number of occurrences of each itemset.
2. Compute the support counts against the minimum support. The user-defined minimum support s is 20%, so the required support count is 2. Based on the minimum support, we can determine the set of frequent 1-transaction IDs (LT_1): all the candidate 1-transaction IDs that satisfy the user-defined minimum support s.
3.
Generate all candidate transaction sets of size 2, CT_2, from LT_1 and count their supports. The algorithm generates the candidate transaction sets CT_2 from the large transaction sets of size 1, LT_1, computes the number of occurrences of each transaction set by scanning the database D_T, and accumulates the total number of sub-transaction IDs with their support counts.
Next, combine the large transaction sets of size 2, LT_2, with LT_2 to determine CT_3. Based on Lemma 1, we can determine which candidate transaction ID sets in CT_3 cannot possibly be frequent and prune them from CT_3. This saves the time that would otherwise be spent counting occurrences of transaction IDs unnecessarily during the subsequent scan of D_T2 when finding LT_3.
6. Compare the support counts of the candidate transaction IDs with the minimum support. The modified database D_T2 is scanned to compute LT_3; that is, the large transaction IDs of size 3, LT_3, are determined by counting the occurrences of each candidate transaction ID set in CT_3 against the minimum support s.
7. Repeat steps 4 to 6 until no more candidate transaction IDs are generated; the algorithm then terminates, having found all of the frequent transaction IDs. It also creates the modified databases D_T3, D_T4, etc., based on the size of the transaction IDs.

The following is a walk-through of the proposed algorithm on an example.

Pass 1:
L_1 = {T2:6, T3:6, T4:4, T5:8, T6:6, T7:8, T8:4}

Pass 2: Generate candidates for k = 2.
C_2 = {(T2,T3), (T2,T4), (T2,T5), (T2,T6), (T2,T7), (T2,T8), (T3,T4), (T3,T5), (T3,T6), (T3,T7), (T3,T8), (T4,T5), (T4,T6), (T4,T7), (T4,T8), (T5,T6), (T5,T7), (T5,T8), (T6,T7), (T6,T8), (T7,T8)}

C_4 = {(T3,T5,T6,T7): 1}
L_4 = ∅

# V. Experimental results

To evaluate the efficiency and effectiveness of the improved algorithm, we performed an extensive study of two algorithms, the Apriori-like algorithm and the improved algorithm, on both real and synthetic data sets over different ranges.
All the experiments were run on a Pentium IV machine with 1 GB of RAM, running Microsoft Windows 7. The two algorithms, Apriori and the improved algorithm, were implemented in Java 2.0. We also used a real medical database with 2280 itemsets and 4200 elements. The running time comparison between the improved algorithm and the Apriori algorithm is shown in Figure 1, with minimum support ranging from 1% to 5%. The importance of the improved algorithm is that it reduces the number of items in each scan and also reduces the size of the original dataset. Three aspects make this algorithm better than the original one.

Figure 1: Running time of the Apriori and improved algorithms

Firstly, when the candidates are produced, instead of dealing with all the items of the previous large set, only the elements having the same transaction IDs are joined. When generating frequent patterns, this may reduce the computing time dramatically, and the size of the database is reduced. Secondly, by pruning, the number of elements in the candidate sets is decreased once more, based on the modified database. Finally, computing time and storage space are saved.

# VI. Conclusion

In this research paper, an improved algorithm for mining frequent patterns based on an Apriori-like algorithm has been proposed. Its main advantages are that it reduces the number of scans by using the transposed database D_T, and reduces redundancy when generating sub-transaction sets and verifying them in the database. To discover frequent patterns in massive datasets with more columns than rows, a complete framework for the transposition has been presented, including the itemsets in the transposed database corresponding to the transposition of classical transactions. The classical Apriori algorithm has also been compared with the improved algorithm.
Experimental results using synthetic data show that the proposed algorithm consistently outperforms the Apriori algorithm. Hence, the proposed algorithm is well suited to massive datasets.

Table 1: Notations

| Notation | Description |
|---|---|
| D | Given database |
| D_T | Transposed database |
| CT | Candidate transaction IDs |
| CT_1 | Candidate transaction IDs of size 1 |
| LT_1 | Large transaction IDs of size 1 |
| CT_{k-1} | Candidate transaction IDs of size k-1 |
| LT_{k-1} | Large transaction IDs of size k-1 |
| s | Minimum support |
| c | Minimum confidence |
| Count | Frequency |

Global Journal of Computer Science and Technology, Volume XII, Issue XIII, Version I. © 2012 Global Journals Inc. (US)

## Acknowledgements

The authors express their sincere gratitude to Dr. Omar Sayed Al-Mushayt, College Dean, and Dr. Saeed Q Al-Khalidi, Vice Dean, College of Computer Science and Information Systems, JAZAN University, Kingdom of Saudi Arabia, for their noble and continuous encouragement to complete this research. Special thanks also go to the University President, JAZAN University, Kingdom of Saudi Arabia, for inspiration and persistent support, direct or indirect, toward the completion of this research.
# VII. References

1. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, USA, 1994.
2. A. Barabasi and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, 1999.
3. R. J. Bayardo, "Efficiently mining long patterns from databases," in SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, USA, 1998.
4. S. Brin, "Dynamic itemset counting and implication rules for market basket data," in SIGMOD '97: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, USA, 1997.
5. C. Cooper, "The age specific degree distribution of web-graphs," Combinatorics, Probability and Computing, vol. 15, no. 5, 2006.
6. J. Han, "Mining frequent patterns without candidate generation," in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, USA, 2000.
7. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series.