# I. Introduction

Classification rules can be represented as shown below. Consider the following schema describing an insurance company's customers:

Insuranceinfo(age: integer, cartype: string, highrisk: boolean)

An example rule: if age is between 16 and 25 and cartype is either sports or truck, then the risk is high. Consider the following information about the insurance customers:

| Age | Cartype | Highrisk |
|-----|---------|----------|
| 23  | Sedan   | False    |
| 30  | Sports  | False    |
| 36  | Sedan   | False    |
| 25  | Truck   | True     |
| 30  | Sedan   | False    |
| 23  | Truck   | True     |
| 30  | Truck   | False    |
| 25  | Sports  | True     |
| 18  | Sedan   | False    |

Trees that represent classification rules are called classification trees or decision trees.

Data uncertainty arises naturally in many applications for various reasons. We briefly discuss three categories here: measurement errors, data staleness, and repeated measurements.

a) Measurement errors: Data obtained from measurements by physical devices are often imprecise due to measurement errors.

b) Data staleness: In some applications, data values change continuously, so the recorded information is always somewhat out of date.

c) Repeated measurements: Perhaps the most common source of uncertainty comes from repeated measurements. For example, a patient's body temperature could be taken multiple times during a day.

# Type-1 Probabilistic Relations

Type-1 uncertainty refers to the confidence that a tuple belongs to a relation at all. Consider a table representing part of a personal address book. It is not very likely that the address book contains the phone number of the Dutch Queen, whereas it is very likely that it contains the phone number of a fellow student, Ruud van Kessel.

# Type-2 Probabilistic Relations

With Type-2 uncertainty, the value of the key attribute is deterministic, but the values of other attributes in the relation may be uncertain. A tuple such as the one for King Carl XVI Gustaf represents this kind of uncertainty: based on such a list, it is not possible to tell with complete certainty in which village they live.

If the probability that a join pair meets the join condition exceeds a threshold, the pair is included in the result; otherwise it is not. This threshold can either be user specified or a system parameter. Queries that return the tuple pairs whose probabilities exceed a certain threshold are called Probabilistic Threshold Join Queries (PTJQ). We focus on threshold joins and develop various techniques for efficient (in terms of I/O and CPU cost) algorithms for PTJQ over both Type-1 and Type-2 probabilistic relations. In particular, we develop three pruning techniques:

1) item-level pruning, where two uncertain values are pruned without evaluating the probability;

2) page-level pruning, where two pages are pruned without probing into the data stored in each page;

3) index-level pruning, where all the data stored under a subtree is pruned.

There are two useful types of join operations specific to uncertain attributes: value join (v-join) and distribution join (d-join). V-join is a natural extension of the join operation on deterministic data. The PDF (Probability Density Function) can be used to derive the range of values of an attribute that carries attribute uncertainty. The PDF is also used to compute the probability that uncertain tuples from different relations match while performing a join. Each join pair is associated with a probability to indicate the likelihood that the two tuples are matched. We use the term Probabilistic Join Queries (PJQ) for such joins. For join conditions over uncertain data, the result is generally not boolean, but probabilistic.
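To make the threshold semantics concrete, the following is a minimal sketch of a probabilistic threshold join, assuming uniform PDFs over the uncertainty intervals and an equality join evaluated up to a resolution `c`. The function names (`match_probability`, `threshold_join`) are illustrative, and Monte Carlo estimation stands in for the exact integral; this is not the paper's actual implementation.

```python
import numpy as np

def match_probability(a_lo, a_hi, b_lo, b_hi, c, n=10_000):
    """Estimate P(|X - Y| <= c) for X ~ U(a_lo, a_hi) and Y ~ U(b_lo, b_hi)
    by Monte Carlo sampling; uniform PDFs are assumed for illustration."""
    rng = np.random.default_rng(0)
    x = rng.uniform(a_lo, a_hi, n)
    y = rng.uniform(b_lo, b_hi, n)
    return float(np.mean(np.abs(x - y) <= c))

def threshold_join(R, S, c, threshold):
    """Naive PTJQ: emit (r_id, s_id, p) for every pair whose match
    probability meets the threshold. R and S are lists of (id, lo, hi)
    uncertainty intervals."""
    result = []
    for rid, r_lo, r_hi in R:
        for sid, s_lo, s_hi in S:
            # item-level pruning: intervals separated by more than the
            # resolution c can never match, so skip the probability work
            if r_hi + c < s_lo or s_hi + c < r_lo:
                continue
            p = match_probability(r_lo, r_hi, s_lo, s_hi, c)
            if p >= threshold:
                result.append((rid, sid, p))
    return result

R = [(1, 20.0, 30.0)]
S = [(2, 28.0, 35.0), (3, 50.0, 60.0)]
print(threshold_join(R, S, c=1.0, threshold=0.05))  # tuple 3 is pruned outright
```

Raising the threshold shrinks the result set, which is exactly the effect the pruning techniques above exploit: a pair that can be shown to fall below the threshold never needs its probability evaluated.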
# II. Related Work

Models for managing uncertain data have been proposed in moving-object environments and in sensor networks. Recently, the Trio system has been proposed to handle such uncertainty. Another representation of data uncertainty is the "probabilistic database", in which each tuple is associated with a probability value indicating the confidence of its presence. Probabilistic databases have also recently been extended to semistructured data and XML. Probabilistic queries are classified as value-based (returning a single value) and entity-based (returning a set of objects); probabilistic join queries belong to the entity-based class. Evaluation algorithms for aggregate value queries and nearest-neighbor queries have been presented, but to the best of our knowledge, probabilistic join queries have not been addressed before. Moreover, those works did not focus on the efficiency issues of probabilistic queries, and the works that do examine query efficiency limit their discussion to range queries.

There is a rich vein of work on interval joins, which are usually used to handle temporal and one-dimensional spatial data. Different efficient algorithms have been proposed, such as nested-loop join, partition-based join, and index-based join. Recently, the idea of implementing interval joins on top of a relational database has also been explored. None of these algorithms utilizes the probability distributions within the interval bounds during the pruning process, and thus they potentially retrieve many false candidates. We demonstrate how our ideas can easily be applied to enhance these existing interval join techniques.

# III. Implementation

The INLJ (Indexed Nested Loop Join) algorithm can improve I/O performance by organizing the pages in a tree structure. Let R and S denote the two relations being joined, and assume that R has fewer tuples than S. If neither join input has an index on the joining attribute, the indexed nested loop join algorithm first builds an index on the smaller input R. The index is built by extracting the key-pointer information for each tuple; the key-pointer information is then spatially sorted based on the MBR. We can develop the efficient query join processing technique with the following sequence of operations:

a) Data refinement. Take any real-world data that may contain uncertainty. Clean the data, i.e., remove the data that is unnecessary for our purpose, and represent the most appropriate data.

b) Formulating range values using the PDF (Probability Density Function). A PDF summarizes how probabilities are distributed among the events that can arise from a series of trials. Using the PDF, we can replace uncertain values with ranges.

c) Similarity matching between uncertain tuples, using probabilistic join queries. Calculate the probability of joining two uncertain tuples. Each join pair is associated with a probability to indicate the likelihood that the two tuples are matched.

d) Removing uncertainty by using INLJ. Although uncertainty tables can be used to improve the performance of page-based joins, they do not improve I/O performance, simply because the pages still have to be loaded in order to read the uncertainty tables. In INLJ we can instead use an interval index. Conceptually, each tree node still has an uncertainty table, but each uncertainty interval in a tree node becomes a Minimum Bounding Rectangle (MBR) that encloses all the uncertainty intervals stored under that node. Page-level pruning now operates on MBRs instead of individual uncertainty intervals. A simplified sketch of this join appears after step e) below.

e) Construct the decision tree for query processing. The splitting attribute is chosen with an attribute selection measure (information gain, gain ratio, or Gini index); the attribute with the highest value of the measure is selected for splitting. In this way the output can be represented in decision tree form by classifying the result into different classes. A sketch of the gain computation also follows below.
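The following is a minimal sketch of the join in step d). A sorted array with binary search stands in for the interval index (a real implementation would organize MBRs in a tree and prune whole subtrees), and the probability function, e.g. the `match_probability` helper from the sketch in the Introduction, is passed in as a parameter. All names are illustrative assumptions, not the paper's implementation.

```python
import bisect
from typing import Callable, List, Tuple

Interval = Tuple[int, float, float]  # (id, lo, hi)

def build_index(R: List[Interval]) -> List[Interval]:
    """Stand-in for the interval index: sort R's uncertainty
    intervals by their lower endpoint."""
    return sorted(R, key=lambda t: t[1])

def inlj(R: List[Interval], S: List[Interval], c: float, threshold: float,
         match_probability: Callable[..., float]) -> list:
    """Indexed nested loop join for PTJQ: index the smaller input R,
    probe with each tuple of S, and skip index entries whose intervals
    cannot intersect the probe interval expanded by the resolution c."""
    index = build_index(R)
    lows = [t[1] for t in index]
    out = []
    for sid, s_lo, s_hi in S:
        # entries whose lower bound exceeds s_hi + c can never match,
        # so binary search cuts them off without touching them
        stop = bisect.bisect_right(lows, s_hi + c)
        for rid, r_lo, r_hi in index[:stop]:
            if r_hi + c < s_lo:  # interval ends before the probe begins
                continue
            p = match_probability(r_lo, r_hi, s_lo, s_hi, c)
            if p >= threshold:
                out.append((rid, sid, p))
    return out
```

For step e), the sketch below computes information gain over the insurance table from the Introduction and picks the splitting attribute. Treating age as a categorical attribute is a deliberate simplification: C4.5 would use gain ratio or binary threshold splits for numeric attributes, precisely because raw gain favors many-valued attributes.

```python
import math
from collections import Counter

# The insurance customer records from the Introduction: (age, cartype, highrisk).
ROWS = [
    (23, "Sedan", False), (30, "Sports", False), (36, "Sedan", False),
    (25, "Truck", True),  (30, "Sedan", False),  (23, "Truck", True),
    (30, "Truck", False), (25, "Sports", True),  (18, "Sedan", False),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())

def information_gain(rows, attr):
    """Entropy reduction obtained by splitting rows on attribute attr."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# The attribute with the highest gain becomes the root split.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(["age", "cartype"])}
print(max(gains, key=gains.get), gains)
```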
# IV. Performance

The I/O problem noted in step d) can be alleviated by index-level pruning, i.e., organizing the intervals with an index. The first graph shows that both INLJ and U-INLJ achieve a much better Npair (number of candidate pairs) than BNLJ and U-BNLJ; it compares the page-level join algorithms with the index-level join algorithms (INLJ and U-INLJ). With the index-level join algorithms, as the threshold increases, the number of output candidate pairs is reduced, so the tables are joined on only the most similar tuples. This leads to high performance in the results.

Next, the execution times of the R-tree based join algorithm and the INLJ (Indexed Nested Loop Join) algorithm are compared. In that graph, the horizontal axis specifies the size of the dataset and the vertical axis specifies the execution time in seconds. For all dataset sizes, the execution time of INLJ is better than that of the R-tree based join algorithm. We therefore prefer the indexed nested loop join algorithm as the probability threshold joining algorithm for removing uncertainty when joining multiple tables whose joining attributes have uncertain values; the join is efficient and its result is close to the exact result.

# V. Conclusion

Uncertainty management has become a growing topic in data mining in recent times. In this paper we identified the problem of maintaining uncertain attributes in database relations. We suggested a method for improving the join processing, in terms of I/O cost, of relations that contain uncertain attributes, and we proposed an implementation of INLJ that handles uncertain values more capably than earlier approaches to uncertainty handling.