# I. Introduction

Data cleaning is a step in knowledge discovery from databases. Data cleaning, also known as data cleansing, is a phase in which noisy, anomalous and irrelevant data are removed from a collection of data. Missing data are values in the data set that are lost, not observed or not available, due to natural or non-natural reasons. Data with missing values complicate both the data analysis and the application of a resulting solution to new data. Three main problems arise when dealing with incomplete data. First, there is a loss of information and, as a consequence, a loss of efficiency. Second, there are several complications related to data handling, computation and analysis, due to the irregularities in data structure and the impossibility of using standard software. Third, and most important, there may be bias due to systematic differences between observed and unobserved data. Dealing with missing data is therefore a major task in data cleaning. Noor et al. [1] introduced three types of mean imputation techniques for missing data. Rubin [7] explored inference with missing data and multiple imputation for non-response in surveys. Allison [8] investigated the estimation of linear models with incomplete data. Smyth [9] and Zhang [10] considered data preparation to be a fundamental stage of data analysis. This research therefore focuses on anomalous and missing data values, and we propose a novel method to replace the missing values.


# II. Missing Data Methods

There are several methods for treating missing data. They can be divided into three categories, as proposed in [7].


# a) Ignoring and discarding data

There are two main ways to discard data with missing values. The first is known as listwise deletion. It consists of discarding all instances with missing data. The second is known as pairwise deletion. It consists of discarding instances or attributes; before deleting any attribute, it is necessary to evaluate its relevance to the analysis.
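The two deletion strategies can be illustrated with a minimal sketch. It assumes that missing values are encoded as `None` and that pairwise deletion is applied in its common statistical sense, keeping the instances that are complete for the pair of attributes under analysis. The function names and the sample rows are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]

def listwise_delete(rows: Sequence[Row]) -> List[List[float]]:
    """Keep only the instances that have no missing value at all."""
    return [list(r) for r in rows if all(v is not None for v in r)]

def pairwise_complete(rows: Sequence[Row], i: int, j: int) -> List[Tuple[float, float]]:
    """Keep the instances that are complete for attributes i and j only
    (assumption: the usual available-case reading of pairwise deletion)."""
    return [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]

# Hypothetical data: three instances, three attributes.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, 8.0, None],
]

print(listwise_delete(rows))          # [[1.0, 2.0, 3.0]]
print(pairwise_complete(rows, 0, 2))  # [(1.0, 3.0), (4.0, 6.0)]
```

Listwise deletion keeps one instance out of three here, while the pairwise variant keeps two for the chosen pair, which shows why the relevance of each attribute should be evaluated before anything is discarded.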


# b) Parameter estimation

In this missing data treatment method, maximum likelihood procedures that use variants of the Expectation-Maximization algorithm handle parameter estimation in the presence of missing data.


# c) Imputation

Imputation is a class of procedures that aims to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values [9].

Mean Above Below Method [1]: This method replaces all missing values with the mean of the data value above the missing value and one data value below the missing value.

Mean Above Method [1]: This method replaces all missing values with the mean of all available data above the missing values.

Mean Method [1]: This method replaces all missing values with the mean of all available data.
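The three mean imputation variants can be sketched as follows. This is an illustrative reading of the descriptions above, not the implementation of Noor et al. [1]: it assumes a single time-ordered series with `None` marking a missing value, and it assumes that Mean Above Below uses the nearest observed value on each side. All function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional

Series = List[Optional[float]]

def mean_method(xs: Series) -> Series:
    """Mean Method: fill every gap with the mean of all available data."""
    m = mean(v for v in xs if v is not None)
    return [m if v is None else v for v in xs]

def mean_above(xs: Series) -> Series:
    """Mean Above: fill a gap with the mean of all available data above it.
    A gap with nothing observed above it is left as None."""
    out: Series = []
    for i, v in enumerate(xs):
        if v is None:
            above = [u for u in xs[:i] if u is not None]
            out.append(mean(above) if above else None)
        else:
            out.append(v)
    return out

def mean_above_below(xs: Series) -> Series:
    """Mean Above Below: fill a gap with the mean of the nearest observed
    value above and the nearest observed value below (assumption)."""
    out: Series = []
    for i, v in enumerate(xs):
        if v is not None:
            out.append(v)
            continue
        above = next((u for u in reversed(xs[:i]) if u is not None), None)
        below = next((u for u in xs[i + 1:] if u is not None), None)
        ok = above is not None and below is not None
        out.append((above + below) / 2 if ok else None)
    return out

xs = [10.0, 20.0, None, 40.0, 60.0]
print(mean_method(xs)[2])       # 32.5
print(mean_above(xs)[2])        # 15.0
print(mean_above_below(xs)[2])  # 30.0
```

On this small series the three variants give clearly different fills for the same gap, which is the kind of difference the comparison in the results section measures.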

# III. Proposed Method for Inference of Missing Attributes Value in Data Mining

The proposed methodology works in two stages. The first stage localizes the missing data and removes anomalous data; the second stage substitutes an estimated value in place of each missing value using the proposed method. This calculation gives an effective result and decreases the bias of the result.

As shown in figure 3, the missing value case is identified by the subscript of the attribute and denoted by the variable $x_i$. After locating the missing value case, we record the three upper values ($x_{i-1}$, $x_{i-2}$, $x_{i-3}$) and the three lower values ($x_{i+1}$, $x_{i+2}$, $x_{i+3}$) around the missing value subscript. The anomalous value in this subset is then detected using the percentage change formula. After computing the percentage change of the subset, we determine the outlier range, whose value is chosen to suit the values in the array. If an anomalous value is detected in the data set, that value is removed.
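The outlier check on the window around a missing value can be sketched as below. The paper does not spell out the exact rule, so this is one possible reading under stated assumptions: the percentage change is taken as `(current - previous) / previous * 100`, each value is compared with the last value that was not flagged, the first value of the window is assumed to be normal, all values are nonzero, and `limit` stands in for the outlier range chosen by the analyst. All names are hypothetical.

```python
from typing import List, Optional

def window_around(xs: List[Optional[float]], i: int, k: int = 3) -> List[float]:
    """Collect up to k observed values above and k below the missing index i."""
    upper = xs[max(0, i - k):i]
    lower = xs[i + 1:i + 1 + k]
    return [v for v in upper + lower if v is not None]

def flag_anomalies(values: List[float], limit: float) -> List[int]:
    """Return positions whose percentage change from the last accepted
    value exceeds `limit` percent in absolute terms (assumed rule)."""
    flagged: List[int] = []
    prev = values[0]  # assumption: the first value is not anomalous
    for k in range(1, len(values)):
        change = (values[k] - prev) / prev * 100.0
        if abs(change) > limit:
            flagged.append(k)
        else:
            prev = values[k]
    return flagged

xs = [100.0, 102.0, 500.0, None, 104.0, 106.0, 108.0]
window = window_around(xs, 3)
bad = flag_anomalies(window, limit=50.0)
clean = [v for k, v in enumerate(window) if k not in bad]

print(window)  # [100.0, 102.0, 500.0, 104.0, 106.0, 108.0]
print(bad)     # [2]
print(clean)   # [100.0, 102.0, 104.0, 106.0, 108.0]
```

Comparing against the last accepted value, rather than the immediate neighbour, keeps a single spike from also flagging the normal value that follows it.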

The workflow of the proposed method is shown in figure 2. If there are missing values in the raw data set, a small subset (array) containing the missing value is created from the input data sheet. This subset is then checked for anomalous values; if an anomalous value is present, it is replaced with a newly calculated value. The last step is the estimation of missing values using the Euclidean distance, which is calculated between $X_a$, the centroid of the array, and $Y_a$, a particular value of the array. Finally, we compute the average of the Euclidean distances and add it to the centroid; the sum is the estimated value of the missing value. The estimated value $X_{est}$ is computed separately for every missing value in the complete datasheet.
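The estimation step can be sketched in a few lines. This is a minimal illustration that assumes the subset is one-dimensional and already free of outliers, so the Euclidean distance between the centroid and a value reduces to the square root of their squared difference. The function name `estimate_missing` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean
from typing import List

def estimate_missing(subset: List[float]) -> float:
    """Estimate a missing value as centroid + average Euclidean distance."""
    centroid = mean(subset)  # centroid of the subset, X_a
    # Distance between the centroid and each particular value Y_a.
    distances = [sqrt((centroid - y) ** 2) for y in subset]
    return centroid + mean(distances)

subset = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0]
print(estimate_missing(subset))  # 108.0
```

Here the centroid is 105 and the average distance is 3, so the estimate is 108. Because the average distance is never negative, the estimate under this reading always lies at or above the centroid of the subset.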


# IV. Results and Discussion

Our experiments were carried out on time series datasets taken from the Earthpolicy.org site. Figure 4 compares the mean obtained by each method for the million-barrels attribute of the U.S. Motor Gasoline Consumption dataset, which covers the years 1950-2014. The mean U.S. motor gasoline consumption is 2714 million barrels. For all variables, 20% of the values were deliberately removed at random, and in this dataset the value of outliers is greater than 5. The mean calculated from the incomplete data set is 2379, which is lower than the actual mean. After the proposed methodology is applied to fill in the missing values, the mean is 2714. The mean obtained after replacing missing values with the proposed method is therefore close to the actual mean. The results of the proposed method were compared with the mean imputation techniques proposed by Noor et al. [1], namely mean above below (MAB), mean above (MA) and mean imputation (MI), and the analysis shows that the values substituted by the proposed method are closer to the original values than those of the other methods.

Figures 5 and 6 compare the standard deviation and the coefficient of variation of all methods. The proposed method performed significantly better than all other methods. The data sheets were imported into SPSS, and the necessary tests for data validation and significance were applied. In SPSS the results were checked using the ANOVA test on the data sheet; the significance value is 99.1%, which shows that the result is efficient and more compatible with the original data.
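The summary statistics used for the comparison can be computed as in the sketch below. It assumes the sample standard deviation (the paper does not state whether the sample or population form was used) and expresses the coefficient of variation as a percentage of the mean. The function name `summarize` and the sample values are hypothetical and are not the paper's data.

```python
from statistics import mean, stdev
from typing import List, Tuple

def summarize(xs: List[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, coefficient of variation in %)."""
    m = mean(xs)
    s = stdev(xs)  # assumption: sample standard deviation
    return m, s, s / m * 100.0

m, s, cv = summarize([10.0, 20.0, 30.0])
print(m, s, cv)  # 20.0 10.0 50.0
```

Running `summarize` on the original series and on each imputed series gives the three figures that the comparison relies on; the imputation whose figures are closest to those of the original series distorts the data least.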


# V. Conclusion

This work focuses on imputing missing values with the proposed methodology for numerical attributes in time series data sheets. The method is suitable for handling missing data in the presence of anomalous data. In this work, the proposed method proved more reliable than the other mean imputation techniques for data analysis in the data mining field.
![Figure 1: Types of mean imputation method](image-2.png "Figure 1 :")

Mean Imputation Method: This technique replaces the missing data for a given feature with the mean of all known values of that attribute in the class to which the instance with the missing attribute belongs. The mean of each attribute that contains missing values is calculated and substituted in place of the missing values, so every missing value of an attribute receives the same calculated mean.

![Figure 2: Design flow of proposed methodology (Section III, Proposed Method for Inference of Missing Attributes Value in Data Mining)](image-3.png "Figure 2 :")

![Figure 3: Calculation of percentage change for outlier detection](image-4.png "Figure 3 :")

b) Calculation for missing values: Missing values are estimated in the last phase, once the data are free of outliers. To fill the missing values in the array, we first calculate the centroid of the subset, which is the mean of the subset. The Euclidean distance is then calculated between the centroid of the data and each value of the array.

![](image-5.png "")

The datasheets used in the proposed work, taken from the Earthpolicy.org site, include Hydroelectric Generation in India 1965-2013, Average Global Temperature 1880-2014, U.S. Motor Gasoline Consumption 1950-2014, World Wood Production 1961-2011 and a few more. Here we evaluate U.S. Motor Gasoline Consumption 1950-2014, which contains 50 instances and two attributes.


Figure: U.S. Motor Gasoline Consumption, 1950-2014 (axis labels: value, consumption, cv, methods)

Global Journal of Computer Science and Technology, Volume XVI, Issue V, Version I, Year 2016. © 2016 Global Journals Inc. (US)
		
		
# References

* [1] M. N. Noor, A. S. Yahaya, N. A. Ramli, A. M. M. Bakri. Mean imputation techniques for filling the missing observations in air pollution dataset. Key Engineering Materials, 2014.
* [2] H. Y. Cho, J. H. Oh, K. O. Kim, J. S. Shim. Outlier detection and missing data filling methods for coastal water temperature data. Journal of Coastal Research, 65, 2013.
* [3] J. R. Porter, R. E. Cossman, W. L. James. Imputing large group averages for missing data, 4. using rural-urban continuum codes for density driven industry sectors. Journal of Population Research, 26(3), 2009.
* [4] J. W. Graham. Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 2009.
* [5] V. Barnett, T. Lewis. Outliers in Statistical Data. John Wiley and Sons, New York, 1994.
* [6] C. Genolini, H. Jacqmin-Gadda. Copy mean: a new method to impute intermittent missing values in longitudinal studies. Open Journal of Statistics, 3, 2013.
* [7] D. B. Rubin. Inference and missing data. Biometrika, 63(3), 1976.
* [8] P. D. Allison. Estimation of linear models with incomplete data. Sociological Methodology, 1987.
* [9] P. Smyth. Data mining at the interface of computer science and statistics. Data Mining for Scientific and Engineering Applications, 35-61. Springer US, 2001.
* [10] S. Zhang, C. Zhang, Q. Yang. Data preparation for data mining. Applied Artificial Intelligence, 17(5-6), 2003.
* [11] L. Chen, M. Toma-Drane, R. F. Valois, J. W. Drane. Multiple imputation for missing ordinal data. Journal of Modern Applied Statistical Methods, 4(1), 26, 2005.
* [12] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, Z. Xu. Missing value estimation for mixed-attribute data sets. IEEE Transactions on Knowledge and Data, 2011.