# I. Introduction

Marketers are always striving to develop effective marketing campaigns by maximising the benefits they can achieve from the data they hold in their databases. The results of different campaigns can then be assessed by comparing their return on investment. Whatever the nature and aims of the campaigns, they always start with some form of analysis to gain customer insight, and the results of this analysis lead to segmentation and targeting of potential customers.

As companies move through the cycle from customer acquisition to customer retention, the ability to analyse the data they hold on their customers becomes increasingly important. However, this task is often hampered by the fact that the vast majority of databases have missing information, sometimes called gaps or holes. This may be for historical reasons, where the emphasis was on customer acquisition, or it may be due to changes in the needs of the organisation over time. Whatever the cause, missing data are a very common and serious problem: it is not uncommon for collected lifestyle data to have 30-40% of their values missing. This is a real problem when the data are to be analysed; if the data are not there, they cannot be analysed either at record level or for the database overall. These missing values mean that the database is not as large or as rich in information as may be assumed from its overall size (number of fields and number of records). Thus, the effect of the missing data is to limit the amount and quality of new information that can be learnt from the database.

With respect to customer retention programmes and CRM, missing customer data will have a serious adverse impact on the outcome of the campaigns because they will be based on incomplete data and therefore weak analysis. The consequence is that the segmentation and targeting will be less accurate than they would be if the database were fully populated.

The problem of missing data has also heavily affected the lifestyle data sellers who are trying to present a more complete picture of the UK adult population. Since less data are now acquired directly from individuals about their demographics as consumers than formerly, there is a greater reliance on modelled data. Unfortunately, these modelled data are often obtained from incomplete data. This compounds the problem of missing data because any bias in the available data manifests itself, and is probably increased, in the modelled data. Thus, a vicious circle of partial data being used to obtain more partial data is started, and after this process has been repeated many times it is likely that the final data bear little resemblance to the true UK population.

Since the problem of missing values in both customer and lifestyle databases is widespread and getting worse, methods that give more accurate estimates of the missing values than currently available methods have an important contribution to make to improving marketing effectiveness. The desired result of all these methods is a fully populated database with all the missing values replaced by estimates of their true values, a process known as (missing value) imputation. The imputed values should be 'good' and plausible estimates of the true values so that the statistical properties of the fully and partially populated databases are as similar as possible.
# II. Missing Data and Empty Data

It may be thought that all missing data can and should be imputed. This is not always the case, because some data may be missing because they cannot have real values; such values must remain missing. It is therefore important to understand the difference between 'missing' data and 'empty' data. A value that is not known but which has a real value is a 'missing' value. A value that is not known and which cannot have a real value is an 'empty' value. Therefore, missing data can be imputed but empty data cannot.

# III. Imputation Methods

A number of imputation methods are available, and some of the factors that help determine a suitable method are:

- Is the method only suitable for small databases or can it be used on small and large databases?

# b) Case Deletion

Case deletion avoids rather than solves the problem of missing values because it ignores all the incomplete records. Very often it is the default method of handling missing data. Although very easy to implement, it has two immediate and severe disadvantages: firstly, a very large proportion of the records may be ignored; and secondly, the remaining records may not be representative of the population. The commercial and financial implications of this bias in the data that are actually analysed are easy to imagine.

# c) Mean Substitution and Cold Deck Imputation

Mean substitution and cold deck imputation are two frequently used imputation methods. Mean substitution involves replacing all the missing values in each field by the field's mean or mode as appropriate, and in cold deck imputation the missing values are replaced by external constants, one for each field. These methods are easy and quick to implement, but being global methods they are very unlikely to maintain the statistical properties of the database. In the case of mean substitution, the mean (mode) values of the fields in the partially and fully populated databases are, by definition, the same, but the variation of each field in the fully populated database is much smaller than the corresponding variation in the partially populated database. The result is that the records are not as clearly differentiated as they should be, and so it is harder to understand how people's individual characteristics determine their actions and behaviour from a database which has been fully populated in either of these ways.

Another major problem with mean substitution and cold deck imputation is that unrealistic or even impossible values can easily be imputed, because the value imputed in any one field is the mean of the known values in that field. If a database contains people across a wide range of age, income and lifestyle attributes, and the data can be segmented into a finite number of homogeneous clusters with high inter-cluster heterogeneity, the mean value of any field across all clusters has no meaning or significance for any single cluster or for all the clusters. Therefore, using values imputed in this way as the basis for marketing campaigns and other commercial activities may not yield the desired outcomes, because the targeting and segmentation are based on poor quality data.
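As a minimal sketch of mean (or mode) substitution, the following Python fragment fills gaps in a small table held as a list of dictionaries. The function and field names are illustrative, not part of any particular package; they simply make the global nature of the method visible: every record with a gap in a field receives the same value.

```python
from statistics import mean, mode

def impute_by_mean_or_mode(records, numeric_fields, text_fields):
    """Replace None values by the field mean (numeric) or mode (text).

    A global method: every record with a gap in a field receives the
    same imputed value, so the field's variance shrinks after imputation.
    """
    filled = {}
    for f in numeric_fields:
        known = [r[f] for r in records if r[f] is not None]
        filled[f] = mean(known) if known else None
    for f in text_fields:
        known = [r[f] for r in records if r[f] is not None]
        filled[f] = mode(known) if known else None
    return [{f: (v if v is not None else filled.get(f)) for f, v in r.items()}
            for r in records]

# Toy example; age is numeric and res_stat is text (field names are illustrative).
people = [{"age": 34, "res_stat": "P"},
          {"age": None, "res_stat": "T"},
          {"age": 52, "res_stat": None}]
print(impute_by_mean_or_mode(people, ["age"], ["res_stat"]))
```

In cold deck imputation the dictionary of fill values would simply be supplied from an external source instead of being computed from the field means and modes.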
# d) Hot Deck Imputation

A slightly more advanced method of imputation is hot deck imputation. This is similar to cold deck imputation except that the missing values are replaced by values from 'similar' records in the database. These similar records are obtained by clustering the complete records and then assigning a cluster to each incomplete record. The missing values in each incomplete record are then replaced by values calculated from its associated cluster. Like mean substitution and cold deck imputation, hot deck imputation is a global method.

# e) Regression

In regression imputation the missing values are replaced by values calculated from a regression equation, for example

y = a + b x1 + c x2        (1)

where y is the variable to be imputed, x1 and x2 are other variables, and a, b and c are known constants. Implicit in using (1) is that the values of the variables on its right-hand side (x1 and x2) are known in the records whose values of y are to be imputed. This problem can be overcome by developing the models only from the complete records, but this raises a fundamental problem: what happens if the complete records are either a small proportion of the database or a distinct group within it, rather than a fair reflection of the database as a whole? On the other hand, regression imputation is a local method, because the missing values in each record are calculated from the data in that record; this is a significant advantage. Notwithstanding this advantage, regression imputation has a number of practical and theoretical problems, including:

- since a regression equation must be developed for each variable with missing values, regression imputation is very time consuming, especially in large databases and in databases many of whose fields have missing values;
- working out the equations may be difficult, not least because the correlations between the variables may be weak;
- different relationships may exist for different homogeneous groups in the database, so trying to find one relationship across all groups will yield an unsatisfactory compromise that is not an accurate portrayal of the relationship in any one group; the single equation will predict values that do not reflect any individual's unique characteristics, and so the same problems as those associated with mean substitution, namely reduction in the heterogeneity of the database, may arise;
- the relationships between the variables are artificially and falsely inflated because the missing values are estimated by substituting into the regression equations.
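To make equation (1) concrete, here is a minimal sketch of regression imputation that estimates a, b and c by least squares from the complete records only and then substitutes each incomplete record's known x1 and x2. The use of numpy and the toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def regression_impute(y, x1, x2):
    """Impute missing y (np.nan) from y = a + b*x1 + c*x2, with a, b and c
    estimated by least squares on the complete records only."""
    y, x1, x2 = map(np.asarray, (y, x1, x2))
    complete = ~np.isnan(y)                       # records whose y is known
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(X[complete], y[complete], rcond=None)
    y_filled = y.copy()
    y_filled[~complete] = X[~complete] @ coef     # substitute known x1 and x2
    return y_filled

# Toy data: the third y is missing but its x1 and x2 are known.
y  = [10.0, 14.0, np.nan, 18.0]
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.0, 1.0, 1.0, 0.0]
print(regression_impute(y, x1, x2))
```

In a real database a separate equation of this kind would have to be developed for every field with missing values, which is the source of the run-time and modelling problems listed above.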
# f) EM Algorithm and Structural Models

These methods are very advanced and demanding in terms of the time and expertise required, and they are not amenable for use on large databases.

# g) k-Nearest Neighbours

k-nearest neighbours is a data mining method used in estimation and classification problems. Unlike many other methods used in statistical data analysis and modelling, it does not require a model to be developed for each field. Rather, it is based on the simple concept that the (statistical) similarity between two records is calculated from the multivariate distance between them. If two records are similar, i.e. their corresponding fields have similar values, they will be close to one another when their common known data are plotted, and so the distance between them will be 'small'. These records are more similar to one another than are other records with larger distances between them. This geometric way of finding the most similar records explains why the method is called k-nearest neighbours. Thus, the method involves mapping all the data into multi-dimensional space, where each dimension is a variable, and then calculating the distances between pairs of records.

The method works by first finding a pool of donors, i.e. complete records, for each recipient, i.e. incomplete record. It then uses the donors' values of each field that is missing in the recipient to impute the missing data in the recipient. There are three stages to the method. For each incomplete record:

1. Search the entire database for similar complete records, using the values in the selected fields of the incomplete record.
2. Rank the complete records by distance to the incomplete record.
3. Use the specified number of complete records in the ranked set to impute the missing values in the incomplete record.

By using all the known data in each recipient to search for its most similar donors, and then using these donors to impute the recipient's missing values, the statistical properties of the partially populated database are maintained. This process of searching for similar donors and then using them to impute the missing values is repeated for each recipient. This recipient-by-recipient approach means that each recipient has its own donors, i.e. Nearest Neighbours (NNs), from which its missing values are imputed. It is this feature of k-nearest neighbours that helps maintain the heterogeneity of the database. This very localised approach to imputation is in marked contrast to global methods, where the imputation is based on groups of recipients and each group has the same donors and therefore the same imputed values. Thus, the variation of the variables in a database which has been fully populated using a global method is lower than in a database which has been fully populated using k-nearest neighbours. Furthermore, since each recipient record is treated individually, the method obtains the most accurate imputed values for each recipient record rather than attempting to obtain the most accurate average imputed values across a group of records.

The main reason for the limited use, up to now, of the k-nearest neighbours method with very large databases is that the number of distances that have to be stored and then ranked made it impractical on such databases. This problem has now been overcome: the method no longer stores the distance from the incomplete record being processed to every complete record, ranks all the distances and then selects the specified number of NNs. Instead, only a small fraction of the distances to the complete records in the database is stored at any one time.

A good example of the type of data that can be imputed using k-nearest neighbours is lifestyle data. However, variables such as ownership of pets, type of credit card owned and participation in hobbies, for example stamp collecting, should not be imputed because they are generally independent of other variables and do not define people.

To show how k-nearest neighbours works, consider the data in Table 1. The data come from a survey and the fields are:

- mar_stat: marital status: D (divorced); M (married); S (single)
- res_stat: residential status: P (owner-occupier); T (rent alone); Z (multiple rent)
- age: age (months)
- bank: time with bank (months)
- cheq_card: own a cheque guarantee card: N (no); Y (yes)
- add: time at current address (months)
- emp: time with current employer (months)
- occup: occupation code

Table 2 lists the NNs found for each incomplete record. It shows that the three NNs for record 3 are records 4, 7 and 2, with record 4 being the most similar and record 2 the least similar of the three. Since there are six complete records in the database, there can be up to six NNs. If only one donor (the NN) is to be used, the missing values in records 3, 5, 8 and 10 are copied directly from the corresponding fields in records 4, 1, 2 and 9 respectively. If two NNs are to be used, the missing values in records 3, 5, 8 and 10 are calculated from the corresponding fields in records 4 and 7, 1 and 2, 2 and 7, and 9 and 1 respectively. If three NNs are to be used, the missing values in record 3 are calculated from the corresponding fields in records 4, 7 and 2, those in record 5 from records 1, 2 and 4, those in record 8 from records 2, 7 and 4, and those in record 10 from records 9, 1 and 2. The fully populated database shown in Table 3 was obtained by using the three NNs shown in Table 2.
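As a minimal sketch of this recipient-by-recipient procedure (not the software used in the study), the function below finds the k nearest complete records to one incomplete record, keeping only the k smallest distances rather than storing and ranking them all, and imputes each missing numeric field from the mean of those donors. The Euclidean distance, the field names and the toy data are illustrative assumptions; text fields would instead take the most common donor value, and in practice the fields would be scaled before distances are computed.

```python
import heapq
import math

def knn_impute(record, donors, k=3):
    """Impute the None fields of one recipient from its k nearest donors.

    Distance is Euclidean over the fields the recipient actually has, and
    each missing numeric field is filled with the mean of the donors' values.
    """
    known = [f for f, v in record.items() if v is not None]
    missing = [f for f, v in record.items() if v is None]

    def dist(donor):
        return math.sqrt(sum((record[f] - donor[f]) ** 2 for f in known))

    # Keep only the k smallest distances instead of ranking every donor.
    nearest = heapq.nsmallest(k, donors, key=dist)
    filled = dict(record)
    for f in missing:
        filled[f] = sum(d[f] for d in nearest) / len(nearest)
    return filled

# Complete records act as donors; field names and values are illustrative.
donors = [{"age": 420, "bank": 60, "add": 36},
          {"age": 300, "bank": 24, "add": 12},
          {"age": 432, "bank": 72, "add": 48}]
recipient = {"age": 410, "bank": 66, "add": None}
print(knn_impute(recipient, donors, k=2))
```

The call to heapq.nsmallest is the point of contact with the enhancement described above: only about k distances need to be held at any one time, rather than one distance per complete record.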
As with many methods of data analysis, the trade-off between run-time and accuracy must be considered: a big increase in run-time is not always accompanied by a significant improvement in accuracy. For k-nearest neighbours an obvious question is how the number of donors affects the accuracy of the imputed values. It is reasonable to assume that the more donors are used, the 'better' the imputed values will be. However, increasing the number of donors has two adverse effects: firstly, as the distance between the recipient and the donors increases, the donors become more dissimilar from the recipient; and secondly, the run-time increases. This and other questions are discussed later in this paper.

# Example of Using k-Nearest Neighbours for Imputation

The use of the k-nearest neighbours method for imputation was tested on a database of 20,000 records, of which 13,303 (66.5%) were fully populated and the remaining 6,697 (33.5%) had at least one missing value. The actual values of the missing entries were known, so that after the imputation the imputed values could be compared with the actual values. The database had 8 fields, as described in Table 5. Four runs were carried out; their settings are shown in Table 6.

# h) Imputation Results

The results of the imputation for the text variables are shown in Table 7. Income was the only ordinal field; it was divided into 10 levels, and its null count was 1,255. The results of the four runs for income are shown in Table 9 as the percentage of entries in cells on and off the Leading Diagonal (LD) of the cross classification matrix (this is just a crosstab of actual values against imputed values). The run-times of the four runs are given in Table 10.
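As a small illustration of this cross classification summary (again, not the software used in the study), the sketch below codes actual and imputed income as integer levels and reports the percentage of imputed values at each distance from the LD; the function name and the toy data are assumptions.

```python
from collections import Counter

def off_diagonal_summary(actual_levels, imputed_levels):
    """Percentage of imputed ordinal values at each distance from the
    leading diagonal of the actual-vs-imputed cross classification."""
    distances = Counter(abs(a - i) for a, i in zip(actual_levels, imputed_levels))
    n = len(actual_levels)
    return {d: 100.0 * c / n for d, c in sorted(distances.items())}

# Toy example with income coded as levels 1-10 (values are illustrative).
actual  = [3, 5, 7, 2, 9, 4, 6, 1]
imputed = [3, 4, 7, 4, 8, 4, 3, 2]
print(off_diagonal_summary(actual, imputed))
# {0: 37.5, 1: 37.5, 2: 12.5, 3: 12.5}; distance 0 means "on the LD"
```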
# IV. Discussion of Results

Table 7 shows that the percentage of correctly imputed text variables is not very sensitive to the sampling percentage or to the number of NNs. This is because the records are randomly distributed in the database, rather than falling into a number of distinct and homogeneous groups each concentrated in adjoining records, and because the variation of the fields in the NNs hardly increases as the number of NNs increases. It might be expected that as the number of NNs increases, the variation of the fields in the NNs would also increase, because the (complete) records in the search domain are ranked by similarity to the incomplete record; the record most recently added to the search domain is always most similar to the previously added record and least similar to the first record in the search domain. If the variation of the fields in the NNs were much greater, the success of the imputation would be much more sensitive to the number of NNs. In general, the rate at which the variation of the fields in the NNs increases depends on the data and the number of NNs.

Another interesting result in Table 7 is the variation of the percentage of correctly imputed values across the fields; for example, in run 1 it ranges from 37.43% to 90.62%. It is reasonable to assume that as the number of categories of a text field increases, the percentage of values correctly imputed would decrease, all other things being equal. However, there is another, more important factor at play: the standard deviation of the relative frequencies of the categories. Table 11 shows this standard deviation, the rank of the standard deviations, the rank of the percentage of values correctly imputed for any of the four runs in Table 7, and the rank of the number of categories, for all the text fields (1 is the smallest rank and 6 the largest). There is an almost perfect one-to-one correspondence between the third and fourth columns in Table 11. The only discrepancy occurs with the ranks of h_shrs and h_prop; their standard deviations are more similar to one another than are other pairs of standard deviations. This suggests that the variation in the relative frequencies of a field's categories is a significant factor in determining the percentage of values correctly imputed. The fourth and fifth columns in Table 11 also appear to follow a weak inverse relationship: the larger the number of categories a text field has, the smaller its percentage of correctly imputed values is likely to be. Further results are given in Table 12.

Table 9 shows that for all the runs about half the values were imputed correctly or within one level either side of the LD, and that the percentages fall off very quickly as the cells move away from the LD.

Comparing Tables 6 and 10, it is immediately apparent that the run-time is strongly influenced by the number of complete records from which the imputed values are calculated, and is not affected at all by the number of NNs. This is as expected, because the number of distances calculated is given by the product of the number of incomplete records and the number of complete records used. The number of NNs does not determine the run-time because the software used in this study has a very powerful sorting algorithm which sorts the distances as they are processed. This means that the number of distances actually stored as each incomplete record is processed is kept to a minimum and is very close to, or equal to, the number of NNs specified at input. The alternative solution to this sorting problem is to store the distance for each complete record and then, when all the complete records have been processed, rank all the distances from smallest to largest. This approach has two big computational problems: firstly, the larger the database, the more distances have to be stored; and secondly, the time required to sort these distances is not insignificant, and may indeed be greater than the time required to calculate the distances and impute the missing values.

# V. Conclusion

This paper has presented the results of a study in which a powerful data mining technique, k-nearest neighbours, has been enhanced and then applied to the ever-present problem of imputing missing values in large databases. The enhancements make the method amenable for use on very large commercial databases, and the results of the study show that its overall accuracy is very high. The advantage of having more accurate imputed data is that campaigns can be better targeted; in turn, this will generate higher response rates and a greater return on investment.
Table 5: Fields in the test database

| Field | Description | Type |
|-------|-------------|------|
| p_hhld | head of household | text |
| h_comp | household composition | text |
| h_shrs | value of shares held | text |
| h_prop | type of property | text |
| h_inc | household income | ordinal |
| h_res | residence type | text |
| h_ten | household tenure | text |
| age | age | continuous |

Table 6: Settings of the four runs

| Run No. | Sampling Percentage | No. of Complete Records Sampled | No. of Nearest Neighbours |
|---------|---------------------|---------------------------------|---------------------------|
| 1 | 10 | 1,300 | 10 |
| 2 | 10 | 1,300 | 20 |
| 3 | 5 | 665 | 10 |
| 4 | 5 | 665 | 1 |

Table 7: Imputation results for the text fields (% correctly imputed in each run)

| Field | Null Count | No. of Categories | Run 1 | Run 2 | Run 3 | Run 4 |
|-------|------------|-------------------|-------|-------|-------|-------|
| p_hhld | 991 | 2 | 90.62 | 90.41 | 90.31 | 88.50 |
| h_comp | 1,090 | 12 | 37.43 | 35.23 | 33.94 | 30.83 |
| h_shrs | 1,090 | 3 | 76.97 | 77.34 | 76.79 | 75.32 |
| h_prop | 1,189 | 5 | 83.52 | 83.35 | 82.51 | 79.14 |
| h_res | 1,255 | 5 | 44.94 | 45.10 | 45.82 | 40.24 |
| h_ten | 1,189 | 3 | 68.04 | 66.69 | 68.29 | 60.64 |

Table 8: ME and MAE for the four runs

| Run No. | ME | MAE |
|---------|-----|------|
| 1 | 0.64 | 12.69 |
| 2 | 1.15 | 12.82 |
| 3 | 1.07 | 12.90 |
| 4 | 0.71 | 14.21 |

Table 9: Imputation results for income as percentages of values on and off the Leading Diagonal (LD)

| Run No. | % Corr (LD) | 1 off | 2 off | 3 off | 4 off | 5 off | 6 off | 7 off | ≥8 off |
|---------|-------------|-------|-------|-------|-------|-------|-------|-------|--------|
| 1 | 17.53 | 32.91 | 25.02 | 14.02 | 6.93 | 2.63 | 0.72 | 0.24 | |
| 2 | 17.05 | 31.39 | 24.86 | 16.10 | 7.09 | 2.71 | 0.80 | | |
| 3 | 15.62 | 31.08 | 23.59 | 16.02 | 8.84 | 3.82 | 0.80 | 0.23 | |
| 4 | 17.13 | 29.72 | 20.88 | 15.30 | 9.24 | 4.46 | 2.07 | 0.88 | 0.32 |

Table 10: Run-times of the four runs

| Run No. | Run-Time (Mins) |
|---------|-----------------|
| 1 | 47 |
| 2 | 48 |
| 3 | 24 |
| 4 | 24 |

Table 12: Number of values by position relative to the LD

| Position | No. of Values |
|----------|---------------|
| LD | 10 |
| 1 off LD | 18 |
| 2 off LD | 16 |
| 3 off LD | 14 |
| 4 off LD | 12 |
| 5 off LD | 10 |
| 6 off LD | 8 |
| 7 off LD | 6 |
| 8 off LD | 4 |
| 9 off LD | 2 |