# Introduction

Machine learning techniques are being adopted by applications from many domains to build predictive models. These techniques are broadly classified into supervised learning and unsupervised learning based on the availability of class labels for building the model. Supervised learning methods require labeled data to build a classifier model that predicts the class labels of unknown examples based on the information contained in those labels. However, collecting labeled data is usually an expensive and time-consuming process (Han et al., 2011). Even in domains with an abundance of unlabeled data, labeled data are usually scarce and require some effort to collect. Moreover, building a classifier with good generalization accuracy requires a large number of labeled examples, more so for datasets with high dimensionality, which is one of the problems associated with the curse of dimensionality (Ramona et al., 2012). Accordingly, it is believed that with a fixed number of labeled examples, the predictive power of a classifier decreases as the number of dimensions increases, thus requiring a larger number of labeled examples to build the classifier (Advani, 2011).

In unsupervised learning methods such as clustering, unlabeled data, if available in abundance, suffice to extract hidden patterns of knowledge from a given dataset. Traditional clustering algorithms take the entire feature space into account to partition the dataset into clusters such that there is homogeneity among the instances within a cluster, with the proximity between instances measured by a distance function. However, as the number of dimensions increases, the distance measures employed by the clustering algorithm become insignificant and the clusters so produced become meaningless. Hence clustering with the full feature space, especially when the number of dimensions is large, may not produce good clusters. Finding the subset of the feature space that produces meaningful clusters comes under the purview of subspace clustering. Subspace clustering focuses on finding a subset of features, or a smaller set of transformed features, with the aim of defining cluster-able object spaces (Han et al., 2011; Sim et al., 2013). In high dimensional datasets, owing to the exponentially large number of subsets of the feature set, subspace clustering techniques have to eliminate an enormous number of possibilities before identifying the feature space that contains intrinsically significant clusters (Han et al., 2011).

Basic research in subspace clustering falls under unsupervised learning, as it tries to identify clusters based on the distribution of objects in various feature subspaces irrespective of the class labels of the objects. The clusters thus formed may be meaningful but may not be relevant to the intended purpose or context. For instance, census data are described in terms of different features such as social, economic, education and health attributes, yet they need to be clustered into groups depending on the purpose of the data analysis. Features corresponding to social backwardness and economic status are used to identify the welfare schemes to be adopted, whereas features corresponding to place of living, commutability, etc., are used to decide the location of new amenity centers.
In both cases, the features used and their relative significance vary with the context or purpose, requiring the clustering algorithm to give proper emphasis to the appropriate features in accordance with the context for which the clusters are intended. Context-aware subspace clustering aims to find an appropriate feature subspace for a given context, represented in the form of class labels of a few labeled examples that are consistent with a large collection of unlabeled examples belonging to the same dataset. To the best of our knowledge, little research has been published on feature selection algorithms that make use of a combination of labeled and unlabeled examples. Hence, semi supervised feature selection algorithms need to be developed for the formation of context-aware clusters in domains where only a limited number of examples are labeled and the rest are left unlabeled.

Semi supervised learning, which is an integration of supervised and unsupervised learning, makes use of both labeled and unlabeled examples to build a model (Zhu and Goldberg, 2009). It has two forms, namely semi supervised classification and semi supervised clustering. Semi supervised classification uses both labeled and unlabeled data to build the classifier: using the limited number of labeled data, probable class labels for the unlabeled data are derived and added to the pool of labeled data, thus increasing the number of labeled examples (Han et al., 2011). The basic assumption in this technique is that similar data will have the same class labels (the cluster assumption) (Chapelle et al., 2006; Wang et al., 2012). Different methods such as self training, co-training, generative probabilistic models, graph based methods and support vector machines are used for semi supervised classification (Zhu, 2008). In semi supervised clustering, a large set of unlabeled data is accompanied by a small amount of domain knowledge in the form of either class labels or pairwise constraints (must-link and cannot-link) (Grira et al., 2004; Ding et al., 2012). This domain knowledge is used to guide the clustering of the unlabeled data so that intra-cluster similarities are maximized, inter-cluster similarities are minimized, and there is consistency between the partition and the available knowledge (Gao et al., 2006).

Based on the above arguments, the authors propose a context-aware semi supervised subspace clustering framework which leverages the domain knowledge available in the form of class labels for at least some of the examples (when labeled examples are expensive) in order to estimate the suitability of features to the intended cluster solution. Proper selection of features and their relative significance is essential in producing context-aware clusters, which are ideally uni-class clusters. A uni-class cluster contains all or a majority of elements belonging to the same class label, which is reflected in its cluster purity. The clustering framework is further extended to build a classifier, referred to as a semi supervised classifier, that requires minimum information for prediction. The authors also propose the 'Semi Supervised Feature Relevance Estimation' (SFRE) algorithm to estimate the relevant features and their relative significance in terms of weights that define appropriate subspaces for different targets/contexts. The framework was tested on a few benchmark datasets from the UCI repository and has given promising results.

# II. Related Work

Researchers in the past have come up with different methods for semi supervised learning.
One popular approach is constraint based clustering. Constraint based methods use pairwise constraints in the form of must-link and cannot-link relations that guide the clustering process to partition the data in a way that does not violate these constraints (Wagstaff et al., 2001; Basu et al., 2004; Lu and Leen, 2004). Recently, Xiong et al. (2014) proposed an iterative active learning approach to select pairwise constraints for semi supervised clustering. It uses the concept of a neighbourhood that contains labeled examples of different clusters based on pairwise constraints, and the uncertainty associated with each point's neighbourhood is resolved through queries. However, repeated clustering is required as the list of constraints grows. Another popular approach for semi supervised clustering is distance based techniques, which rely on the cluster assumption. Yin and Hu (2011) proposed a semi supervised clustering algorithm using adaptive distance metric learning, where clustering and distance metric learning are performed simultaneously: the clustering results are used to learn the distance metric, and the data is projected into a low dimensional space such that data separability is maximized. Gao et al. (2006) focused on semi supervised clustering in terms of features rather than examples; it addresses the problem where the labeled and unlabeled datasets have different feature sets with few common features. In terms of feature selection, Padmaja et al. (2010) proposed a dimensionality reduction approach that estimates the significance of features based on their fractal dimensions and accordingly selects a subset of features that are essential to capture the characteristics of the dataset. The algorithm detects all types of correlations among features to identify the essential features after eliminating redundant and irrelevant ones. Kernel based feature selection has also been explored by a few researchers (Wang, 2008; Ramona et al., 2012). Clustering based feature selection for classification was proposed by Song et al. (2013), where features are clustered using a graph theoretic clustering method. Research on feature weighting and ranking has concentrated more on supervised learning (Eick et al., 2006; Al-Harbi and Rayward-Smith, 2006; Zhao and Qu, 2009). Most of these studies initially weigh the features using a random guess or equal weights, and these initial weights are then adjusted. Such an approach may take much time to arrive at the final optimum weights if the initial guess is not appropriate. This paper deals with semi supervised learning methods with a wrapper based feature selection method that uses discriminant analysis results to initialize the weights. These weights are adjusted in a stepwise refinement process using both labeled and unlabeled examples. The proposed framework is used to develop a classifier and a pertinent cluster solution.

# III. Context-aware Semi Supervised Subspace Clustering Framework

A dataset may be clustered in multiple ways by appropriately selecting a subset of features/attributes depending on the purpose. Hence, to produce clusters conforming to a particular purpose or context, weights must be given to features that depict their importance. Researchers in the past have initially started with guessed/random weights or equal weights for the features and proceeded further to determine more acceptable weights.
Instead of starting with some arbitrary values, it is proposed to use the information available in the labeled data to initialize the weights, which can be adjusted later. The authors therefore propose the use of discriminant analysis, which finds the relationship between the independent features (predictors) and the dependent feature (class label), to initialize the feature weights. Discriminant analysis is a method used to predict a categorical value from a given set of independent features; it assumes the independent features to be normally distributed. The linear discriminant function is (Equation 1)

$$D = V_1 X_1 + V_2 X_2 + V_3 X_3 + \cdots + V_i X_i + a \qquad (1)$$

where $D$ is the discriminant score, $V_i$ is the discriminant coefficient or weight of the $i$-th feature, $X_i$ is the value of the $i$-th feature, and $a$ is a constant. Discriminant analysis thus identifies the relevant features, and its coefficients reflect the relevance of each feature. The coefficients obtained from the discriminant analysis are normalized and used as initial weights for developing a binary-class cluster solution, whereas the development of a multi-class cluster solution involves integrating the results of multiple discriminant functions. For multi-class datasets, the proposed framework uses the potency index, following the approach of Dharmavaram and Mogalla (2013), to determine the initial weights of the features from the labeled examples.

# b) Clustering Algorithm

The initial weight vector is used to form the initial cluster solution using any partitional clustering algorithm. The authors have chosen the K-means algorithm for its simplicity and computational efficiency in dealing with numerical features. For datasets described in terms of numerical attributes, K-means generally employs the Euclidean distance to compute the distance from each data point to the cluster centroid. The Euclidean distance assumes that all features are equally important while forming the clusters. However, as discussed previously, the weight of a feature determines its relevance in forming the desired cluster solution, and accordingly a weighted Euclidean distance metric is used for the distance calculation (Equation 2):

$$d_w(x_i, x_j) = \sqrt{\sum_{m=1}^{M} w_m \,(x_{im} - x_{jm})^2}, \qquad \text{where } \sum_{m=1}^{M} w_m = 1 \qquad (2)$$

Here $w_m$ indicates the weight of the $m$-th feature. The more significant a feature, the larger its weight; the weight of an irrelevant feature can be set to zero.
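As an illustration of how the weight vector enters the clustering step, the following Python/NumPy sketch computes the weighted Euclidean distance of Equation 2 and performs one K-means assignment step with it. The array shapes, function names and numeric values are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance of Equation 2; w is normalized to sum to 1."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

def assign_clusters(X, centroids, w):
    """One K-means assignment step: each example goes to the nearest
    centroid under the weighted metric."""
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        dists = [weighted_euclidean(x, c, w) for c in centroids]
        labels[i] = int(np.argmin(dists))
    return labels

# Toy usage: 3 features, initial weights taken from normalized
# discriminant coefficients; the values here are purely illustrative.
X = np.array([[1.0, 2.0, 0.5],
              [1.1, 1.9, 0.4],
              [5.0, 0.2, 3.0]])
w = np.array([0.6, 0.3, 0.1])   # hypothetical normalized weight vector
centroids = X[[0, 2]]           # K = 2, seeded from two examples
print(assign_clusters(X, centroids, w))   # e.g. [0 0 1]
```

Weighting the squared differences leaves the rest of K-means unchanged, which is what later allows the refinement algorithm to be wrapped around any partitional clustering routine.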
For clustering, the number of clusters, K, is taken to be larger than the number of classes. Larger values of K result in the formation of many small uni-class clusters, so multiple clusters may be associated with a single class. The cluster concurrence is estimated for each cluster based on the agreement of the members of the cluster towards a particular class label and hence reflects the uni-class property of the cluster. In order to estimate the cluster concurrence of the $k$-th cluster, the support, $S_{kj}$, available for each class $j$ in that cluster is aggregated as shown in Equation 3:

$$S_{kj} = \frac{1}{|C_k|} \sum_{n \in C_k} M \times P_j(n) \qquad (3)$$

where $P_j(n)$ indicates the probability of the example $n$ belonging to the class $j$, $|C_k|$ is the cardinality of the cluster $k$, i.e., the number of examples assigned to cluster $C_k$, and

$$M = \begin{cases} 1 & \text{if } j = \operatorname{argmax}_{j'} \{ P_{j'}(n) \} \\ 0 & \text{otherwise} \end{cases}$$

The binary term $M$ acts as a deciding factor indicating whether the example contributes to the support of class $j$ or not. It may be noted that each example, whether labeled or unlabeled, contributes to the support of only one class: an unlabeled example supports the class with the maximum probability, while a labeled example naturally supports its one and only true class label. $P_j(n)$ is calculated as per Equation 4, where $d(i,n)$ is the weighted Euclidean distance between $i$ and $n$:

$$P_j(n) = \begin{cases} 1 & \text{if } n \in L \text{ and } y_n = j \\ \dfrac{\sum_{i \in L,\, y_i = j} 1/d(i,n)^2}{\sum_{i \in L} 1/d(i,n)^2} & \text{if } n \in U \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

The predicted label of an unlabeled example, $t$, is the label for which this probability is maximum. The cluster concurrence of the $k$-th cluster is estimated as

$$CC_k = \max_j \{ S_{kj} \}$$

The overall cluster purity of the cluster solution is taken as the weighted sum of the individual cluster concurrences (Equation 5):

$$CP = \sum_{k=1}^{K} \frac{|C_k|}{|D|}\, CC_k \qquad (5)$$
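The support, concurrence and purity computations of Equations 3 to 5 can be sketched as below. This is an illustrative fragment under our own assumptions: `labels` is a NumPy integer array holding the true class index for labeled examples and -1 for unlabeled ones, `d` is a precomputed matrix of pairwise weighted Euclidean distances, and the helper names are hypothetical rather than the authors' code.

```python
import numpy as np

def class_probabilities(n, labels, d, q):
    """P_j(n) as in Equation 4: a labeled example supports only its true
    class; an unlabeled example is scored by inverse-squared distances
    to the labeled examples of each class."""
    p = np.zeros(q)
    if labels[n] >= 0:                      # labeled example
        p[labels[n]] = 1.0
        return p
    lab = np.where(labels >= 0)[0]          # indices of labeled examples
    inv = 1.0 / (d[n, lab] ** 2 + 1e-12)
    for j in range(q):
        p[j] = inv[labels[lab] == j].sum()
    return p / inv.sum()

def overall_cluster_purity(assign, labels, d, q, K):
    """Cluster concurrence CC_k (Equation 3) and overall purity CP (Equation 5)."""
    N, cp = len(assign), 0.0
    for k in range(K):
        members = np.where(assign == k)[0]
        if members.size == 0:
            continue
        S = np.zeros(q)                     # support S_kj for each class j
        for n in members:
            p = class_probabilities(n, labels, d, q)
            S[np.argmax(p)] += p.max()      # each example supports one class (the M term)
        S /= members.size
        cp += (members.size / N) * S.max()  # CC_k = max_j S_kj, weighted by cluster size
    return cp
```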
The new algorithm, SFRE, is guided by the cluster purity estimated in terms of both labeled and unlabeled examples belonging to various feature subspaces. The algorithm accepts as input the dataset D, which includes L and U, the initial cluster purity, and the outcome of the discriminant analysis used as the initial weight vector. The output of the algorithm is a refined set of relevance estimates for the features, referred to as the weight vector, that defines the feature subspace for the purpose indicated through the class labels.

The cluster purity obtained with the initial weights is assigned to the current cluster purity as an initialization step, after which the algorithm executes the following three steps iteratively:

Step 1: Finding relevant features
Step 2: Updating weights
Step 3: Checking for convergence

In the first step, each feature in the feature set is checked for its relevance. Taking one feature at a time, clusters are formed without that feature and the cluster purity is estimated. If there is a decrease in cluster purity compared to the current cluster purity, the absence of the feature has resulted in a loss of purity; the feature is therefore marked as relevant and its relevance increment is calculated as the proportionate difference between the cluster purity estimated with and without the feature. If there is an increase in cluster purity compared to the current cluster purity, the absence of the feature has resulted in a gain in purity, and the feature is marked as irrelevant. The outcome of this step is a relevance marking for each feature and a relevance increment for the relevant features. In the second step, based on the relevance marking, the weights of the relevant features are incremented in accordance with the relevance increments calculated in Step 1, the weights of the features marked irrelevant are set to zero, and the weight vector is normalized to sum to 1. In the final step, clusters are formed with the adjusted weights to judge the new solution: the cluster purity obtained with the updated weights and features is compared with the current cluster purity, and if there is an improvement, the new weights are accepted and the new cluster purity becomes the current cluster purity for comparison in the next iteration.

The steps are repeated until there is no significant improvement in the cluster purity. To change the order in which the features are examined in subsequent iterations, features are selected randomly without replacement. This helps to avoid any bias from overlapping or correlated features and to avoid local maxima.

# e) Formal listing of Proposed Algorithm (SFRE)

Let CP_curr be the cluster purity estimated for the initial cluster solution; the stepwise refinement of the weights then proceeds as follows:

Step 1: For each feature x, randomly selected without replacement from the feature set F:
- Perform K-means without the feature x, appropriately normalizing the weight vector
- Estimate the cluster purity CP_F-x
- If CP_F-x < CP_curr then x is relevant; calculate its relevance increment Rel_x = (CP_curr - CP_F-x) / CP_curr
- Else x is not relevant

Step 2:
- Increase the weight of each relevant feature x: W_x = W_x (1 + Rel_x)
- For each irrelevant feature x: W_x = 0
- Normalize the weight vector

Step 3:
- Perform K-means with the adjusted weights
- Estimate the cluster purity CP_new
- If CP_new > CP_curr, accept the new weights and set CP_curr = CP_new

The above steps are performed until there is no improvement in the cluster purity. The final cluster solution thus formed consists of context-aware clusters with the final set of relevant features and weights.
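A compact sketch of the SFRE refinement loop listed above is given below. It assumes a helper `purity_fn(X, w, labels, K)` that runs weighted K-means with weight vector `w` and returns the overall cluster purity of Equation 5; the helper, the argument names and the iteration cap are assumptions for this sketch, not details given in the paper.

```python
import numpy as np

def sfre(X, labels, w_init, K, purity_fn, max_iter=50, seed=0):
    """Stepwise Semi Supervised Feature Relevance Estimation (sketch).

    purity_fn(X, w, labels, K) -> overall cluster purity obtained by
    weighted K-means with weight vector w (hypothetical helper).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w_init, dtype=float)
    w = w / w.sum()
    cp_curr = purity_fn(X, w, labels, K)
    for _ in range(max_iter):
        w_cand = w.copy()
        rel = np.zeros(len(w))
        # Step 1: test every feature, in random order, by removing it.
        for x in rng.permutation(len(w)):
            w_wo = w_cand.copy()
            w_wo[x] = 0.0
            if w_wo.sum() == 0.0:
                continue
            cp_wo = purity_fn(X, w_wo / w_wo.sum(), labels, K)
            if cp_wo < cp_curr:
                # Purity drops without the feature: it is relevant.
                rel[x] = (cp_curr - cp_wo) / cp_curr
            else:
                # Purity does not drop: mark the feature irrelevant.
                w_cand[x] = 0.0
        # Step 2: boost relevant features and renormalize.
        w_cand = w_cand * (1.0 + rel)
        if w_cand.sum() == 0.0:
            break
        w_cand = w_cand / w_cand.sum()
        # Step 3: accept the adjusted weights only if purity improves.
        cp_new = purity_fn(X, w_cand, labels, K)
        if cp_new <= cp_curr:
            break
        w, cp_curr = w_cand, cp_new
    return w, cp_curr
```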
# IV. Semi Supervised Classification Framework

The final cluster solution formed in the training phase contains K clusters, with each cluster containing examples belonging to one or more classes. The support of a class in a cluster, $S_{kj}$, is estimated in terms of the true class labels of the labeled examples and the predicted (probabilistic) class labels of the unlabeled examples in the $k$-th cluster. However, in the presence of overlapping examples or outliers, the examples in a cluster may not strongly agree on a particular class; such a cluster is not considered a uni-class / decisive cluster and is left unlabeled, being treated as an indecisive cluster. In a given cluster, the difference between the support available for the majority class and that of its competing class reflects the decisiveness of the cluster in concurrence with the majority class. For this purpose, the authors propose a metric referred to as the 'Purity Margin', which is measured for each cluster and compared against a purity threshold as detailed below.

# b) Purity Threshold of the cluster

The 'Purity Threshold', PT, of a cluster $C_k$, denoted $PT_k$, is set as the minimum difference, or margin, to be imposed between two competing classes in a cluster for it to be considered a decisive cluster. The purity threshold is estimated as a pre-defined fraction ($\alpha$) of the product of the cluster concurrence $CC_k$ and the number of classes in the dataset. In a dataset with q classes, the purity threshold $PT_k$ for a cluster $C_k$ is calculated as (Equation 6)

$$PT_k = \alpha \cdot CC_k \cdot q \qquad (6)$$

Experiments conducted on the value of $\alpha$ show that 0.1, i.e., 10% of the support value, gives a good purity threshold.

# c) Purity Margin of the cluster

The purity margin measures the difference between the maximum support of a class in a cluster and the support of its immediate competitor class. The larger the margin, the purer the cluster; intuitively, it should be greater than or equal to the purity threshold. For a cluster $C_k$, the purity margin $PM(C_k)$ is calculated as (Equation 7)

$$PM(C_k) = CC_k - S_{kp} \qquad (7)$$

where p is the competing class.

# d) Decisive cluster

A cluster $C_k$ is considered to be a uni-class or decisive cluster if $PM(C_k) \ge PT_k$; otherwise it is considered an indecisive cluster. A decisive cluster is labeled with the majority class label, i.e., the class label that has the maximum support of the examples in the cluster over all classes. An indecisive cluster is left unlabeled, and the details of the cluster, including the predicted labels of its unlabeled examples, are stored so that weighted nearest neighbour classification can be applied when classifying an unknown / test example.

# e) Hybrid Model for Classification

The authors propose a hybridization of model-based classification and instance-based classification for classifying an unknown / test example, based on whether it is compatible with a decisive cluster or an indecisive cluster. Let the cluster $C_k$ be the most compatible cluster for an unknown example x:

- If the cluster $C_k$ is decisive, assign the cluster's class label to the example x.
- If the cluster is indecisive, apply weighted nearest neighbour classification to predict the class label of x.

# f) Finding the most compatible cluster for unknown / test example

Consider a set of clusters C = {C_1, C_2, ..., C_K} with centroids c = {c_1, c_2, ..., c_K}. Weighted Euclidean distances are calculated between the unknown / test example, t, and each centroid, c_i. The cluster $C_k$ with the minimum distance among all the clusters is taken as the most compatible cluster for the example t. Mathematically, it may be expressed as (Equation 8)

$$k = \operatorname{argmin}_{i} \{\, d(t, c_i) \,\} \qquad (8)$$

The hybrid model for classification is then applied based on the value of k, as discussed earlier. When the most compatible cluster is indecisive, the proximity of the unknown / test example t to each class must be measured. The closer the example t is to a neighbourhood dominated by a particular class label, the more likely it is to share the class label of its neighbours (cluster assumption). Accordingly, all the members of the most compatible cluster $C_k$ are considered as neighbours, with weights assigned in inverse proportion to their squared distance to the test example. The proximity of the example t to a class label p, denoted by $W_{tp}$, is estimated by aggregating the weights of the members belonging to that class. Mathematically, it may be expressed as (Equation 9)

$$W_{tp} = \sum_{i \in C_k,\; y_i = p} \frac{1}{d(t,i)^2} \qquad (9)$$

where d(t,i) is the Euclidean distance between t and i. This proximity estimate ensures that examples far from the test example (possibly outliers) have less impact on the prediction than those that are closer. The unknown / test example is assigned the class label for which the proximity is maximum (Equation 10):

$$\text{label}(t) = \operatorname{argmax}_{p} \{\, W_{tp} \,\} \qquad (10)$$
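To make the hybrid model concrete, the sketch below selects the most compatible cluster (Equation 8), returns its label when the cluster is decisive, and otherwise applies the weighted nearest-neighbour vote of Equations 9 and 10. The cluster representation, a list of dicts with 'centroid', 'members', 'decisive' and 'label' keys, is a hypothetical structure chosen for illustration, not the authors' data structure; the weighted metric is reused here for the vote as well.

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance used throughout the framework."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

def classify(t, clusters, X, example_labels, w):
    """Hybrid classification of a test example t.

    clusters       : list of dicts with keys 'centroid', 'members',
                     'decisive', 'label' (hypothetical structure)
    example_labels : true label for labeled training examples,
                     predicted label for unlabeled ones
    """
    # Equation 8: the most compatible cluster is the nearest centroid.
    k = int(np.argmin([weighted_euclidean(t, c["centroid"], w) for c in clusters]))
    ck = clusters[k]
    if ck["decisive"]:
        return ck["label"]                       # model-based prediction
    # Indecisive cluster: weighted nearest-neighbour vote (Equations 9 and 10).
    proximity = {}
    for i in ck["members"]:
        d2 = weighted_euclidean(t, X[i], w) ** 2 + 1e-12
        p = example_labels[i]
        proximity[p] = proximity.get(p, 0.0) + 1.0 / d2
    return max(proximity, key=proximity.get)     # label with maximum proximity
```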
# V. Experiments and Results

# a) Experimental Setup

The proposed model was implemented on an Intel Pentium dual core processor with 3 GB of DDR2 667 MHz memory and coded using the .NET framework. The SPSS statistics tool was used for performing the discriminant analysis. Experiments were conducted on benchmark datasets obtained from the UCI repository and on one dataset from SPSS Inc. to test the performance of the proposed framework. Five binary datasets and six multi-class datasets were used in the experimentation, as shown in Table 1. The labels of some of the examples were withheld so as to obtain training sets with different proportions of labeled data. For binary class datasets, experiments were conducted with 100% labeled examples to assess the performance of the framework when all the examples in the dataset are labeled. However, the availability of labeled examples up to 100% does not call for semi supervised learning; the case with 100% labeled examples was included only to show that the proposed method handles datasets with few labeled examples in the same way as datasets with 100% labeled examples, maintaining consistently high performance. The complexity of cluster regularization and the estimation of cluster concurrence and purity margin for the hybrid classifier are not required for datasets with nearly 100% labeled examples; such datasets may be better processed by an appropriate supervised learning algorithm. The performance of the model for multi-class datasets was analysed starting from 75% labeled examples. In both the clustering and classification cases, discriminant analysis is performed on the labeled examples of each dataset using the SPSS statistics tool to produce the discriminant function(s). For binary class datasets the discriminant coefficients, and for multi-class datasets the potency index values, are used to obtain the initial weights of the features, referred to as the initial weight vector.

# b) Results

In the case of semi supervised subspace clustering, the cluster purity estimated from the cluster concurrence and the number of relevant features identified for the benchmark datasets are tabulated in Tables 2 and 3. From Fig. 2 and Fig. 3, it is observed that the proposed model performs consistently in terms of cluster purity, with little change as the percentage of labeled examples varies. Only in the case of the Zoo dataset is there a large decline in cluster purity when there are few labeled examples. This is attributed to the fact that the Zoo dataset contains only 101 examples, so 15% labeled data is very small compared to the number of class labels and may not capture representatives of all 7 classes.

In the case of semi supervised classification, the training sets of the benchmark datasets are used to build the classifier, and the accuracy of the classifier is tested on the test set, where the predicted class labels are compared with the true class labels of the test examples. These test results, given in terms of accuracy, are compared with proven classifier models. The models considered for comparison are the Weka implementation of C4.5 and an ensemble method, Bagging. Only one ensemble method is considered for comparison, as the other ensemble methods have similar performance on most of the datasets (Tan et al., 2006: Table 5.5). The results are tabulated in Tables 4 and 5, and sample comparison graphs for one binary and one multi-class dataset are shown in Fig. 4 and Fig. 5. Experiments on the benchmark datasets show that the proposed framework for both clustering and classification performs consistently well when building models on training sets with a varied range (75% to 15%) of labeled examples. When compared to other proven techniques, the proposed framework sustained its performance even when the number of labeled examples was reduced to 15%, thus establishing its validity as a semi supervised learning model. The proposed framework was able to identify the relevant features along with their weights, thus reducing the information requirement for handling unknown situations, be it classification or clustering.

# VI. Conclusion

In this paper, the authors proposed a framework for context-aware semi supervised learning in terms of both clustering and classification.
The proposed framework is useful in domains where labeled data is either scarce or difficult/expensive to obtain. The framework, with its wrapper based feature selection, is very useful in handling high dimensional datasets: with the dimensionality reduced, the cluster and classification solutions are defined with a smaller number of features, which is valuable where there are time and space constraints. The proposed framework not only identifies the relevant features but also estimates the importance of each feature in terms of weights, so that cluster solutions are formed as per the intended purpose. Though the framework has used K-means for the formation of the cluster solution, the proposed SFRE algorithm can be wrapped around any partitional clustering algorithm with equal ease for producing context-aware semi supervised subspace clusters, leveraging a few labeled examples to define the context. Since the model uses discriminant analysis for identifying attributes, it is limited to numerical data. However, in reality, many applications contain mixed data, a combination of numeric and categorical attributes. This opens an avenue for further research to extend the model to work with categorical data.

[Figure: Given a dataset with 'l' number of labeled and 'u' number of unlabeled examples such that l < ...]