# I. Introduction

Society is increasingly digitized, and as a result organisations produce and store vast amounts of data. Managing this data and gaining insights from it is both a challenge and a key to competitive advantage. Web-based social applications, such as social networking websites, generate huge amounts of unstructured text data that contain a great deal of useful information. People rarely bother about grammatical correctness while forming a sentence, which may lead to lexical, syntactic, and semantic ambiguities, and finding patterns in unstructured text is a difficult task. Data mining aims to discover previously unknown interrelations among apparently unrelated attributes of data sets by applying methods from several areas, including machine learning, database systems, and statistics. Much research has focused on different branches of data mining, such as opinion mining, web mining, and text mining.

Text mining is one of the most important strategies in knowledge discovery. It is the technique of extracting previously unknown, hidden, understandable, and interesting knowledge or patterns from unstructured text. The prime objective of text mining is to reduce the effort users must make to obtain relevant information from a collection of text sources [1]. Thus, our focus is on methods that extract useful patterns from texts in order to categorize or structure text collections. Generally, around 80 percent of a company's information is stored in text documents; hence text mining has a higher economic value than data mining. Current research in text mining tackles problems of text representation, classification, clustering, information extraction, and the search for and modelling of hidden patterns. Feature selection, the influence of domain knowledge, and domain-specific procedures play an important role. Text documents contain large-scale terms, patterns, and duplicate lists. Facets of a user query are usually presented in list styles and repeated many times among the top retrieved documents; to aggregate frequent lists within the top search results, various navigational techniques have been presented to mine query facets. Finding the best query facet and effectively using large-scale patterns remain hard problems in text mining. Moreover, traditional feature selection methods are not effective for selecting text features to solve the relevance issue. These issues suggest that we need efficient and effective methods to mine fine-grained knowledge from huge numbers of text documents and to help users obtain information about a query quickly, without browsing tens of pages. This paper provides a review of innovative techniques for extracting and classifying terms and patterns.

The organisation of the paper is as follows: Section 1 introduces a detailed overview of text mining frameworks and the applications and benefits of text mining. Sections 2 and 3 review feature selection, feature extraction, and techniques of pattern extraction. Section 4 discusses various text classification and clustering algorithms in text mining. Sections 5 and 6 introduce a detailed overview of discovering facets and fine-grained knowledge. Section 7 reviews duplicate detection in text documents. Section 8 contains the conclusions.
# a) Text Mining Models

Text mining consists of three steps: text preprocessing, text mining operations, and text post-processing. Text preprocessing includes data selection and cleaning. Many approaches [2] are concerned with obtaining structured datasets, called intermediate forms, on which data mining techniques [3] can then be applied. When documents contain terms with the same frequency, some of those terms may be meaningful while others are irrelevant. In order to discover the semantics of text, a concept-based mining model has been introduced; Figure 2 represents this model, which analyses the terms in the sentences of documents. The model contains a group of concept analyses: sentence-based concept analysis, document-based concept analysis, and a corpus-based similarity measure [4]. The concept-based similarity measure calculates the similarity between documents effectively and efficiently.

# b) Benefits of Text Mining

The benefits of text mining include better collection development to meet user needs, information retrieval, improved usability and system performance, database evaluation, and hypothesis development. Information professionals (IPs) [8] are always at the forefront of emerging technologies, and libraries and information services usually rely on them to make their products and services better and more efficient. Trained information professionals manage both the technical and semantic infrastructures, which is very important in text mining; they also manage content selection and the formulation of search techniques and algorithms.

Akilan et al., [9] presented the challenges and future directions in text mining. It is mandatory to perform semantic analysis to capture the relationships among objects in documents. Semantic analysis is computationally expensive and operates on only a few words per second, as text mining involves a significant language component. An effective text refining method has to be developed to process multilingual text documents, and trained knowledge specialists are necessary to deal with the products and applications of current text mining tools. Automated mining operations that can be used by technical users are also required. Domain knowledge plays an important role both at the text refining stage and at knowledge distillation, and hence helps in improving the efficiency of text mining.

Sanchez et al., [10] presented text knowledge mining (TKM), deductive inference that is usually targeted at a feasible subset of texts and searches for contradictions. The procedure obtains new knowledge by taking the union of the intermediate forms of texts derived from accurate knowledge expressed in the text. Dai et al., [11] introduced the competitive intelligence analysis methods FFA (Five Forces Analysis) and SWOT together with text mining technologies. Knowledge is extracted from the raw data during the transformation process, which enables business enterprises to make decisions more reliably and easily. The Mining Environment for Decisions (MinEDec) system has not been evaluated in real business environments.

Hu et al., [12] presented the interesting task of automatically generating presentation slides for academic papers. Importance scores of sentences in the papers are obtained using a support vector regression method, and integer linear programming is then used to generate well-structured slides. The method provides researchers with draft slides that help in preparing the final slides used for presentation. The approach does not handle tables, graphs, and figures in the papers.
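To make the term-based baseline behind such similarity comparisons concrete, the sketch below computes pairwise document similarity from TF-IDF vectors. It is a minimal illustration assuming scikit-learn and a toy corpus, not the concept-based model of [4].

```python
# Term-based document similarity: TF-IDF vectors compared with cosine similarity.
# A minimal sketch; the corpus below is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "text mining discovers patterns in unstructured text",
    "data mining discovers interrelations among data attributes",
    "traffic events are detected from tweets in real time",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # documents x terms sparse matrix
sim = cosine_similarity(tfidf)           # pairwise document similarity

print(sim.round(2))  # sim[0, 1] > sim[0, 2]: the two mining documents agree more
```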
# c) Traffic based Event in Text Mining

Andrea et al., [13] [14] proposed a real-time monitoring system for traffic event detection that fetches tweets, classifies them, and then notifies users about traffic events. Tweets are fetched and processed using text mining techniques, and each tweet related to a traffic event is assigned a class label. If the event is due to an external cause, such as a football match, procession, or manifestation, the system also discriminates that traffic event. The final results show that the system is capable of detecting traffic events, but traffic condition notifications are not captured in real time.

An efficient and scalable system to detect Events from Tweets (ET) [15] from a set of microblogs has been proposed, considering their textual and temporal components. The main goal of the proposed ET system is the efficient use of content similarity and appearance similarity among keywords in order to cluster related keywords. A hierarchical clustering technique based on the common co-occurring features of keywords is used to determine the events [16]. ET is evaluated on two datasets from two different domains, and the results show that relevant events can be detected efficiently. The use of a semantic knowledge base like Yago is not incorporated.

Schulz et al., [17] proposed a machine learning algorithm that combines text classification with enrichment of the semantics of the microblog. It identifies small-scale incidents with high accuracy and precisely localizes microblogs in space and time, which enables it to detect incidents in real time. The algorithm not only gives information about an incident but also yields valuable, previously unknown information about incidents. It does not consider NLP techniques or large data.

ITS (Intelligent Transportation Systems) [18] recognizes traffic panels and extracts the information contained on them. First, white and blue color segmentation is applied, and descriptors are then derived at points of interest. The images are treated as bags of words and classified using Naïve Bayes or SVM (Support Vector Machine) classifiers. This kind of categorization, where images are classified based on visual appearance, is new for traffic panel detection, but multi-frame integration is not addressed.

# II. Feature Selection and Extraction

Text may be loosely organized, with incomplete or omitted information in the documents. The text has to be scanned attentively to determine the problems; if it is not scanned and scrutinised properly, poor accuracy results on unstructured data, and hence preprocessing is necessary. Preprocessing guarantees successful application of text analysis, but may consume substantial processing time. Text processing can be done by two basic methods: a) feature selection and b) feature extraction.

# a) Feature Selection

Research in numerous fields like machine learning, data mining, computer vision, statistics, and linked fields has led to a diversity of feature selection approaches in supervised and unsupervised settings. Feature selection (FS) plays an important role in data mining for the categorization of text. The central idea of feature selection is to reduce the dimension of the feature set by choosing features appropriately, which enhances efficiency and performance. FS is a search process and is categorized into forward search and backward search. Mehdi et al., [19] [20] executed an innovative feature selection algorithm based on Ant Colony Optimization (ACO).
Without any prior knowledge of features, a minimal feature subset is determined by applying ACO [21]. The approach uses a simple nearest neighbor classifier to show the effectiveness of the ACO algorithm: it reduces computational cost and outperforms the information gain and chi-square methods. Complex classifiers and different kinds of datasets are not incorporated, and combining the algorithm with other population-based feature selection algorithms is not considered.

Gasca et al., [22] proposed a feature selection method based on the Multilayer Perceptron (MLP). Under certain objective functions, the approach determines and corrects the set of irrelevant attributes. It computes the relative contribution of each attribute with respect to the output units, and for each output unit the contributions are sorted in descending order. An objective function called prominence is computed for each attribute. Selecting features from large documents is problematic in unsupervised learning because class labels are unknown.

Sivagaminathan et al., [23] [24] proposed a hybrid approach based on a fixed-size subset to solve the feature subset selection problem in neural network pattern classifiers. It considers both individual performance and subset performance. Features are selected using the pheromone trail and heuristic values via state transition rules. After a feature is selected, a global updating rule increments the features, which ultimately gives better classification performance without increasing the overall computational cost.

Ogura et al., [28] proposed an approach to reduce the feature dimension space which measures, for each term, how far its probability distribution deviates from the Poisson distribution; these deviations from Poisson are insignificant for documents that do not belong to the category. Three measures are employed as benchmarks, and with two classifiers, SVM and K-NN, they give better performance than other conventional measures. The Gini index proved better than chi-square and information gain (IG) in terms of macro- and micro-averaged F1. These measures do not utilize the number of times a term occurs in a document, and their computational complexity could not be reduced to that of other typical measures such as IG and CHI.

Feature selection can also be measured based on the term and document frequencies of words. Azam et al., [29] observe these frequencies for measuring FS. The metrics of the Discriminative Power Measure (DPM) and the Gini index (GINI) are compared, and term frequency based metrics are useful for small feature sets. The most important features returned by DPM and GINI tend to cover most of the available information at a faster rate, i.e., with a lower number of features, but they are comparatively slower in covering document frequency information.

Yan et al., [30] presented a graph embedding framework for dimensionality reduction. The framework is also used as a tool that unifies many feature extraction methods. Features are selected based on spectral graph theory, and the framework unifies both supervised and unsupervised feature selection. Zhao et al., [31] developed a similarity preserving feature selection framework to handle redundant features. A combined optimization formulation of sparse multiple-output regression is used for selecting similarity preserving features. The framework does not address existing kernel and metric learning methods or semi-supervised feature selection methods.
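As a concrete illustration of the filter-style metrics compared above, the following sketch ranks terms by the chi-square statistic; it assumes scikit-learn and a toy labeled corpus, and scores such as Gini or DPM would slot into the same pipeline.

```python
# Filter-based feature selection with the chi-square metric: score each term
# against the class labels and keep the k highest-scoring terms.
# A minimal sketch; corpus and labels are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "engine and wheels",
    "engine oil change",
    "rocket launch orbit",
    "orbit around the moon",
]
labels = [0, 0, 1, 1]  # 0 = autos, 1 = space

vec = CountVectorizer()
X = vec.fit_transform(docs)                      # documents x terms counts

selector = SelectKBest(chi2, k=3).fit(X, labels) # rank terms by chi-square
terms = vec.get_feature_names_out()
print([terms[i] for i in selector.get_support(indices=True)])
```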
# 1) Feature Selection based Graph Reconstruction

Feature selection is a major task in efficient data mining, and it poses a significant challenge in small labeled-sample problems. Unlabeled data is abundant, but when labeled data is extremely scarce, supervised feature selection algorithms fail for want of sufficient information. Zhao et al., [32] introduced graph regularized data reconstruction to overcome these problems; the approach achieves higher clustering performance in both unsupervised and supervised feature selection.

Linked social media produces enormous amounts of unlabeled data, and selecting features for unlabeled data is difficult due to the lack of label information. Tang et al., [33] proposed an unsupervised feature selection framework, LUFS (Linked Unsupervised Feature Selection), for linked social media data to overcome this problem. The design constructs pseudo-class labels through social dimension extraction and spectral analysis. LUFS efficiently exploits association information but does not exploit link information. Computer vision and pattern recognition problems both have inherent manifold structure, and a Laplacian regularizer is included, along with a scale factor, to smooth the clustering process.

In text mining applications, several existing systems incorporate NLP-based techniques which parse the text and use patterns for mining and examining the parse trees, which can be trivial or complex. Mousavi et al., [34] formulated a weighted graph representation of text, called TextGraphs, which captures the grammatical and semantic relations between words in textual terms. The framework, called SemScape, creates parse trees for each sentence and uses a two-step pattern-based procedure to extract candidate terms and their grammatical relations from the parse trees.

Due to the absence of label information, it is hard to select discriminative features in unsupervised learning, and existing unsupervised feature selection algorithms frequently select the features that best preserve the data distribution. Yang et al., [35] proposed L2,1-norm regularized Unsupervised Discriminative Feature Selection (UDFS). The algorithm chooses the most discriminative feature subset from the entire feature set in batch mode. UDFS outperforms existing unsupervised feature selection algorithms and selects discriminative features for data representation, but its performance is sensitive to the number of selected features and is data dependent.

Cai et al., [36] presented a novel algorithm called Graph regularized Nonnegative Matrix Factorization (GNMF) [37], which explicitly considers local invariance. In GNMF, the geometrical information of the data space is encoded by constructing a nearest neighbor graph and seeking a parts-based representation space in which two data points are close to each other if they are connected in the graph. GNMF models the data space as a submanifold embedded in the ambient space and achieves more discriminating power than the ordinary NMF approach.
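For reference, ordinary NMF, the baseline that GNMF improves upon, can be run as below. This is a hedged sketch assuming scikit-learn, whose NMF implementation does not include GNMF's nearest-neighbor-graph regularizer; the documents are illustrative.

```python
# Parts-based representation with plain NMF: factor the document-term matrix
# A into document-topic coefficients W and topic-term parts H.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "graph regularized feature selection",
    "nearest neighbor graph construction",
    "traffic events detected from tweets",
    "tweets classified for traffic events",
]

A = TfidfVectorizer().fit_transform(docs)
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(A)   # document x topic coefficients
H = model.components_        # topic x term basis ("parts")
print(W.round(2))
```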
Fan et al., [38] suggested a principled variational framework for unsupervised feature selection with non-Gaussian data, which is applicable to applications ranging over diverse domains and disciplines. The variational framework provides a deterministic alternative to Bayesian approximation by maximizing a lower bound on the marginal likelihood, which has the advantage of computational efficiency.

# 2) Text Summarization and Datasets

Several approaches have been developed to date for automatic summarization by identifying important topics in a single document or clustered documents. Gupta et al., [39] describe a topic representation approach that captures topics through a frequency-driven approach using word probabilities, which gives reasonable performance with conceptual simplicity. Negi et al., [40] developed a system that summarizes information from a group of documents. The proposed system constructs the information from the given text; it achieves high accuracy but cannot calculate the relevance of the documents.

Debole et al., [41] first explain the three phases in the life cycle of a TC system: document indexing, classifier learning, and classifier evaluation. Most research uses the Reuters-21578 corpus for TC experiments, and several studies have used the ModApte split for testing. One of the subsets used for the experiments is the set of ten categories with the largest number of positive training examples.

Xie et al., [42] proposed an approach to acquiring semantic features within phrases from a single document in order to extract document keyphrases. The keyphrase extraction method consistently performs better than TF-IDF and KEA. Keyphrase extraction is basic research in text mining and natural language processing. The method builds on semantic relatedness: the degree of relatedness between phrases is calculated from their co-occurrences in a given document and represented as a relatedness graph. The approach is not domain specific; it generalizes well to journal articles and has been tested on news web pages.

Obtaining information online is an easy task: we log on to the world wide web and give simple keywords. However, it is not easy for the user to read all of the information returned. Hence text summarization is needed.
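A frequency-driven extractive summarizer of the kind described above [39] can be sketched in a few lines: score each sentence by the corpus probabilities of its words and keep the top-scoring sentences. The tokenization and scoring below are simplified illustrations, not a cited system.

```python
# Minimal frequency-driven extractive summarization: sentences whose words
# have high corpus probability are kept, in their original order.
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    prob = Counter(words)
    total = sum(prob.values())

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(prob[t] / total for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in ranked]  # preserve document order

text = "Text mining extracts patterns. Patterns reveal knowledge. The weather is nice."
print(summarize(text))  # keeps the two pattern-related sentences
```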
# b) Feature Extraction

Figure: collection of text documents → extraction of useful features → feature weighting (specificity) → data fusion → duplicate detection → duplicate-free relevant features.

Zhong et al., [44] presented an effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, as shown in Table 2, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. The proposed model outperforms other pure data mining-based methods, concept-based models, and term-based state-of-the-art models such as BM25 and SVM.

Li et al., [47] proposed two algorithms, FClustering and WFeature, to discover both positive and negative patterns in text documents. FClustering classifies terms into three categories (general, positive, and negative) automatically, without manually set parameters; WFeature is then executed to calculate the weights of the terms. WFeature is effective because the number of selected terms is less than the average size of the documents. The model is evaluated on the RCV1 corpus, TREC topics, and Reuters-21578, as shown in Table 2, and performs much better than term-based and pattern-based methods. The use of an irrelevance feedback strategy is highly efficient for improving the overall performance of the relevance feature discovery model.

Xu et al., [26] experimented on microblog dimensionality reduction with a deep learning approach, which aims at extracting useful information from the large amounts of textual data produced by microblogging services. The approach involves mapping natural language texts into proper numerical representations, which is a challenging issue. Two approaches, modifying the training data and modifying the training objective of deep networks, are presented to exploit microblog-specific information. Meta-information contained in tweets, such as embedded hyperlinks, is not explored.

Nguyen et al., [49] worked on review selection using micro-reviews. The approach consists of two steps: matching review sentences with micro-reviews, and selecting a few reviews that cover many micro-reviews. A heuristic algorithm is computationally fast and provides informative reviews.

# III. Pattern Extraction

Patterns which are close to their super-patterns appearing in the same paragraph are termed closed relations and need to be eliminated. The shorter pattern is not considered, since it is meaningless, while the longer pattern is more meaningful; hence these are the significant patterns in the pattern taxonomy.

Abonem et al., [53] presented a text mining framework that discovers knowledge by preprocessing the data. Text in documents usually contains words, special characters, and structural information, and the special characters are replaced by symbols. The framework mainly focuses on filtering out uninteresting patterns, which decreases the time and the size of the search space needed for the discovery phase; it is more efficient when large collections of documents are considered. Post-processing involves pruning, organizing, and ordering the results. The rule for each document is to find a set of characteristic phrases and keywords, i.e., length, tightness, and mutual confidence. The ranking of the rules within a document is measured by calculating a weight for each rule.

# a) Mining Closed Sequential Patterns

Mining the entire set of frequent subsequences for every long pattern generates an uncontrollable number of frequent subsequences, which is expensive in space and time. Yan et al., [54] proposed mining only frequent closed subsequences through the CloSpan (Closed Sequential Pattern Mining) algorithm. CloSpan efficiently mines frequent closed sequences in large data sets with low minimum support but does not take advantage of the search-space pruning property. Gomariz et al., [55] presented the CSpan algorithm, which mines closed sequential patterns early by using a pruning method called occurrence checking. CSpan outperforms CloSpan and the ClaSP algorithm.

# b) Mining Sequential Patterns

To delimit the search and to grow subsequence fragments, Han et al., [57] proposed FreeSpan (Frequent Pattern-Projected Sequential Pattern Mining). FreeSpan fuses the mining of frequent sequences with that of frequent patterns and adopts projected sequence databases. FreeSpan runs faster than the Apriori-based GSP algorithm and is highly scalable and processing-efficient in mining the complete set of patterns, but it causes page thrashing, as it requires extra memory.

With extensive applications in data mining, sequential pattern mining encounters problems on very large databases. Pei et al., [58] proposed a sequential pattern mining method called PrefixSpan (Prefix-Projected Sequential Pattern Mining). The complete set of patterns is extracted while reducing the generation of candidate subsequences, and prefix projection substantially reduces the size of the projected databases and greatly improves efficiency, as shown in Table 3.
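The prefix-projection idea can be illustrated with a minimal PrefixSpan-style recursion, simplified here to sequences of single items; the published algorithm also handles itemset elements and stronger pruning, and the database and support threshold below are illustrative.

```python
# Prefix projection in the PrefixSpan style: grow a pattern by frequent items,
# then recurse on the projected (suffix) database of each extended prefix.
def prefixspan(db, min_support, prefix=None):
    prefix = prefix or []
    results = []
    # count support of each item in the projected database
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = prefix + [item]
        results.append((pattern, support))
        # project: keep the suffix after the first occurrence of the item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        projected = [s for s in projected if s]
        results += prefixspan(projected, min_support, pattern)
    return results

db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
for pattern, support in prefixspan(db, min_support=2):
    print(pattern, support)   # e.g. ['a', 'b', 'c'] 2
```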
Using RE (regular expressions) [59] as a flexible constraint, the SPIRIT algorithm was proposed by Garofalakis et al., [60] for mining sequential patterns. A family of four algorithms is executed, each enforcing a stronger relaxation of the RE than its predecessor in the pattern mining loop; candidate sequences containing elements that do not appear in the RE are pruned. The degree to which RE constraints are enforced to prune the search space of patterns is the main distinguishing factor. Results on real-life data show the RE's adaptability as a user-level tool for focusing on interesting patterns.

Jian et al., developed a new framework called Pattern Growth (PG). PG is based on the prefix monotone property: every monotone and anti-monotone regular expression constraint is preprocessed and pushed into a PG-based mining algorithm. PG adopts and handles regular expression constraints which are difficult to explore using Apriori-based methods like SPIRIT. The candidate generation-and-test framework adopted by PG is less expensive and more efficient at pushing many constraints than the SPIRIT method. During prefix growth, many irrelevant sequences can be excluded from the huge dataset, so the projected database quickly shrinks. While PG outperforms SPIRIT, interesting constraints specific to complex structure mining are not explored.

To filter the discovered patterns, Li et al., [43] [61] proposed an effective pattern discovery technique that deploys and evolves patterns to refine the discovered patterns. Using these discovered patterns, the relevant information can be determined in order to improve effectiveness. Not all frequent short patterns and long patterns are useful, and long patterns with high specificity suffer from the problem of low frequency. The problems of low frequency and misinterpretation in text mining can be solved by employing pattern deploying strategies.

Rather than using individual words, some research uses phrases to discover relevant patterns from document collections. This yields only a small improvement in the effectiveness of text mining, because phrase-based methods have low consistency of assignment and low document frequency for terms. Inje et al., [62] used a pattern-based taxonomy (is-a relation) to represent documents rather than single words. The computational cost is reduced by pruning unwanted patterns, which improves the effectiveness of the system.

Bayardo et al., [63] evaluated the Max-Miner algorithm for mining maximal frequent itemsets from large databases. Max-Miner reduces the space of itemsets considered through superset-frequency-based pruning. There are performance improvements over Apriori-like algorithms when frequent itemsets are long, and more modest though still substantial improvements when frequent itemsets are short. Completeness at low supports on complex datasets is not achieved.
Jan et al., [64] [65] proposed propositionalization and classification methods that employ long first-order frequent patterns for text mining. The framework solves three text mining tasks: information extraction, morphological disambiguation, and context-sensitive text correction. The propositionalization approach outperforms CBA by using frequent patterns as features; the performance of CBA classifiers greatly depends on the number of class association rules and the threshold values given by the user. The framework shows that distributed computation can improve the performance of both methods, since a large sample of data and a larger number of features are extracted.

Seno et al., [66] proposed the SLPMiner algorithm, which finds all sequential patterns satisfying a length-decreasing support constraint and performs effectively as the average length of the sequences increases. It is expensive, as pruning is not considered in this work. Nizar et al., [67] demonstrate a taxonomy of sequential pattern mining techniques. The search space can be reduced by strongly minimizing the support count. Domain knowledge and distributed sequences are not considered in the mining process.

# c) Mining Frequent Sequences

To extract sequential patterns, various algorithms have been executed that make repeated scans of the database and use hash structures. Zaki et al., [68] presented a novel algorithm, SPADE, for discovering sequential patterns at high speed. SPADE decomposes the parent class into small subclasses, and these subproblems are solved independently of one another in main memory using a lattice approach. The lattice approach needs only one scan given some pre-processed data. Depth-first search and breadth-first search are used for frequent sequence enumeration within each sublattice. With these search strategies, SPADE minimizes computational and I/O costs by reducing the number of database scans, and it provides pruning strategies to identify interesting patterns and prune out irrelevant ones. BFS outperforms DFS by having more information available for pruning while constructing the sets of 2-sequences and 3-sequences, but BFS requires more main memory: BFS keeps track of the id-lists for all classes, while DFS needs to preserve intermediate id-lists only for two consecutive classes along a given path.

Han et al., [69] proposed the FP-tree (frequent pattern tree) structure, from which the complete set of frequent patterns can be extracted by pattern fragment growth. Three techniques achieve mining efficiency: (i) the database is compressed, (ii) the FP-tree avoids expensive repeated database scans, and (iii) the FP-tree prevents the generation of large numbers of candidate sets, using a divide-and-conquer method which breaks the mining task into a set of smaller tasks with a lower search space. The FP-growth method [70] is efficient and scalable for extracting both long and short frequent patterns, and it is faster than the Apriori algorithm.

Zhang et al., [71] executed the CFP (Constrained Frequent Pattern) algorithm to improve the efficiency of association rule mining. The algorithm is incorporated in an interrelation analysis model for celestial spectrum data and extracts correlations among the characteristics of the celestial spectra. The model does not support different application domains.
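As a usage illustration of FP-tree-based mining, the sketch below assumes the third-party mlxtend library and a toy transaction database; it is not any of the cited implementations.

```python
# Frequent itemsets via an FP-growth implementation: one-hot encode the
# transactions, then mine without explicit candidate generation.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
    ["bread"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```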
# d) Mining Frequent Itemsets using MapReduce

Database management systems have evolved over the last four decades and are now functionally rich, but operating on and managing very large amounts of business data remains a challenging task. MapReduce [72] [73] is a framework that processes and manages very large datasets in distributed clusters efficiently and achieves parallelism. Xun et al., [74] [75] executed the FiDoop algorithm using the MapReduce model. FiDoop uses frequent itemsets of different lengths to improve the workload balance metric across clusters. It handles very high dimensional data efficiently but does not work on heterogeneous clusters for mining frequent itemsets. Wang et al., [76] proposed the FIMMR (Frequent Itemset Mining MapReduce) framework algorithm, which initially extracts local frequent itemsets, applies a pruning technique, and later mines global frequent itemsets. The speedup of the algorithm is satisfactory under low minimum support thresholds. Ramakrishnudu et al., [77] find infrequent itemsets from huge data using the MapReduce framework; the efficiency of the framework increases as the size of the data increases, and the framework produces few intermediate items during the process. Ozkural et al., [78] extract frequent itemsets by partitioning the graph with a vertex separator, so that item distributions can be mined independently. The parallel frequent itemset algorithm replicates the items that correlate with the separator, minimizing redundancy and achieving load balancing. Relationships among very large numbers of items in real-world databases are not incorporated.

# e) Relevance Feedback Documents

Xu et al., [79] presented an Expectation Maximization (EM) algorithm for relevance feedback that handles overlaps in feedback documents. Based on the Dirichlet compound multinomial (DCM) distribution, EM includes background collection model reduction through deterministic annealing and query-based regularization. Queries that have no relevance feedback need improvement by combining pseudo-relevance feedback and relevance feedback in a hybrid feedback paradigm. Instead of using static regularization, the authors adjust the regularization parameter based on the percentage of relevant feedback documents [80]. Further, the design formulates the space for a newer document progressively; weighted relevance is computed in an experimental design which exploits the top retrieved documents by adjusting the selection scheme. The relevance score algorithms still need to be validated on several TREC datasets.

Cao et al., [81] re-examined the assumption that the most frequent terms in pseudo-feedback documents are useful and proved that it does not hold in reality. Good and bad expansion terms cannot be distinguished within the feedback documents: exploiting the difference in term distributions between feedback documents and the whole collection through a mixture model shows that good and bad expansion terms may have similar distributions, so the mixture model fails to distinguish them. Experiments are conducted so that each query keeps only the good expansion terms: a new query model integrates the good terms, and term classification is performed to improve retrieval effectiveness. In the final query model, the classification score is used to enhance the weight of the good terms. The selected expansion terms are significantly better than traditional expansion terms when evaluated on three TREC datasets; the selection of terms has to be done carefully.
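A classic way to realize such feedback is Rocchio-style query modification, q' = αq + β·centroid(relevant) − γ·centroid(non-relevant). The sketch below is a minimal illustration assuming scikit-learn, with illustrative weights and documents, not the cited expansion-term models.

```python
# Rocchio-style relevance feedback: move the query vector toward relevant
# documents and away from non-relevant ones, then read off expansion terms.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["traffic jam on highway", "traffic accident report", "recipe for pasta"]
vec = TfidfVectorizer()
D = vec.fit_transform(docs).toarray()
q = vec.transform(["traffic"]).toarray()[0]

relevant, nonrelevant = D[:2], D[2:]
alpha, beta, gamma = 1.0, 0.75, 0.15      # illustrative Rocchio weights
q_new = alpha * q + beta * relevant.mean(axis=0) - gamma * nonrelevant.mean(axis=0)
q_new = np.clip(q_new, 0, None)           # keep term weights non-negative

top = np.argsort(q_new)[::-1][:3]
print([vec.get_feature_names_out()[i] for i in top])  # candidate expansion terms
```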
Pak et al., [82] proposed an automatic query expansion algorithm which incorporates an incremental blind-feedback approach to choose feedback documents from the top retrieved lists and then finds terms by aggregating the scores from each feedback document. The algorithm performs significantly better on large documents.

Algarni et al., [83] proposed adaptive relevance feature discovery (ARFD), which updates the system's knowledge using a sliding window over positive and negative feedback. The system selects training documents from which specific features are discovered, and various methods are used to merge and revise the weights of the features in a vector space. Documents are selected based on two scenarios: either the user provides information on a topic of interest, or the user has changed the topic of interest.

# IV. Text Classification and Clustering

Text categorization [84] is a significant issue in text mining. In general, documents contain large amounts of text, and it is necessary to classify them into specific classes. Text categorization can be broadly divided into supervised and unsupervised classification. Classifying documents manually is a very costly and time-consuming task; hence it is necessary to construct automatic text classifiers from pre-classified sample documents, whose time efficiency and accuracy are much better than manual text classification. Computer programs often treat the document as a bag of words. The main characteristic of text categorization is a feature space of high dimensionality: even for moderately sized text documents, the feature space consists of hundreds of thousands of terms.

Sebastiani et al., [85] review the standard machine learning approaches to text categorization. The review also describes the problems faced in document representation, classifier construction, and classifier evaluation. The experimental study compares different classifiers on different versions of the Reuters dataset. Text categorization is a good benchmark for checking whether a given learning technique can scale up to substantial sizes.

Irfan et al., [86] review different pre-processing techniques in text mining for extracting textual patterns from social networking sites. To explore the unstructured text available on the social web, the basic text mining approaches of classification and clustering are covered.

Wu et al., [87] present a technique consisting of three preprocessing stages to recognize text regions in images of large size and contrast. A segmentation algorithm alone cannot identify the changes that happen in both the color and the illumination of characters in a document image. The technique extracts grayscale images, for example from a book cover or magazine, from the RGB planes with associated weights. A multilevel thresholding process is applied to each grayscale image independently to identify text regions, and a recursive filter is executed to determine which connected components are textual components. A scoring approach is used to find the probabilistic text regions of the resulting images: if a region has the maximum score, it is classified as a textual component.
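A minimal supervised text-categorization pipeline of the kind surveyed in this section, bag-of-words features feeding a linear classifier, can be sketched as follows; the training data and the expected prediction are illustrative, and this is not any one cited system.

```python
# Supervised text categorization: TF-IDF bag-of-words features + linear SVM.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["goal scored in the match", "election results announced",
               "team wins the cup", "parliament passes new bill"]
train_labels = ["sport", "politics", "sport", "politics"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["the team lost the final match"]))  # expected: ['sport']
```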
# V. Discovering Facets for Queries from Search Results

A facet is a word or a phrase, and a query facet is a set of items which summarize an important aspect of a query. Dou et al., [88] [89] [90] explore the problem of finding the set of facets for a user query. A system called QDMiner is proposed to mine facets automatically. Experiments are conducted on hundreds of queries, and the results show the effectiveness of the system, as shown in Table 5. It provides interesting knowledge about a query and improves search for users in different ways. The problem of generating query suggestions based on query facets, which might help users find a better query more easily, is not considered.

Multifaceted search is an important paradigm for extraction and mining applications that lets users analyze and navigate through multidimensional data. Faceted search [91] can also be applied to the spoken web search problem, indexing the metadata associated with audio content to provide an audio search solution for rural people. The query interface ensures that a user is able to narrow the search results quickly. The approach focuses on the indexing system and does not generate precision-recall results on a labeled dataset.

Kong et al., [96] incorporated user feedback on query facets into document ranking, evaluating the boolean filtering feedback models that are widely used in conventional faceted search, where facets are generated automatically for a user query instead of for a complete corpus. The boolean filtering model is less effective than soft ranking models.

Bron et al., [97] proposed a novel framework that adds type filtering based on the category information available in Wikipedia. Combining a language modelling approach with heuristics based on Wikipedia's external links, the framework achieves high recall scores in finding the homepages of top-ranked entities. The model returns entities that have not been judged.

Navarro et al., [98] developed an automatic facet generation framework for efficient document retrieval. A new approach to extracting facets is developed which is both domain independent and unsupervised, and it generates multifaceted topics effectively. Subtopics in the text collection are not investigated.

Liu et al., [99] presented a study of exploring topical lead-lag across corpora. Determining which text corpus leads and which lags on a topic is a big challenge. TextPioneer, a visual analytics tool, is introduced to investigate lead-lag across corpora from the global level to the local level. Multiple perspectives on the results are conveyed by two visualizations: global lead-lag as a hybrid tree, and local lead-lag as a twisted ladder. TextPioneer does not analyze topics within each corpus and across corpora.

Jiang et al., [100] presented the Cross-Lingual Query Log Topic Model (CL-QLTM) to derive the latent topics of web search data from query logs. The model incorporates different languages by collecting co-occurrence relations and cross-lingual dictionaries from query logs. CL-QLTM is effective and superior in discovering latent topics, but the model has not been applied to statistical machine translation.

Cafarella et al., [101] exploited interesting knowledge from web pages which has higher relevance to users when compared to traditional approaches. The system records co-occurrences of schema elements and helps users in navigation, creating synonyms for schema matching.

WordNet Domains has been used to derive facet hierarchies for text documents. The queries given by the user are free-text queries, and mapping keywords to the different attributes and values of a given entity is a challenging task.
Castanet is simple and effective and achieves higher quality results than other automated category creation algorithms, but WordNet is not exhaustive, and further mechanisms are needed to improve coverage for unknown terms.

Pound et al., [102] proposed a solution that exploits user faceted-search behaviour and structured data to estimate facet utility. The approach captures values and conditional values, providing attributes and values according to user preferences. Experimental results show that the approach is scalable and also outperforms popular commercial systems.

Altingovde et al., [103] demonstrate a static index pruning technique that incorporates query views, both document-centric and term-centric. The technique improves the quality of the top-ranked results, but when web pages change frequently, the original index is not updated.

Koutris et al., [104] proposed a framework for pricing data based on queries. A polynomial-time algorithm is executed for a large class of conjunctive queries, and the results show that the data complexity of instance-based determinacy is coNP-complete. The framework does not explore the interaction between pricing and privacy.

Liu et al., [106] developed a tool that automatically differentiates structured data from search results. A feature-type-based approach is introduced which identifies valid features and evaluates the quality of features using exact and heuristic computation methods. The method achieves local optimality and avoids dependency on random initialization. Result differentiation (whether the selected features are of interest to users or not) is not incorporated.

Liu et al., [107] proposed a matrix representation to explore collections of documents based on user interests. A multidimensional visualization is presented to overcome users' difficulty in comparing across different facet values. The approach further enables visual ordering based on facet values to support cross-facet comparison of items and to support users in exploration tasks. Intra-document details are unavailable, and visual scalability is not incorporated.

The authors of [105] proposed two methodologies for extracting user tasks when users search for relevant data in a search engine. The method identifies user query logs and aggregates similar user tasks based on supervised and unsupervised approaches. It is effective in detecting similar latent needs from a query log, but users' task-by-task search behaviour is not represented in the model.

Colini et al., [111] [112] propose a multiple-keyword mechanism for search auctions with budgets and bidders, where each bidder is bounded by multiple slots per keyword. Bidders have cumulative valuations over click-through rates and budgets, which frame the overall study of the multiple-keyword mechanism. The mechanism is incentive compatible, optimal, and rational in expectation. In the combinatorial setting, each bidder is directly involved in a subset of keywords; deterministic mechanisms with tempered marginal valuations are incompatible.

Wu et al., [113] introduced the concept of safe zones and study the moving top-k keyword query. Safe zones save time and communication cost; the approach computes the safe zone in order to optimize server-side computation and to establish client-server communication. Spatial keywords are not processed, and the safe zone does not account for the future path of the moving query.
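The core scoring idea behind such top-k spatial keyword queries, combining textual relevance with spatial proximity, can be sketched as below; the linear weighting, data, and distance handling are illustrative simplifications, not the cited algorithms.

```python
# Toy top-k spatial keyword query: rank objects by a weighted combination of
# keyword overlap and inverse spatial distance to the query location.
import math

objects = [
    {"name": "cafe alpha", "text": "coffee cake wifi", "loc": (0.0, 0.0)},
    {"name": "cafe beta",  "text": "coffee espresso",  "loc": (3.0, 4.0)},
    {"name": "bookstore",  "text": "books magazines",  "loc": (1.0, 1.0)},
]

def score(obj, query_terms, query_loc, alpha=0.5):
    terms = obj["text"].split()
    text_rel = len(set(query_terms) & set(terms)) / len(query_terms)
    dist = math.dist(obj["loc"], query_loc)
    return alpha * text_rel + (1 - alpha) / (1 + dist)

def top_k(query_terms, query_loc, k=2):
    return sorted(objects, key=lambda o: score(o, query_terms, query_loc),
                  reverse=True)[:k]

print([o["name"] for o in top_k(["coffee"], (0.5, 0.5))])
```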
Lu et al., [114] proposed the reverse spatial keyword k-nearest-neighbour query to find the objects for which the query object is one of their nearest neighbours. The search is based on spatial location as well as the text associated with it. The algorithm prunes unnecessary objects and computes the candidate lists. The method does not consider the textual descriptions of two different objects.

Cao et al., [115] demonstrate the concept of weighting a query. Spatial keyword matching considers both the location and the text, and the method focuses on finding queries for groups of objects by grouping spatial objects. Top-k spatial keyword search with query weighting improves performance and efficiency; the computational time is reduced, but partial coverage of queries is not considered.

Hon et al., [95] developed space-efficient frameworks for the top-k string retrieval problem, considering two relevance metrics: frequency and proximity. A threshold-based approach on these metrics is also used. Compact-space and succinct-space indexes are derived, which yield index space and query time with significant robustness. The framework is robust, but it does not index in the cache-oblivious model, the index takes twice the size of the text, and multiple patterns are not handled.

Zhang et al., [94] proposed SPP (Space Partition and Probing) to keep track of object positions and relevance to the query and to search the vector space. Quality is achieved using MMR, one of the important diversification algorithms. The method identifies the next top-k objects very quickly; SPP helps in reducing the object axis and increases performance. A fixed bounded region is not considered.

Zhang et al., [93] proposed the inverted linear quadtree (IL-quadtree) index structure, which combines spatial and keyword-based techniques to effectively decrease the search space. Spatial keyword queries come in two variants: top-k spatial keyword search (TOPK-SK), which fetches the closest k objects containing all keywords in the query, and batch top-k spatial keyword search (BTOPK-SK), which processes sets of top-k queries. The IL-quadtree first uses a keyword-first index to retrieve the related inverted indexes; a partition-based method is proposed to further enhance the filtering capability of the signature of the linear quadtree.

Efstathiades et al., [92] present Links of Interest (LOI) to improve the quality of users' queries. A query processing method based on k-Relevant Nearest Neighbor (k-RNN) queries is proposed to analyse LOI information and retrieve relevant location-based points of interest, as shown in Table 3. The method captures the relevance aspect of the data, but a relevance score is not computed.

Catallo et al., [108] proposed a probabilistic k-skyband to process the subset of sliding-window objects that are the most recent data objects. The algorithm outperforms for large values of the parameter k, both in memory consumption and in time reduction. Adaptive top-k processing is not incorporated in the approach.

Bast et al., [109] presented pre-processing techniques to achieve interactive query times on large text collections. Two similarity measures are considered: first, matching query terms to similar terms in the collection; second, matching query terms to terms with a similar prefix in the collection. Results are displayed quickly, and the techniques are efficient and scalable.

Termehchy et al., [110] introduced an approach for searching XML structures by keyword effectively, since traditional keyword search techniques do not support data-centric XML well. They put forth Coherency Ranking (CR), a database-design-independent ranking method for XML keyword queries that is based on extending the concepts of data dependencies and mutual information. With the concepts of CR, prior approaches to XML keyword search are analyzed, and approximate coherency ranking and efficient algorithms are presented to process queries and rank their answers. CR shows better precision and recall and provides better ranking than prior approaches.

# VI. Fine Grained Knowledge

Guan et al., [116] suggested the "tcpdump" method to capture users' web surfing activities. Web surfing activity reflects a person's fine-grained knowledge, which can be discovered by recognizing its semantic structures. A Dirichlet process infinite Gaussian mixture model is adopted for session clustering, and the D-iHMM process is employed for mining the fine-grained aspects of each session. Discovering the fine-grained knowledge reflected in people's interactions has made knowledge sharing in collaborative environments much easier, although privacy is a major issue.

Wang et al., [117] analysed users' search behaviors, considering inter-query dependencies. A semi-supervised clustering model based on the SVM framework is proposed. The model enables a more comprehensive understanding of users' search behaviors via query search logs and facilitates the development of search-engine support for long-term tasks. The performance of the model is superior in identifying cross-session searches. User modeling and long-term task-based personalization are not considered.

Kotov et al., [118] proposed a method for creating a semi-automatically labeled dataset that can be used to identify a user's queries from earlier sessions on the same task and to predict whether the user will return to the same task in a later session. Using logistic regression and MART classifiers, the method can effectively model cross-session information needs. The model is not incorporated in commercial search engines.

# VII. Duplicate Detection and Data Fusion

Duplicate detection is the methodology of identifying multiple semantic representations of the same real-world entities. Present-day detection methods need to process ever larger datasets in ever less time, which makes maintaining the overall quality of the datasets harder. Papenbrock et al., [119] proposed progressive duplicate detection methods, shown in Table 4, which find duplicates efficiently and reduce the overall processing time by reporting most of the results earlier than the existing classical approaches. Bano et al., [120] executed an innovative windows algorithm that adapts the window for duplicates and non-duplicates, so that unnecessary comparisons are avoided.

Duplicate records are a vital problem and a concern in knowledge management [124]. To extract duplicate data items, an entity resolution mechanism is employed in the cleanup procedure. The overall evaluation reveals that the clustering algorithms perform extraordinarily well, with high accuracy and F-measure.
Whang et al., [125] investigate enhancing entity resolution (ER) by focusing on the likely matching records first, using three types of hints that are compatible with different ER algorithms: (i) an ordered list of records, (ii) a sorted list of record pairs, and (iii) a hierarchy of record partitions. The underlying disadvantage of the process is that it is useful only for database contents.

Duplicate records do not share a common key, which makes duplicate matching a tedious task. Errors are induced as the result of transcription errors, incomplete information, and the lack of standard formats. Abraham et al., [126] [127] provide a survey of different techniques used for detecting duplicates in both XML and relational data, using elimination rules to detect duplicates in databases. Elmagarmid et al., [128] present an intensive analysis of the literature on duplicate record detection, covering the various similarity metrics that detect duplicate records in the available information. The strength of the survey is its analysis of the statistics and machine learning work aimed at developing more refined matching techniques based on probabilistic models.

Deduplication is an important issue in the era of huge databases [129]. Various indexing techniques have been developed to reduce the number of record pairs to be compared in the matching process. The candidates generated by these techniques have high efficiency and scalability and have been evaluated using various data sets. Training data in the form of true matches and true non-matches is often unavailable in real-world applications, and the choice of blocking keys is commonly left to domain and linkage experts.

Papadakis et al., [122] presented blocking methods for clean-clean ER over Highly Heterogeneous Information Spaces (HHIS) through an innovative framework which comprises two orthogonal layers: the effectiveness layer incorporates methods for building blockings with a small probability of missed matches, while the efficiency layer comprises a rich variety of techniques that restrict the required number of pairwise comparisons. Papadakis et al., [123] focus on boosting the overall blocking efficiency of this quadratic task for entity resolution over large, noisy, and heterogeneous information spaces.

The problem of merging many large databases is often encountered in KDD. It is usually referred to as the Merge/Purge problem and is difficult to solve at scale with accuracy. Record linkage [130] is a well-known data integration strategy that uses sets for merging, matching, and eliminating duplicate records in large and heterogeneous databases. The suffix grouping methodology facilitates the causal ordering used by the indexes for merging blocks with the least marginal extra cost, resulting in high accuracy. Efficient grouping of similar suffixes is carried out by incorporating a sliding window technique. The method is helpful for various health records in understanding patients' details, but it is not very efficient, as it concentrates only on blocking and not on the windowing technique. Additionally, the methodology detects duplicates approximately using a state-of-the-art scalable paradigm [131]; it is quite helpful in creating clusters of records.
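To make the blocking-plus-windowing idea concrete, the sketch below sorts records on a crude blocking key and compares only records inside a sliding window; the key, similarity measure, and threshold are illustrative, not those of the cited systems.

```python
# Sorted-neighborhood style duplicate detection: sort on a blocking key, then
# compare only records within a sliding window over the sorted order.
from difflib import SequenceMatcher

records = ["John Smith, Berlin", "Jon Smith, Berlin",
           "Jane Doe, Paris", "J. Smith, Berlin"]

def blocking_key(record):
    # crude key: sorted, lowercased tokens, truncated
    return "".join(sorted(record.lower().split()))[:8]

def find_duplicates(records, window=3, threshold=0.8):
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:   # only nearby records
            if SequenceMatcher(None, rec, other).ratio() >= threshold:
                pairs.append((rec, other))
    return pairs

print(find_duplicates(records))  # the three "Smith, Berlin" variants pair up
```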
Bronselaer et al., [132] focused on an information aggregation approach which combines information and rules available from independent sources into a summary. Information aggregation is investigated in the context of inferring objects from several entity relations. Complex objects are composed using merge functions for atomic and subatomic objects, in such a way that the composite function inherits the properties of the merge functions.

The Sorted Neighborhood Method (SNM) studied by Draisbach et al., [133] partitions the data set, and comparisons are performed within the jurisdiction of each partition. A window is then advanced over the data, comparing the records that appear within the range of the same window. The Duplicate Count Strategy (DCS), a variation of SNM, regulates the window size, and DCS++ is proposed, which is much better than the original SNM in terms of efficiency. The disadvantage is that the window size is fixed and is expensive to select and operate; some duplicates might be missed when large windows are used.

Tuples in the relational structure of a database that represent the same real-world entity are described as duplicates. Deleting these duplicates and replacing them with a single tuple that represents the joint information of the duplicate tuples up to a maximum level is termed fusion. The removal of the original duplicate tuples can violate referential integrity. Bronselaer et al., [121] describe a technique to maintain referential integrity: a fusion propagation algorithm based on first-order and second-order fusion derivatives resolves conflicts and clashes. Traditional referential integrity strategies, like DELETE cascading, are highly sophisticated; execution time and the recursive calls of the propagation algorithm increase as the length of the chain of linked relations increases.

Bleiholder et al., propose SQL Fuse, covering its schema and semantics, architecture, query language, and query execution. The final step of actually aggregating data from multiple heterogeneous sources into a consistent and homogeneous dataset is often neglected. Naumann et al., [134] observe that noisy data abound from several data sources; without suitable techniques for integrating and fusing noisy data with deviations, the quality of data in an integrated system remains extremely low. Tentative and declarative integration of noisy and scattered data requires incorporating schema matching, duplicate detection, and fusion. Subject to an SQL-like query against a set of table instances, instance-oriented schema matching bridges the varied tables by aligning their corresponding attributes; a duplicate detection technique is then used to find multiple representations of the same entities; finally, data fusion resolves conflicts and merges the representations. Bleiholder et al., [135] explain a conceptual classification of different data fusion operators; numerous techniques are based on standard and advanced relational algebra operators and SQL.
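The merge-function view of fusion can be sketched as below: duplicate records are replaced by one record whose attributes are resolved by per-attribute conflict-resolution functions. The functions and records are illustrative, not the cited operators.

```python
# Data fusion via per-attribute merge functions: resolve conflicting values
# among duplicate records into one fused record.
def longest(values):
    # prefer the most complete string
    return max(values, key=len)

def most_recent(values):
    # values are (timestamp, value) pairs; keep the newest value
    return max(values)[1]

duplicates = [
    {"name": "J. Smith", "city": "Berlin", "updated": (2015, "Berlin")},
    {"name": "John Smith", "city": "", "updated": (2016, "Munich")},
]

fused = {
    "name": longest([r["name"] for r in duplicates]),
    "city": most_recent([r["updated"] for r in duplicates]),
}
print(fused)  # {'name': 'John Smith', 'city': 'Munich'}
```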
Table 2 summarizes representative duplicate detection techniques.

Table 2: Duplicate detection techniques.

| Sl. No. | Authors | Algorithm | Window selection | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | Papenbrock et al. (2015) [119] | PSNM | Adaptive | Efficient with limited execution time | Delivers results moderately |
| 2 | Bano et al. (2015) [120] | Innovative Window | Adaptive | Unnecessary comparisons are avoided | Does not support multiple datasets |
| 3 | Bronselaer et al. (2015) [121] | Fusion-Propagation | - | Conflicts in relationship attributes are resolved | More expensive |
| 4 | Papadakis et al. (2013) [122] | Attribute Clustering | - | Effective on real-world datasets | Low-quality blocks; parallelization is not adopted |
| 5 | Papadakis et al. (2011) [123] | - | Adaptive | Time complexity is reduced | Process is very slow |

Tuples in the relational structure of a database that describe the same real-world entity are called duplicates. Deleting these duplicates and replacing them with other tuples that represent the joint information of the duplicates up to a maximum level is a delete-then-replace mode of operation termed fusion. The removal of the original duplicate tuples, however, can violate referential integrity. Bronselaer et al., [121] describe a technique to maintain referential integrity: a fusion propagation algorithm based on first- and second-order fusion derivatives resolves the resulting conflicts and clashes. Traditional referential integrity strategies, such as DELETE cascading, are highly sophisticated, and both execution time and the number of recursive calls to the propagation algorithm increase with the length of the chain of linked relations. Bleiholder et al., propose the SQL FUSE BY statement, defining its schema and semantics and addressing architecture, query languages and query execution; the final step of actually aggregating data from multiple heterogeneous sources into a consistent and homogeneous dataset is often not considered. Naumann et al., [134] observe that noisy data are abundant across data sources; without suitable techniques for integrating and fusing noisy data with deviations, the quality of data in an integrated system remains extremely low. Tentative and declarative integration of noisy and scattered data is therefore needed, incorporating schema matching, duplicate detection and fusion. Given an SQL-like query against a set of table instances, instance-oriented schema matching bridges the gap between the varied tables by aligning their corresponding attributes; a duplicate detection technique then finds the multiple representations of matching entities; finally, data fusion resolves the remaining conflicts and merges the duplicate representations. Bleiholder et al., [135] explain a conceptual classification of the different operators for data fusion, with numerous techniques based on standard and advanced relational algebra operators and SQL. The concept of co-clustering has been explored through several techniques for tapping the rich meta-tag information associated with multimedia web documents, including annotations, descriptions and associations. The varied co-clustering mechanisms proposed for linked data obtained from multiple sources do not tackle the representation problem of short, noisy texts, but rather improve performance through empirical measurement of the multi-modal features. The two-channel Heterogeneous Fusion ART (HF-ART) fuses multiple channels divergently; its generalization GHF-ART [136] is designed to effectively represent multimedia content and incorporates meta data to handle short and noisy texts. GHF-ART is not trained directly on the text features; instead, key tags are identified by training on the probabilistic distribution of tag occurrences. The approach also incorporates a highly adaptive method for the active and efficient fusion of multi-modal features.

1) Feature Mining for Text Mining: Li et al., [43] designed a new technique to discover positive and negative patterns in text documents. Both relevant and irrelevant documents contain useful features; in order to remove noise, the negative documents in the training set are used to improve the effectiveness of the Pattern Taxonomy Model (PTM). Two algorithms, HLF mining and N-revision, were introduced.

2) Feature Extraction for Classification: Kadhim et al., [50] [21] developed two weighting methods, TF-IDF (Term Frequency/Inverse Document Frequency) and TF-IDF global, to reduce the dimensionality of datasets, since processing the original thousands of features is very difficult. A fuzzy c-means clustering algorithm is used for feature extraction for classification.
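The TF-IDF weighting underlying such schemes can be stated compactly in code. The snippet below is a minimal, self-contained Python sketch of one common TF-IDF variant, not the exact global weighting scheme of [50]; the three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term of each tokenized document by
    tf = count / document length and idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts, length = Counter(doc), len(doc)
        weights.append({term: (c / length) * math.log(n / df[term])
                        for term, c in counts.items()})
    return weights

docs = [["text", "mining", "patterns"],
        ["text", "classification"],
        ["pattern", "mining", "mining"]]
for w in tf_idf(docs):
    print(w)
```

Terms that occur in every document receive weight zero under this variant, which is precisely the dimensionality-reducing effect the weighting is meant to achieve: ubiquitous terms carry no discriminative information.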
3) PCA and Random Projection (RP): Principal Component Analysis (PCA) is a simple technique used to explore and visualize data easily. It extracts useful information from complicated data sets with a non-parametric method and determines a lower-dimensional space statistically. The transformation matrix of PCA is calculated from the eigenvalue decomposition of the covariance matrix, so the computation cost is high and the method is not suitable for very high-dimensional data. The strength of PCA is that there are no parameters to fine-tune and no coefficients to adjust. Fradkin et al., [51] [52] reported a number of experiments evaluating random projection in supervised learning. Different datasets were tested to compare random projection and PCA using several machine learning methods. The results show that PCA outperforms RP for supervised learning; they also show that RPs are well suited for use with nearest-neighbour and SVM classifiers and are less satisfactory with decision trees.
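The trade-off described above can be seen directly in code. The following NumPy sketch contrasts PCA computed via eigendecomposition of the covariance matrix, as described above, with a data-independent Gaussian random projection; the data are synthetic and the target dimensionality k = 10 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 samples, 50 features
k = 10                                  # target dimensionality

# PCA: eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)                 # centre the data
cov = np.cov(Xc, rowvar=False)          # 50 x 50 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
X_pca = Xc @ top                        # project onto top-k components

# Random projection: a data-independent Gaussian matrix; no covariance
# computation, so it is far cheaper in high dimensions.
R = rng.normal(size=(50, k)) / np.sqrt(k)
X_rp = X @ R

print(X_pca.shape, X_rp.shape)          # (200, 10) (200, 10)
```

Because the projection matrix R never looks at the data, RP avoids the covariance computation entirely, which is why it scales to very high-dimensional data where eigendecomposition-based PCA becomes impractical.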
Table 1 compares representative feature selection algorithms.

Table 1: Feature selection algorithms.

| Sl. No. | Authors | Feature Selection (FS) | Algorithm | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | Zhao et al. (2016) [25] | Unsupervised | Gradient | Preserves similarity and discriminant information; high clustering performance is achieved | Supervised FS is not considered |
| 2 | Xu et al. (2016) [26] | - | Deep Learning | Performs better than traditional dimensionality reduction methods | Meta data information of tweets is not considered |
| 3 | Wang et al. (2015) [27] | Supervised and Unsupervised | Global Redundancy Minimization | Features are more compact and discriminant; superior performance without parameter tuning | - |

# VIII. Conclusions

The paper presents different techniques and frameworks for extracting relevant features from huge amounts of unstructured text documents, and reviews various text classification, clustering and summarization methods. Guaranteeing the quality of the relevant features extracted from a collection of text documents is a great challenge. Many text mining techniques have been proposed to date; however, how interesting and useful the discovered features are to the user is still an open issue. Our future work is to efficiently separate relevant documents from non-relevant documents. An effective filtering model is required to automatically generate facets. The security of the extracted features, and the time needed to extract useful, duplicate-free features and fine-grained knowledge that help the user reduce the time spent searching numerous web pages, also need to be addressed.

Clustering: The process of grouping similar kinds of information is called clustering and results in finding interesting knowledge. The newly discovered knowledge can be used by an industry for further development and helps it compete with its competitors.

Question Answering: For separating and combining terms, standard text searching techniques use Boolean operators. Sophisticated search in text mining executes the search at the sentence or phrase level and identifies verbal connections between the search terms, which is not possible in traditional search. The results obtained by sophisticated search can be used to provide specific information that can be leveraged by an organization.

Concept Linkage: The results obtained from sophisticated search are linked together to produce a new hypothesis; this linking of concepts is called concept linkage. New domains of knowledge can be generated by making use of concept linkage.

# References

[1] R. Agrawal and M. Batra, "A Detailed Study on Text Mining Techniques," International Journal of Soft Computing and Engineering (IJSCE), vol. 2, no. 6, 2013.
[2] V. H. Bhat, P. G. Rao, R. Abhilash, P. D. Shenoy, K. R. Venugopal, and L. M. Patnaik, "A Data Mining Approach for Data Generation and Analysis for Digital Forensic Application," International Journal of Engineering and Technology, vol. 2, no. 3, 2010.
[3] Y. Zhang, M. Chen, and L. Liu, "A Review on Text Mining," Proceedings of the 6th IEEE International Conference on Software Engineering and Service Science (ICSESS), 2015.
[4] S. Shehata, F. Karray, and M. S. Kamel, "An Efficient Concept-based Mining Model for Enhancing Text Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, 2010.
[5] V. H. Bhat, P. G. Rao, R. Abhilash, P. D. Shenoy, K. R. Venugopal, and L. M. Patnaik, "A Novel Data Generation Approach for Digital Forensic Application in Data Mining," Proceedings of the Second International Conference on Machine Learning and Computing (ICMLC), 2010.
[6] D. E. Brown, "Text Mining the Contributors to Rail Accidents," IEEE Transactions on Intelligent Transportation Systems, vol. 27, no. 5, 2015.
[7] K. R. Venugopal, K. G. Srinivasa, and L. M. Patnaik, Soft Computing for Data Mining Applications, Springer, 2009.
[8] V. K. Verma, M. Ranjan, and P. Mishra, "Text Mining and Information Professionals: Role, Issues and Challenges," Proceedings of the 4th International Symposium on Emerging Trends and Technologies in Libraries and Information Services (ETTLIS), 2015.
[9] A. Akilan, "Text Mining: Challenges and Future Directions," Proceedings of the Second International Conference on Electronics and Communication Systems (ICECS), 2015.
[10] D. Sanchez, M. J. Martin-Bautista, I. Blanco, and C. Torre, "Text Knowledge Mining: An Alternative to Text Data Mining," Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), 2008.
[11] Y. Dai, T. Kakkonen, and E. Sutinen, "MinEDec: A Decision-Support Model that Combines Text-Mining Technologies with Two Competitive Intelligence Analysis Methods," International Journal of Computer Information Systems and Industrial Management Applications, vol. 3, no. 10, 2011.
[12] Y. Hu and X. Wan, "PPSGen: Learning-Based Presentation Slides Generation for Academic Papers," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 4, 2015.
[13] E. D'Andrea, P. Ducange, B. Lazzerini, and F. Marcelloni, "Real-Time Detection of Traffic from Twitter Stream Analysis," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 4, 2015.
[14] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, "TEDAS: A Twitter-based Event Detection and Analysis System," Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE), 2012.
[15] V. H. Bhat, V. R. Malkani, P. D. Shenoy, K. R. Venugopal, and L. M. Patnaik, "Classification of Email using BeaKS: Behavior and Keyword Stemming," Proceedings of the IEEE Region 10 Conference (TENCON), 2011.
[16] A. Schulz, P. Ristoski, and H. Paulheim, "I See a Car Crash: Real-Time Detection of Small Scale Incidents in Microblogs," The Semantic Web: ESWC Satellite Events, 2013.
[17] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, "Text Detection and Recognition on Traffic Panels from Street-Level Imagery using Visual Appearance," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 1, 2014.
[18] D. P. Muni, N. R. Pal, and J. Das, "Genetic Programming for Simultaneous Feature Selection and Classifier Design," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 1, 2006.
[19] M. H. Aghdam, N. Ghasem-Aghaee, and M. E. Basiri, "Text Feature Selection Using Ant Colony Optimization," Expert Systems with Applications, vol. 36, no. 3, 2009.
[20] K. G. Srinivasa, A. Singh, A. Thomas, K. R. Venugopal, and L. M. Patnaik, "Generic Feature Extraction for Classification using Fuzzy C-means Clustering," Proceedings of the 3rd International Conference on Intelligent Sensing and Information Processing, 2005.
[21] E. Gasca, J. S. Sánchez, and R. Alonso, "Eliminating Redundancy and Irrelevance using a New MLP-based Feature Selection Method," Pattern Recognition, vol. 39, no. 2, 2006.
[22] R. Parikh and K. Karlapalem, "ET: Events from Tweets," Proceedings of the 22nd International Conference on World Wide Web Companion, 2013.
[23] R. K. Sivagaminathan and S. Ramakrishnan, "A Hybrid Approach for Feature Subset Selection using Neural Networks and Ant Colony Optimization," Expert Systems with Applications, vol. 33, no. 1, 2007.
[24] D. Cai, C. Zhang, and X. He, "Unsupervised Feature Selection for Multi-cluster Data," Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
[25] Z. Zhao, X. He, D. Cai, L. Zhang, W. Ng, and Y. Zhuang, "Graph Regularized Feature Selection with Data Reconstruction," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, 2016.
[26] L. Xu, C. Jiang, Y. Ren, and H.-H. Chen, "Microblog Dimensionality Reduction: A Deep Learning Approach," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, 2016.
[27] D. Wang, F. Nie, and H. Huang, "Feature Selection via Global Redundancy Minimization," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 10, 2015.
[28] H. Ogura, H. Amano, and M. Kondo, "Feature Selection with a Measure of Deviations from Poisson in Text Categorization," Expert Systems with Applications, vol. 36, no. 3, 2009.
[29] N. Azam and J. Yao, "Comparison of Term Frequency and Document Frequency based Feature Selection Metrics in Text Categorization," Expert Systems with Applications, vol. 39, no. 5, 2012.
[30] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, 2007.
[31] Z. Zhao, L. Wang, H. Liu, and J. Ye, "On Similarity Preserving Feature Selection," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 3, 2013.
[32] Z. Zhao, X. He, L. Zhang, W. Ng, and Y. Zhuang, "Graph Regularized Feature Selection with Data Reconstruction," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, 2016.
[33] J. Tang and H. Liu, "Unsupervised Feature Selection for Linked Social Media Data," Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
[34] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo, "Harvesting Domain Specific Ontologies from Text," Proceedings of the IEEE International Conference on Semantic Computing (ICSC), 2014.
[35] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, "ℓ2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning," Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011.
[36] D. Cai, X. He, J. Han, and T. S. Huang, "Graph Regularized Nonnegative Matrix Factorization for Data Representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, 2011.
[37] D. Sejal, K. Shailesh, V. Tejaswi, D. Anvekar, K. R. Venugopal, S. Iyengar, and L. Patnaik, "QRGQR: Query Relevance Graph for Query Recommendation," Proceedings of the IEEE Region 10 Symposium (TENSYMP), 2015.
[38] W. Fan, N. Bouguila, and D. Ziou, "Unsupervised Hybrid Feature Extraction Selection for High-Dimensional Non-Gaussian Data Clustering with Variational Inference," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 7, 2013.
[39] V. Gupta and G. S. Lehal, "A Survey of Text Summarization Extractive Techniques," Journal of Emerging Technologies in Web Intelligence, vol. 2, no. 3, 2010.
[40] P. S. Negi, M. Rauthan, and H. Dhami, "Text Summarization for Information Retrieval using Pattern Recognition Techniques," International Journal of Computer Applications, vol. 21, no. 10, 2011.
[41] F. Debole and F. Sebastiani, "An Analysis of the Relative Hardness of Reuters-21578 Subsets," Journal of the American Society for Information Science and Technology, vol. 56, no. 6, 2005.
[42] F. Xie, X. Wu, and X. Hu, "Keyphrase Extraction based on Semantic Relatedness," Proceedings of the 9th IEEE International Conference on Cognitive Informatics (ICCI), 2010.
[43] Y. Li, A. Algarni, and N. Zhong, "Mining Positive and Negative Patterns for Relevance Feature Discovery," Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
[44] N. Zhong, Y. Li, and S.-T. Wu, "Effective Pattern Discovery for Text Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, 2012.
[45] Y.-C. Chen, W.-C. Peng, and S.-Y. Lee, "Mining Temporal Patterns in Time Interval-Based Data," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 12, 2015.
[46] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao, "Inference of Regular Expressions for Text Extraction from Examples," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, 2016.
[47] Y. Li, A. Algarni, M. Albathan, Y. Shen, and M. A. Bijaksana, "Relevance Feature Discovery for Text Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 6, 2015.
[48] Q. Song, J. Ni, and G. Wang, "A Fast Clustering-based Feature Subset Selection Algorithm for High-Dimensional Data," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, 2013.
[49] T.-S. Nguyen, H. W. Lauw, and P. Tsaparas, "Review Selection Using Micro-Reviews," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 4, 2015.
[50] A. I. Kadhim, Y. Cheah, N. H. Ahamed, and L. A. Salman, "Feature Extraction for Co-occurrence-based Cosine Similarity Score of Text Documents," Proceedings of the IEEE Student Conference on Research and Development (SCOReD), 2014.
[51] D. Fradkin and D. Madigan, "Experiments with Random Projections for Machine Learning," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[52] S. Joshi, D. Shenoy, P. Rashmi, K. R. Venugopal, and L. M. Patnaik, "Classification of Alzheimer's Disease and Parkinson's Disease by using Machine Learning and Neural Network Methods," Proceedings of the Second International Conference on Machine Learning and Computing (ICMLC), 2010.
[53] H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo, "Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections," Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries, 1998.
[54] X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets," Proceedings of the SIAM International Conference on Data Mining (SDM), 2003.
[55] A. Gomariz, M. Campos, R. Marin, and B. Goethals, "ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences," Advances in Knowledge Discovery and Data Mining, 2013.
[56] J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets," ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, vol. 4, no. 2, 2000.
[57] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[58] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proceedings of the International Conference on Data Engineering (ICDE), 2001.
[59] K. R. Venugopal and R. Buyya, Mastering C++, Tata McGraw-Hill Education, 2013.
[60] M. N. Garofalakis, R. Rastogi, and K. Shim, "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints," Proceedings of the International Conference on Very Large Data Bases (VLDB), 1999.
[61] N. Zhong, Y. Li, and S.-T. Wu, "Effective Pattern Discovery for Text Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, 2012.
[62] A. Inje and U. Patil, "Operational Pattern Revealing Technique in Text Mining," Proceedings of the IEEE Students' Conference on Electrical, Electronics and Computer Science (SCEECS), 2014.
[63] R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," ACM SIGMOD Record, vol. 27, no. 2, 1998.
[64] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proceedings of the International Conference on Machine Learning (ICML), 1997.
[65] P. D. Shenoy, K. G. Srinivasa, K. R. Venugopal, and L. M. Patnaik, "Dynamic Association Rule Mining using Genetic Algorithms," Intelligent Data Analysis, vol. 9, no. 5, 2005.
[66] M. Seno and G. Karypis, "SLPMiner: An Algorithm for Finding Frequent Sequential Patterns using Length-Decreasing Support Constraint," Proceedings of the IEEE International Conference on Data Mining, 2002.
[67] N. R. Mabroukeh and C. I. Ezeife, "A Taxonomy of Sequential Pattern Mining Algorithms," ACM Computing Surveys (CSUR), vol. 43, no. 1, 2010.
[68] M. J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 42, no. 1-2, 2001.
[69] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, 2004.
[70] V. P. Raju and G. S. Varma, "Mining Closed Sequential Patterns in Large Sequence Databases," International Journal of Database Management Systems, vol. 7, no. 1, 2015.
[71] J. Zhang, X. Zhao, S. Zhang, S. Yin, and X. Qin, "Interrelation Analysis of Celestial Spectra Data using Constrained Frequent Pattern Trees," Knowledge-Based Systems, vol. 41, no. 4, 2013.
[72] F. Li, B. C. Ooi, M. T. Özsu, and S. Wu, "Distributed Data Management using MapReduce," ACM Computing Surveys (CSUR), vol. 46, no. 3, 2014.
[73] N. Tiwari, S. Sarkar, U. Bellur, and M. Indrawan, "Classification Framework of MapReduce Scheduling Algorithms," ACM Computing Surveys (CSUR), vol. 47, no. 3, 2015.
[74] Y. Xun, J. Zhang, and X. Qin, "FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 3, 2016.
[75] S. Sakr, A. Liu, and A. G. Fayoumi, "The Family of MapReduce and Large-Scale Data Processing Systems," ACM Computing Surveys (CSUR), vol. 46, no. 1, 2013.
[76] L. Wang, L. Feng, J. Zhang, and P. Liao, "An Efficient Algorithm of Frequent Itemsets Mining based on MapReduce," Journal of Information and Computational Science, vol. 11, no. 8, 2014.
[77] T. Ramakrishnudu and R. Subramanyam, "Mining Interesting Infrequent Itemsets from Very Large Data based on MapReduce Framework," International Journal of Intelligent Systems and Applications, vol. 7, no. 7, 2015.
[78] E. Ozkural, B. Ucar, and C. Aykanat, "Parallel Frequent Item Set Mining with Selective Item Replication," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 10, 2011.
[79] Z. Xu and R. Akella, "Active Relevance Feedback for Difficult Queries," Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008.
[80] S. Desai, V. Chandrasheker, V. Mathapati, K. R. Venugopal, S. S. Iyengar, and L. M. Patnaik, "User Feedback Session with Clicked and Unclicked Documents for Related Search Recommendation," IADIS International Journal on Computer Science and Information Systems, vol. 11, no. 1, 2016.
[81] G. Cao, J.-Y. Nie, J. Gao, and S. Robertson, "Selecting Good Expansion Terms for Pseudo-Relevance Feedback," Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008.
[82] J. H. Paik, D. Pal, and S. K. Parui, "Incremental Blind Feedback: An Effective Approach to Automatic Query Expansion," ACM Transactions on Asian Language Information Processing (TALIP), vol. 13, no. 3, 2014.
[83] A. Algarni, Y. Li, and Y. Xu, "Selected New Training Documents to Update User Profile," Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[84] S. Niharika, V. S. Latha, and D. Lavanya, "A Survey on Text Categorization," International Journal of Computer Trends and Technology, vol. 3, no. 1, 2012.
[85] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1, 2002.
[86] R. Irfan, C. K. King, D. Grages, S. Ewen, S. U. Khan, S. A. Madani, J. Kolodziej, L. Wang, D. Chen, and A. Rayes, "A Survey on Text Mining in Social Networks," The Knowledge Engineering Review, vol. 30, no. 2, 2015.
[87] H. N. Vu, T. A. Tran, I. S. Na, and S. H. Kim, "Automatic Extraction of Text Regions from Document Images by Multilevel Thresholding and k-means Clustering," Proceedings of the IEEE/ACIS 14th International Conference on Computer and Information Science (ICIS), 2015.
[88] Z. Dou, Z. Jiang, S. Hu, J.-R. Wen, and R. Song, "Automatically Mining Facets for Queries from Their Search Results," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 2, 2016.
[89] D. Sejal, K. Shailesh, V. Tejaswi, D. Anvekar, K. R. Venugopal, S. Iyengar, and L. Patnaik, "Query Click and Text Similarity Graph for Query Suggestions," International Workshop on Machine Learning and Data Mining in Pattern Recognition, 2015.
[90] X. Shi and C. C. Yang, "Mining Related Queries from Web Search Engine Query Logs using an Improved Association Rule Mining Model," Journal of the American Society for Information Science and Technology, vol. 58, no. 12, 2007.
[91] M. Diao, S. Mukherjea, N. Rajput, and K. Srivastava, "Faceted Search and Browsing of Audio Content on Spoken Web," Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[92] C. Efstathiades, A. Efentakis, and D. Pfoser, "Efficient Processing of Relevant Nearest-Neighbor Queries," ACM Transactions on Spatial Algorithms and Systems (TSAS), vol. 2, no. 3, 2016.
[93] C. Zhang, Y. Zhang, W. Zhang, and X. Lin, "Inverted Linear Quadtree: Efficient Top-k Spatial Keyword Search," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, 2016.
[94] K. Pripužić, I. P. Žarko, and K. Aberer, "Time- and Space-Efficient Sliding Window Top-k Query Processing," ACM Transactions on Database Systems (TODS), vol. 40, no. 1, 2015.
[95] W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter, "Space-Efficient Frameworks for Top-k String Retrieval," Journal of the ACM (JACM), vol. 61, no. 2, 2014.
[96] W. Kong and J. Allan, "Extending Faceted Search to the General Web," Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 2014.
[97] M. Bron, K. Balog, and M. de Rijke, "Ranking Related Entities: Components and Analyses," Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[98] G. Navarro, "Spaces, Trees, and Colors: The Algorithmic Landscape of Document Retrieval on Sequences," ACM Computing Surveys (CSUR), vol. 46, no. 4, 2014.
[99] S. Liu, Y. Chen, H. Wei, J. Yang, K. Zhou, and S. M. Drucker, "Exploring Topical Lead-Lag Across Corpora," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, 2015.
[100] D. Jiang, Y. Tong, and Y. Song, "Cross-Lingual Topic Discovery from Multilingual Search Engine Query Log," ACM Transactions on Information Systems (TOIS), vol. 35, no. 2, 2016.
[101] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, "WebTables: Exploring the Power of Tables on the Web," Proceedings of the VLDB Endowment, vol. 1, 2008.
[102] J. Pound, S. Paparizos, and P. Tsaparas, "Facet Discovery for Structured Web Search: A Query-Log Mining Approach," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2011.
[103] I. S. Altingovde, R. Ozcan, and Ö. Ulusoy, "Static Index Pruning in Web Search Engines: Combining Term and Document Popularities with Query Views," ACM Transactions on Information Systems (TOIS), vol. 30, no. 1, 2012.
[104] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Query-Based Data Pricing," Journal of the ACM (JACM), vol. 62, no. 5, 2015.
[105] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, and G. Tolomei, "Discovering Tasks from Search Engine Query Logs," ACM Transactions on Information Systems (TOIS), vol. 31, no. 3, 2013.
[106] Z. Liu and Y. Chen, "Differentiating Search Results on Structured Data," ACM Transactions on Database Systems (TODS), vol. 37, no. 1, 2012.
[107] V. Thai, P.-Y. Rouille, and S. Handschuh, "Visual Abstraction and Ordering in Faceted Browsing of Text Collections," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 2, 2012.
[108] I. Catallo, E. Ciceri, P. Fraternali, D. Martinenghi, and M. Tagliasacchi, "Top-k Diversity Queries Over Bounded Regions," ACM Transactions on Database Systems (TODS), vol. 38, no. 2, 2013.
[109] H. Bast and M. Celikik, "Efficient Fuzzy Search in Large Text Collections," ACM Transactions on Information Systems (TOIS), vol. 31, no. 2, 2013.
[110] A. Termehchy and M. Winslett, "Using Structural Information in XML Keyword Search Effectively," ACM Transactions on Database Systems (TODS), vol. 36, no. 1, 2011.
[111] R. Colini-Baldeschi, S. Leonardi, M. Henzinger, and M. Starnberger, "On Multiple Keyword Sponsored Search Auctions with Budgets," ACM Transactions on Economics and Computation, vol. 4, no. 1, 2016.
[112] J. Arguello and R. Capra, "The Effects of Aggregated Search Coherence on Search Behavior," ACM Transactions on Information Systems (TOIS), vol. 35, no. 1, 2016.
[113] D. Wu, M. L. Yiu, and C. S. Jensen, "Moving Spatial Keyword Queries: Formulation, Methods, and Analysis," ACM Transactions on Database Systems (TODS), vol. 38, no. 1, 2013.
[114] Y. Lu, J. Lu, G. Cong, W. Wu, and C. Shahabi, "Efficient Algorithms and Cost Models for Reverse Spatial-Keyword k-Nearest Neighbor Search," ACM Transactions on Database Systems (TODS), 2014.
[115] X. Cao, G. Cong, T. Guo, C. S. Jensen, and B. C. Ooi, "Efficient Processing of Spatial Group Keyword Queries," ACM Transactions on Database Systems (TODS), vol. 40, no. 2, 2015.
[116] Z. Guan, S. Yang, H. Sun, M. Srivatsa, and X. Yan, "Fine-Grained Knowledge Sharing in Collaborative Environments," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 8, 2015.
[117] H. Wang, Y. Song, M.-W. Chang, X. He, R. W. White, and W. Chu, "Learning to Extract Cross-Session Search Tasks," Proceedings of the 22nd International Conference on World Wide Web, 2013.
[118] A. Kotov, P. N. Bennett, R. W. White, S. T. Dumais, and J. Teevan, "Modeling and Analysis of Cross-Session Search Tasks," Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011.
[119] T. Papenbrock, A. Heise, and F. Naumann, "Progressive Duplicate Detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, 2015.
[120] H. Bano and F. Azam, "Innovative Windows for Duplicate Detection," International Journal of Software Engineering and Its Applications, vol. 9, no. 1, 2015.
[121] A. Bronselaer, D. Van Britsom, and G. De Tré, "Propagation of Data Fusion," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, 2015.
[122] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée, and W. Nejdl, "A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces," IEEE Transactions on Knowledge and Data Engineering, 2013.
[123] G. Papadakis and W. Nejdl, "Efficient Entity Resolution Methods for Heterogeneous Information Spaces," Proceedings of the IEEE 27th International Conference on Data Engineering Workshops (ICDEW), 2011.
[124] O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller, "Framework for Evaluating Clustering Algorithms in Duplicate Detection," Proceedings of the VLDB Endowment, vol. 2, 2009.
[125] S. E. Whang, D. Marmaros, and H. Garcia-Molina, "Pay-as-you-go Entity Resolution," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 5, 2013.
[126] A. A. Abraham and S. D. Kanmani, "A Survey on Various Methods used for Detecting Duplicates."
[127] J. J. Tamilselvi and C. B. Gifta, "Handling Duplicate Data in Data Warehouse for Data Mining," International Journal of Computer Applications, vol. 15, no. 4, 2011.
[128] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate Record Detection: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, 2007.
[129] P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, 2012.
[130] T. de Vries, H. Ke, S. Chawla, and P. Christen, "Robust Record Linkage Blocking using Suffix Arrays and Bloom Filters," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 2, 2011.
[131] O. Hassanzadeh and R. J. Miller, "Creating Probabilistic Databases from Duplicated Data," The VLDB Journal, vol. 18, 2009.
[132] A. Bronselaer and G. De Tré, "Aspects of Object Merging," Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), 2010.
[133] U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, "Adaptive Windows for Duplicate Detection," Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE), 2012.
[134] F. Naumann, A. Bilke, J. Bleiholder, and M. Weis, "Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies," IEEE Data Engineering Bulletin, vol. 29, no. 2, 2006.
[135] J. Bleiholder and F. Naumann, "Data Fusion," ACM Computing Surveys (CSUR), vol. 41, no. 1, 2009.
[136] L. Meng, A.-H. Tan, and D. Xu, "Semi-Supervised Heterogeneous Fusion for Multimedia Data Co-Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, 2014.