# I. Introduction

Society is increasingly digitized, and as a result organisations produce and store vast amounts of data. Managing this data and gaining insights from it is both a challenge and a key to competitive advantage. Web-based social applications, such as social networking websites, generate huge amounts of unstructured text data that contain a great deal of useful information. People rarely bother about grammatical correctness while forming a sentence, which may lead to lexical, syntactic, and semantic ambiguities, and finding patterns in unstructured text is a difficult task. Data mining aims to discover previously unknown interrelations among apparently unrelated attributes of data sets by applying methods from several areas, including machine learning, database systems, and statistics. Much research has focused on different branches of data mining, such as opinion mining, web mining, and text mining.

Text mining is one of the most important strategies in knowledge discovery. It is the technique of extracting previously unknown, hidden, understandable, and interesting knowledge or patterns from unstructured text. The prime objective of text mining is to reduce the effort users must make to obtain relevant information from a collection of text sources [1]. Thus, our focus is on methods that extract useful patterns from texts in order to categorize or structure text collections. Generally, around 80 percent of a company's information is stored in text documents; hence text mining has a higher economic value than data mining. Current research in text mining tackles problems of text representation, classification, clustering, information extraction, and the search for and modelling of hidden patterns. Feature selection, the influence of domain knowledge, and domain-specific procedures play an important role. Text documents contain large-scale terms, patterns, and duplicate lists. Facets of a user query are usually presented in list styles and repeated many times among the top retrieved documents; to aggregate frequent lists within the top search results, various navigational techniques have been presented to mine query facets. Finding the best query facet and effectively using large-scale patterns remain hard problems in text mining. Moreover, traditional feature selection methods are not effective for selecting text features to solve the relevance issue. These issues suggest that we need efficient and effective methods to mine fine-grained knowledge from huge numbers of text documents and to help users obtain information about a query quickly, without browsing tens of pages. This paper provides a review of innovative techniques for extracting and classifying terms and patterns.

The organisation of the paper is as follows: Section 1 introduces a detailed overview of text mining frameworks and the applications and benefits of text mining. Sections 2 and 3 review feature selection, feature extraction, and techniques of pattern extraction. Section 4 discusses various text classification and clustering algorithms in text mining. Sections 5 and 6 introduce a detailed overview of discovering facets and fine-grained knowledge. Section 7 reviews duplicate detection in text documents. Section 8 contains the conclusions.
# a) Text Mining Models

Text mining consists of three steps: text preprocessing, text mining operations, and text post-processing. Text preprocessing includes data selection and cleaning. Many approaches [2] are concerned with obtaining structured datasets, called intermediate forms, on which data mining techniques [3] can then be applied. When documents contain terms with the same frequency, some of those terms may be meaningful while others are irrelevant. In order to discover the semantics of text, a concept-based mining model has been introduced; Figure 2 represents this model, which analyses the terms in the sentences of documents. The model contains a group of concept analyses: sentence-based concept analysis, document-based concept analysis, and a corpus-based similarity measure [4]. The concept-based similarity measure calculates the similarity between documents effectively and efficiently.

# b) Benefits of Text Mining

The benefits of text mining include better collection development to meet user needs, information retrieval, improved usability and system performance, database evaluation, and hypothesis development. Information professionals (IPs) [8] are always at the forefront of emerging technologies, and libraries and information services usually rely on them to make their products and services better and more efficient. Trained information professionals manage both the technical and semantic infrastructures, which is very important in text mining; they also manage content selection and the formulation of search techniques and algorithms.

Akilan et al., [9] presented the challenges and future directions in text mining. It is mandatory to perform semantic analysis to capture the relationships among objects in documents. Semantic analysis is computationally expensive and operates on only a few words per second, as text mining involves a significant language component. An effective text refining method has to be developed to process multilingual text documents, and trained knowledge specialists are necessary to deal with the products and applications of current text mining tools. Automated mining operations that can be used by technical users are also required. Domain knowledge plays an important role both at the text refining stage and at knowledge distillation, and hence helps in improving the efficiency of text mining.

Sanchez et al., [10] presented text knowledge mining (TKM), deductive inference that is usually targeted at a feasible subset of texts and searches for contradictions. The procedure obtains new knowledge by taking the union of the intermediate forms of texts derived from accurate knowledge expressed in the text. Dai et al., [11] introduced the competitive intelligence analysis methods FFA (Five Forces Analysis) and SWOT together with text mining technologies. Knowledge is extracted from the raw data during the transformation process, which enables business enterprises to make decisions more reliably and easily. The Mining Environment for Decisions (MinEDec) system has not been evaluated in real business environments.

Hu et al., [12] presented the interesting task of automatically generating presentation slides for academic papers. Importance scores of sentences in the papers are obtained using a support vector regression method, and integer linear programming is then used to generate well-structured slides. The method provides researchers with draft slides that help in preparing the final slides used for presentation. The approach does not handle tables, graphs, and figures in the papers.
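To make the term-based baseline behind such similarity comparisons concrete, the sketch below computes pairwise document similarity from TF-IDF vectors. It is a minimal illustration assuming scikit-learn and a toy corpus, not the concept-based model of [4].

```python
# Term-based document similarity: TF-IDF vectors compared with cosine similarity.
# A minimal sketch; the corpus below is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "text mining discovers patterns in unstructured text",
    "data mining discovers interrelations among data attributes",
    "traffic events are detected from tweets in real time",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # documents x terms sparse matrix
sim = cosine_similarity(tfidf)           # pairwise document similarity

print(sim.round(2))  # sim[0, 1] > sim[0, 2]: the two mining documents agree more
```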
# c) Traffic based Event in Text Mining

Andrea et al., [13] [14] proposed a real-time monitoring system for traffic event detection that fetches tweets, classifies them, and then notifies users about traffic events. Tweets are fetched and processed using text mining techniques, and each tweet related to a traffic event is assigned a class label. If the event is due to an external cause, such as a football match, procession, or manifestation, the system also discriminates that traffic event. The final results show that the system is capable of detecting traffic events, but traffic condition notifications are not captured in real time.

An efficient and scalable system to detect Events from Tweets (ET) [15] from a set of microblogs has been proposed, considering their textual and temporal components. The main goal of the proposed ET system is the efficient use of content similarity and appearance similarity among keywords in order to cluster related keywords. A hierarchical clustering technique based on the common co-occurring features of keywords is used to determine the events [16]. ET is evaluated on two datasets from two different domains, and the results show that relevant events can be detected efficiently. The use of a semantic knowledge base like Yago is not incorporated.

Schulz et al., [17] proposed a machine learning algorithm that combines text classification with enrichment of the semantics of the microblog. It identifies small-scale incidents with high accuracy and precisely localizes microblogs in space and time, which enables it to detect incidents in real time. The algorithm not only gives information about an incident but also yields valuable, previously unknown information about incidents. It does not consider NLP techniques or large data.

ITS (Intelligent Transportation Systems) [18] recognizes traffic panels and extracts the information contained on them. First, white and blue color segmentation is applied, and descriptors are then derived at points of interest. The images are treated as bags of words and classified using Naïve Bayes or SVM (Support Vector Machine) classifiers. This kind of categorization, where images are classified based on visual appearance, is new for traffic panel detection, but multi-frame integration is not addressed.

# II. Feature Selection and Extraction

Text may be loosely organized, with incomplete or omitted information in the documents. The text has to be scanned attentively to determine the problems; if it is not scanned and scrutinised properly, poor accuracy results on unstructured data, and hence preprocessing is necessary. Preprocessing guarantees successful application of text analysis, but may consume substantial processing time. Text processing can be done by two basic methods: a) feature selection and b) feature extraction.

# a) Feature Selection

Research in numerous fields like machine learning, data mining, computer vision, statistics, and linked fields has led to a diversity of feature selection approaches in supervised and unsupervised settings. Feature selection (FS) plays an important role in data mining for the categorization of text. The central idea of feature selection is to reduce the dimension of the feature set by choosing features appropriately, which enhances efficiency and performance. FS is a search process and is categorized into forward search and backward search. Mehdi et al., [19] [20] executed an innovative feature selection algorithm based on Ant Colony Optimization (ACO).
Without any prior knowledge of features, a minimal feature subset is determined by applying ACO [21]. The approach uses a simple nearest neighbor classifier to show the effectiveness of the ACO algorithm: it reduces computational cost and outperforms the information gain and chi-square methods. Complex classifiers and different kinds of datasets are not incorporated, and combining the algorithm with other population-based feature selection algorithms is not considered.

Gasca et al., [22] proposed a feature selection method based on the Multilayer Perceptron (MLP). Under certain objective functions, the approach determines and corrects the set of irrelevant attributes. It computes the relative contribution of each attribute with respect to the output units, and for each output unit the contributions are sorted in descending order. An objective function called prominence is computed for each attribute. Selecting features from large documents is problematic in unsupervised learning because class labels are unknown.

Sivagaminathan et al., [23] [24] proposed a hybrid approach based on a fixed-size subset to solve the feature subset selection problem in neural network pattern classifiers. It considers both individual performance and subset performance. Features are selected using the pheromone trail and heuristic values via state transition rules. After a feature is selected, a global updating rule increments the features, which ultimately gives better classification performance without increasing the overall computational cost.

Ogura et al., [28] proposed an approach to reduce the feature dimension space which measures, for each term, how far its probability distribution deviates from the Poisson distribution; these deviations from Poisson are insignificant for documents that do not belong to the category. Three measures are employed as benchmarks, and with two classifiers, SVM and K-NN, they give better performance than other conventional measures. The Gini index proved better than chi-square and information gain (IG) in terms of macro- and micro-averaged F1. These measures do not utilize the number of times a term occurs in a document, and their computational complexity could not be reduced to that of other typical measures such as IG and CHI.

Feature selection can also be measured based on the term and document frequencies of words. Azam et al., [29] observe these frequencies for measuring FS. The metrics of the Discriminative Power Measure (DPM) and the Gini index (GINI) are compared, and term frequency based metrics are useful for small feature sets. The most important features returned by DPM and GINI tend to cover most of the available information at a faster rate, i.e., with a lower number of features, but they are comparatively slower in covering document frequency information.

Yan et al., [30] presented a graph embedding framework for dimensionality reduction. The framework is also used as a tool that unifies many feature extraction methods. Features are selected based on spectral graph theory, and the framework unifies both supervised and unsupervised feature selection. Zhao et al., [31] developed a similarity preserving feature selection framework to handle redundant features. A combined optimization formulation of sparse multiple-output regression is used for selecting similarity preserving features. The framework does not address existing kernel and metric learning methods or semi-supervised feature selection methods.
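As a concrete illustration of the filter-style metrics compared above, the following sketch ranks terms by the chi-square statistic; it assumes scikit-learn and a toy labeled corpus, and scores such as Gini or DPM would slot into the same pipeline.

```python
# Filter-based feature selection with the chi-square metric: score each term
# against the class labels and keep the k highest-scoring terms.
# A minimal sketch; corpus and labels are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "engine and wheels",
    "engine oil change",
    "rocket launch orbit",
    "orbit around the moon",
]
labels = [0, 0, 1, 1]  # 0 = autos, 1 = space

vec = CountVectorizer()
X = vec.fit_transform(docs)                      # documents x terms counts

selector = SelectKBest(chi2, k=3).fit(X, labels) # rank terms by chi-square
terms = vec.get_feature_names_out()
print([terms[i] for i in selector.get_support(indices=True)])
```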
# 1) Feature Selection based Graph Reconstruction

Feature selection is a major task in efficient data mining, and it poses a significant challenge in small labeled-sample problems. Unlabeled data is abundant, but when labeled data is extremely scarce, supervised feature selection algorithms fail for want of sufficient information. Zhao et al., [32] introduced graph regularized data reconstruction to overcome these problems; the approach achieves higher clustering performance in both unsupervised and supervised feature selection.

Linked social media produces enormous amounts of unlabeled data, and selecting features for unlabeled data is difficult due to the lack of label information. Tang et al., [33] proposed an unsupervised feature selection framework, LUFS (Linked Unsupervised Feature Selection), for linked social media data to overcome this problem. The design constructs pseudo-class labels through social dimension extraction and spectral analysis. LUFS efficiently exploits association information but does not exploit link information. Computer vision and pattern recognition problems both have inherent manifold structure, and a Laplacian regularizer is included, along with a scale factor, to smooth the clustering process.

In text mining applications, several existing systems incorporate NLP-based techniques which parse the text and use patterns for mining and examining the parse trees, which can be trivial or complex. Mousavi et al., [34] formulated a weighted graph representation of text, called TextGraphs, which captures the grammatical and semantic relations between words in textual terms. The framework, called SemScape, creates parse trees for each sentence and uses a two-step pattern-based procedure to extract candidate terms and their grammatical relations from the parse trees.

Due to the absence of label information, it is hard to select discriminative features in unsupervised learning, and existing unsupervised feature selection algorithms frequently select the features that best preserve the data distribution. Yang et al., [35] proposed L2,1-norm regularized Unsupervised Discriminative Feature Selection (UDFS). The algorithm chooses the most discriminative feature subset from the entire feature set in batch mode. UDFS outperforms existing unsupervised feature selection algorithms and selects discriminative features for data representation, but its performance is sensitive to the number of selected features and is data dependent.

Cai et al., [36] presented a novel algorithm called Graph regularized Nonnegative Matrix Factorization (GNMF) [37], which explicitly considers local invariance. In GNMF, the geometrical information of the data space is encoded by constructing a nearest neighbor graph and seeking a parts-based representation space in which two data points are close to each other if they are connected in the graph. GNMF models the data space as a submanifold embedded in the ambient space and achieves more discriminating power than the ordinary NMF approach.
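For reference, ordinary NMF, the baseline that GNMF improves upon, can be run as below. This is a hedged sketch assuming scikit-learn, whose NMF implementation does not include GNMF's nearest-neighbor-graph regularizer; the documents are illustrative.

```python
# Parts-based representation with plain NMF: factor the document-term matrix
# A into document-topic coefficients W and topic-term parts H.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "graph regularized feature selection",
    "nearest neighbor graph construction",
    "traffic events detected from tweets",
    "tweets classified for traffic events",
]

A = TfidfVectorizer().fit_transform(docs)
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(A)   # document x topic coefficients
H = model.components_        # topic x term basis ("parts")
print(W.round(2))
```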
Fan et al., [38] suggested a principled variational framework for unsupervised feature selection with non-Gaussian data, which is applicable to applications ranging over diverse domains and disciplines. The variational framework provides a deterministic alternative to Bayesian approximation by maximizing a lower bound on the marginal likelihood, which has the advantage of computational efficiency.

# 2) Text Summarization and Datasets

Several approaches have been developed to date for automatic summarization by identifying important topics in a single document or clustered documents. Gupta et al., [39] describe a topic representation approach that captures topics through a frequency-driven approach using word probabilities, which gives reasonable performance with conceptual simplicity. Negi et al., [40] developed a system that summarizes information from a group of documents. The proposed system constructs the information from the given text; it achieves high accuracy but cannot calculate the relevance of the documents.

Debole et al., [41] first explain the three phases in the life cycle of a TC system: document indexing, classifier learning, and classifier evaluation. Most research uses the Reuters-21578 corpus for TC experiments, and several studies have used the ModApte split for testing. One of the subsets used for the experiments is the set of ten categories with the largest number of positive training examples.

Xie et al., [42] proposed an approach to acquiring semantic features within phrases from a single document in order to extract document keyphrases. The keyphrase extraction method consistently performs better than TF-IDF and KEA. Keyphrase extraction is basic research in text mining and natural language processing. The method builds on semantic relatedness: the degree of relatedness between phrases is calculated from their co-occurrences in a given document and represented as a relatedness graph. The approach is not domain specific; it generalizes well to journal articles and has been tested on news web pages.

Obtaining information online is an easy task: we log on to the world wide web and give simple keywords. However, it is not easy for the user to read all of the information returned. Hence text summarization is needed.
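A frequency-driven extractive summarizer of the kind described above [39] can be sketched in a few lines: score each sentence by the corpus probabilities of its words and keep the top-scoring sentences. The tokenization and scoring below are simplified illustrations, not a cited system.

```python
# Minimal frequency-driven extractive summarization: sentences whose words
# have high corpus probability are kept, in their original order.
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    prob = Counter(words)
    total = sum(prob.values())

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(prob[t] / total for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in ranked]  # preserve document order

text = "Text mining extracts patterns. Patterns reveal knowledge. The weather is nice."
print(summarize(text))  # keeps the two pattern-related sentences
```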
# b) Feature Extraction

Figure: collection of text documents → extraction of useful features → feature weighting (specificity) → data fusion → duplicate detection → duplicate-free relevant features.

Zhong et al., [44] presented an effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, as shown in Table 2, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. The proposed model outperforms other pure data mining-based methods, concept-based models, and term-based state-of-the-art models such as BM25 and SVM.

Li et al., [47] proposed two algorithms, FClustering and WFeature, to discover both positive and negative patterns in text documents. FClustering classifies terms into three categories (general, positive, and negative) automatically, without manually set parameters; WFeature is then executed to calculate the weights of the terms. WFeature is effective because the number of selected terms is less than the average size of the documents. The model is evaluated on the RCV1 corpus, TREC topics, and Reuters-21578, as shown in Table 2, and performs much better than term-based and pattern-based methods. The use of an irrelevance feedback strategy is highly efficient for improving the overall performance of the relevance feature discovery model.

Xu et al., [26] experimented on microblog dimensionality reduction with a deep learning approach, which aims at extracting useful information from the large amounts of textual data produced by microblogging services. The approach involves mapping natural language texts into proper numerical representations, which is a challenging issue. Two approaches, modifying the training data and modifying the training objective of deep networks, are presented to exploit microblog-specific information. Meta-information contained in tweets, such as embedded hyperlinks, is not explored.

Nguyen et al., [49] worked on review selection using micro-reviews. The approach consists of two steps: matching review sentences with micro-reviews, and selecting a few reviews that cover many micro-reviews. A heuristic algorithm is computationally fast and provides informative reviews.

# III. Pattern Extraction

Patterns which are close to their super-patterns appearing in the same paragraph are termed closed relations and need to be eliminated. The shorter pattern is not considered, since it is meaningless, while the longer pattern is more meaningful; hence these are the significant patterns in the pattern taxonomy.

Abonem et al., [53] presented a text mining framework that discovers knowledge by preprocessing the data. Text in documents usually contains words, special characters, and structural information, and the special characters are replaced by symbols. The framework mainly focuses on filtering out uninteresting patterns, which decreases the time and the size of the search space needed for the discovery phase; it is more efficient when large collections of documents are considered. Post-processing involves pruning, organizing, and ordering the results. The rule for each document is to find a set of characteristic phrases and keywords, i.e., length, tightness, and mutual confidence. The ranking of the rules within a document is measured by calculating a weight for each rule.

# a) Mining Closed Sequential Patterns

Mining the entire set of frequent subsequences for every long pattern generates an uncontrollable number of frequent subsequences, which is expensive in space and time. Yan et al., [54] proposed mining only frequent closed subsequences through the CloSpan (Closed Sequential Pattern Mining) algorithm. CloSpan efficiently mines frequent closed sequences in large data sets with low minimum support but does not take advantage of the search-space pruning property. Gomariz et al., [55] presented the CSpan algorithm, which mines closed sequential patterns early by using a pruning method called occurrence checking. CSpan outperforms CloSpan and the ClaSP algorithm.

# b) Mining Sequential Patterns

To delimit the search and to grow subsequence fragments, Han et al., [57] proposed FreeSpan (Frequent Pattern-Projected Sequential Pattern Mining). FreeSpan fuses the mining of frequent sequences with that of frequent patterns and adopts projected sequence databases. FreeSpan runs faster than the Apriori-based GSP algorithm and is highly scalable and processing-efficient in mining the complete set of patterns, but it causes page thrashing, as it requires extra memory.

With extensive applications in data mining, sequential pattern mining encounters problems on very large databases. Pei et al., [58] proposed a sequential pattern mining method called PrefixSpan (Prefix-Projected Sequential Pattern Mining). The complete set of patterns is extracted while reducing the generation of candidate subsequences, and prefix projection substantially reduces the size of the projected databases and greatly improves efficiency, as shown in Table 3.
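The prefix-projection idea can be illustrated with a minimal PrefixSpan-style recursion, simplified here to sequences of single items; the published algorithm also handles itemset elements and stronger pruning, and the database and support threshold below are illustrative.

```python
# Prefix projection in the PrefixSpan style: grow a pattern by frequent items,
# then recurse on the projected (suffix) database of each extended prefix.
def prefixspan(db, min_support, prefix=None):
    prefix = prefix or []
    results = []
    # count support of each item in the projected database
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = prefix + [item]
        results.append((pattern, support))
        # project: keep the suffix after the first occurrence of the item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        projected = [s for s in projected if s]
        results += prefixspan(projected, min_support, pattern)
    return results

db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
for pattern, support in prefixspan(db, min_support=2):
    print(pattern, support)   # e.g. ['a', 'b', 'c'] 2
```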
Using RE (regular expressions) [59] as a flexible constraint, the SPIRIT algorithm was proposed by Garofalakis et al., [60] for mining sequential patterns. A family of four algorithms is executed, each enforcing a stronger relaxation of the RE than its predecessor in the pattern mining loop; candidate sequences containing elements that do not appear in the RE are pruned. The degree to which RE constraints are enforced to prune the search space of patterns is the main distinguishing factor. Results on real-life data show the RE's adaptability as a user-level tool for focusing on interesting patterns.

Jian et al., developed a new framework called Pattern Growth (PG). PG is based on the prefix monotone property: every monotone and anti-monotone regular expression constraint is preprocessed and pushed into a PG-based mining algorithm. PG adopts and handles regular expression constraints which are difficult to explore using Apriori-based methods like SPIRIT. The candidate generation-and-test framework adopted by PG is less expensive and more efficient at pushing many constraints than the SPIRIT method. During prefix growth, many irrelevant sequences can be excluded from the huge dataset, so the projected database quickly shrinks. While PG outperforms SPIRIT, interesting constraints specific to complex structure mining are not explored.

To filter the discovered patterns, Li et al., [43] [61] proposed an effective pattern discovery technique that deploys and evolves patterns to refine the discovered patterns. Using these discovered patterns, the relevant information can be determined in order to improve effectiveness. Not all frequent short patterns and long patterns are useful, and long patterns with high specificity suffer from the problem of low frequency. The problems of low frequency and misinterpretation in text mining can be solved by employing pattern deploying strategies.

Rather than using individual words, some research uses phrases to discover relevant patterns from document collections. This yields only a small improvement in the effectiveness of text mining, because phrase-based methods have low consistency of assignment and low document frequency for terms. Inje et al., [62] used a pattern-based taxonomy (is-a relation) to represent documents rather than single words. The computational cost is reduced by pruning unwanted patterns, which improves the effectiveness of the system.

Bayardo et al., [63] evaluated the Max-Miner algorithm for mining maximal frequent itemsets from large databases. Max-Miner reduces the space of itemsets considered through superset-frequency-based pruning. There are performance improvements over Apriori-like algorithms when frequent itemsets are long, and more modest though still substantial improvements when frequent itemsets are short. Completeness at low supports on complex datasets is not achieved.
Jan et al., [64] [65] proposed propositionalization and classification methods that employ long first-order frequent patterns for text mining. The framework solves three text mining tasks: information extraction, morphological disambiguation, and context-sensitive text correction. The propositionalization approach outperforms CBA by using frequent patterns as features; the performance of CBA classifiers greatly depends on the number of class association rules and the threshold values given by the user. The framework shows that distributed computation can improve the performance of both methods, since a large sample of data and a larger number of features are extracted.

Seno et al., [66] proposed the SLPMiner algorithm, which finds all sequential patterns satisfying a length-decreasing support constraint and performs effectively as the average length of the sequences increases. It is expensive, as pruning is not considered in this work. Nizar et al., [67] demonstrate a taxonomy of sequential pattern mining techniques. The search space can be reduced by strongly minimizing the support count. Domain knowledge and distributed sequences are not considered in the mining process.

# c) Mining Frequent Sequences

To extract sequential patterns, various algorithms have been executed that make repeated scans of the database and use hash structures. Zaki et al., [68] presented a novel algorithm, SPADE, for discovering sequential patterns at high speed. SPADE decomposes the parent class into small subclasses, and these subproblems are solved independently of one another in main memory using a lattice approach. The lattice approach needs only one scan given some pre-processed data. Depth-first search and breadth-first search are used for frequent sequence enumeration within each sublattice. With these search strategies, SPADE minimizes computational and I/O costs by reducing the number of database scans, and it provides pruning strategies to identify interesting patterns and prune out irrelevant ones. BFS outperforms DFS by having more information available for pruning while constructing the sets of 2-sequences and 3-sequences, but BFS requires more main memory: BFS keeps track of the id-lists for all classes, while DFS needs to preserve intermediate id-lists only for two consecutive classes along a given path.

Han et al., [69] proposed the FP-tree (frequent pattern tree) structure, from which the complete set of frequent patterns can be extracted by pattern fragment growth. Three techniques achieve mining efficiency: (i) the database is compressed, (ii) the FP-tree avoids expensive repeated database scans, and (iii) the FP-tree prevents the generation of large numbers of candidate sets, using a divide-and-conquer method which breaks the mining task into a set of smaller tasks with a lower search space. The FP-growth method [70] is efficient and scalable for extracting both long and short frequent patterns, and it is faster than the Apriori algorithm.

Zhang et al., [71] executed the CFP (Constrained Frequent Pattern) algorithm to improve the efficiency of association rule mining. The algorithm is incorporated in an interrelation analysis model for celestial spectrum data and extracts correlations among the characteristics of the celestial spectra. The model does not support different application domains.
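As a usage illustration of FP-tree-based mining, the sketch below assumes the third-party mlxtend library and a toy transaction database; it is not any of the cited implementations.

```python
# Frequent itemsets via an FP-growth implementation: one-hot encode the
# transactions, then mine without explicit candidate generation.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
    ["bread"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```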
# d) Mining Frequent Itemsets using MapReduce

Database management systems have evolved over the last four decades and are now functionally rich, but operating on and managing very large amounts of business data remains a challenging task. MapReduce [72] [73] is a framework that processes and manages very large datasets in distributed clusters efficiently and achieves parallelism. Xun et al., [74] [75] executed the FiDoop algorithm using the MapReduce model. FiDoop uses frequent itemsets of different lengths to improve the workload balance metric across clusters. It handles very high dimensional data efficiently but does not work on heterogeneous clusters for mining frequent itemsets. Wang et al., [76] proposed the FIMMR (Frequent Itemset Mining MapReduce) framework algorithm, which initially extracts local frequent itemsets, applies a pruning technique, and later mines global frequent itemsets. The speedup of the algorithm is satisfactory under low minimum support thresholds. Ramakrishnudu et al., [77] find infrequent itemsets from huge data using the MapReduce framework; the efficiency of the framework increases as the size of the data increases, and the framework produces few intermediate items during the process. Ozkural et al., [78] extract frequent itemsets by partitioning the graph with a vertex separator, so that item distributions can be mined independently. The parallel frequent itemset algorithm replicates the items that correlate with the separator, minimizing redundancy and achieving load balancing. Relationships among very large numbers of items in real-world databases are not incorporated.

# e) Relevance Feedback Documents

Xu et al., [79] presented an Expectation Maximization (EM) algorithm for relevance feedback that handles overlaps in feedback documents. Based on the Dirichlet compound multinomial (DCM) distribution, EM includes background collection model reduction through deterministic annealing and query-based regularization. Queries that have no relevance feedback need improvement by combining pseudo-relevance feedback and relevance feedback in a hybrid feedback paradigm. Instead of using static regularization, the authors adjust the regularization parameter based on the percentage of relevant feedback documents [80]. Further, the design formulates the space for a newer document progressively; weighted relevance is computed in an experimental design which exploits the top retrieved documents by adjusting the selection scheme. The relevance score algorithms still need to be validated on several TREC datasets.

Cao et al., [81] re-examined the assumption that the most frequent terms in pseudo-feedback documents are useful and proved that it does not hold in reality. Good and bad expansion terms cannot be distinguished within the feedback documents: exploiting the difference in term distributions between feedback documents and the whole collection through a mixture model shows that good and bad expansion terms may have similar distributions, so the mixture model fails to distinguish them. Experiments are conducted so that each query keeps only the good expansion terms: a new query model integrates the good terms, and term classification is performed to improve retrieval effectiveness. In the final query model, the classification score is used to enhance the weight of the good terms. The selected expansion terms are significantly better than traditional expansion terms when evaluated on three TREC datasets; the selection of terms has to be done carefully.
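A classic way to realize such feedback is Rocchio-style query modification, q' = αq + β·centroid(relevant) − γ·centroid(non-relevant). The sketch below is a minimal illustration assuming scikit-learn, with illustrative weights and documents, not the cited expansion-term models.

```python
# Rocchio-style relevance feedback: move the query vector toward relevant
# documents and away from non-relevant ones, then read off expansion terms.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["traffic jam on highway", "traffic accident report", "recipe for pasta"]
vec = TfidfVectorizer()
D = vec.fit_transform(docs).toarray()
q = vec.transform(["traffic"]).toarray()[0]

relevant, nonrelevant = D[:2], D[2:]
alpha, beta, gamma = 1.0, 0.75, 0.15      # illustrative Rocchio weights
q_new = alpha * q + beta * relevant.mean(axis=0) - gamma * nonrelevant.mean(axis=0)
q_new = np.clip(q_new, 0, None)           # keep term weights non-negative

top = np.argsort(q_new)[::-1][:3]
print([vec.get_feature_names_out()[i] for i in top])  # candidate expansion terms
```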
Pak et al., [82] proposed an automatic query expansion algorithm which incorporates an incremental blind-feedback approach to choose feedback documents from the top retrieved lists and then finds terms by aggregating the scores from each feedback document. The algorithm performs significantly better on large documents.

Algarni et al., [83] proposed adaptive relevance feature discovery (ARFD), which updates the system's knowledge using a sliding window over positive and negative feedback. The system selects training documents from which specific features are discovered, and various methods are used to merge and revise the weights of the features in a vector space. Documents are selected based on two scenarios: either the user provides information on a topic of interest, or the user has changed the topic of interest.

# IV. Text Classification and Clustering

Text categorization [84] is a significant issue in text mining. In general, documents contain large amounts of text, and it is necessary to classify them into specific classes. Text categorization can be broadly divided into supervised and unsupervised classification. Classifying documents manually is a very costly and time-consuming task; hence it is necessary to construct automatic text classifiers from pre-classified sample documents, whose time efficiency and accuracy are much better than manual text classification. Computer programs often treat the document as a bag of words. The main characteristic of text categorization is a feature space of high dimensionality: even for moderately sized text documents, the feature space consists of hundreds of thousands of terms.

Sebastiani et al., [85] review the standard machine learning approaches to text categorization. The review also describes the problems faced in document representation, classifier construction, and classifier evaluation. The experimental study compares different classifiers on different versions of the Reuters dataset. Text categorization is a good benchmark for checking whether a given learning technique can scale up to substantial sizes.

Irfan et al., [86] review different pre-processing techniques in text mining for extracting textual patterns from social networking sites. To explore the unstructured text available on the social web, the basic text mining approaches of classification and clustering are covered.

Wu et al., [87] present a technique consisting of three preprocessing stages to recognize text regions in images of large size and contrast. A segmentation algorithm alone cannot identify the changes that happen in both the color and the illumination of characters in a document image. The technique extracts grayscale images, for example from a book cover or magazine, from the RGB planes with associated weights. A multilevel thresholding process is applied to each grayscale image independently to identify text regions, and a recursive filter is executed to determine which connected components are textual components. A scoring approach is used to find the probabilistic text regions of the resulting images: if a region has the maximum score, it is classified as a textual component.
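A minimal supervised text-categorization pipeline of the kind surveyed in this section, bag-of-words features feeding a linear classifier, can be sketched as follows; the training data and the expected prediction are illustrative, and this is not any one cited system.

```python
# Supervised text categorization: TF-IDF bag-of-words features + linear SVM.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["goal scored in the match", "election results announced",
               "team wins the cup", "parliament passes new bill"]
train_labels = ["sport", "politics", "sport", "politics"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["the team lost the final match"]))  # expected: ['sport']
```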
# V. Discovering Facets for Queries from Search Results

A facet is a word or a phrase, and a query facet is a set of items which summarize an important aspect of a query. Dou et al., [88] [89] [90] explore the problem of finding the set of facets for a user query. A system called QDMiner is proposed to mine facets automatically. Experiments are conducted on hundreds of queries, and the results show the effectiveness of the system, as shown in Table 5. It provides interesting knowledge about a query and improves search for users in different ways. The problem of generating query suggestions based on query facets, which might help users find a better query more easily, is not considered.

Multifaceted search is an important paradigm for extraction and mining applications that lets users analyze and navigate through multidimensional data. Faceted search [91] can also be applied to the spoken web search problem, indexing the metadata associated with audio content to provide an audio search solution for rural people. The query interface ensures that a user is able to narrow the search results quickly. The approach focuses on the indexing system and does not generate precision-recall results on a labeled dataset.

Kong et al., [96] incorporated user feedback on query facets into document ranking, evaluating the boolean filtering feedback models that are widely used in conventional faceted search, where facets are generated automatically for a user query instead of for a complete corpus. The boolean filtering model is less effective than soft ranking models.

Bron et al., [97] proposed a novel framework that adds type filtering based on the category information available in Wikipedia. Combining a language modelling approach with heuristics based on Wikipedia's external links, the framework achieves high recall scores in finding the homepages of top-ranked entities. The model returns entities that have not been judged.

Navarro et al., [98] developed an automatic facet generation framework for efficient document retrieval. A new approach to extracting facets is developed which is both domain independent and unsupervised, and it generates multifaceted topics effectively. Subtopics in the text collection are not investigated.

Liu et al., [99] presented a study of exploring topical lead-lag across corpora. Determining which text corpus leads and which lags on a topic is a big challenge. TextPioneer, a visual analytics tool, is introduced to investigate lead-lag across corpora from the global level to the local level. Multiple perspectives on the results are conveyed by two visualizations: global lead-lag as a hybrid tree, and local lead-lag as a twisted ladder. TextPioneer does not analyze topics within each corpus and across corpora.

Jiang et al., [100] presented the Cross-Lingual Query Log Topic Model (CL-QLTM) to derive the latent topics of web search data from query logs. The model incorporates different languages by collecting co-occurrence relations and cross-lingual dictionaries from query logs. CL-QLTM is effective and superior in discovering latent topics, but the model has not been applied to statistical machine translation.

Cafarella et al., [101] exploited interesting knowledge from web pages which has higher relevance to users when compared to traditional approaches. The system records co-occurrences of schema elements and helps users in navigation, creating synonyms for schema matching.

WordNet Domains has been used to derive facet hierarchies for text documents. The queries given by the user are free-text queries, and mapping keywords to the different attributes and values of a given entity is a challenging task.
Castanet is simple and effective and achieves higher quality results than other automated category creation algorithms, but WordNet is not exhaustive, and further mechanisms are needed to improve coverage for unknown terms.

Pound et al., [102] proposed a solution that exploits user faceted-search behaviour and structured data to estimate facet utility. The approach captures values and conditional values, providing attributes and values according to user preferences. Experimental results show that the approach is scalable and also outperforms popular commercial systems.

Altingovde et al., [103] demonstrate a static index pruning technique that incorporates query views, both document-centric and term-centric. The technique improves the quality of the top-ranked results, but when web pages change frequently, the original index is not updated.

Koutris et al., [104] proposed a framework for pricing data based on queries. A polynomial-time algorithm is executed for a large class of conjunctive queries, and the results show that the data complexity of instance-based determinacy is coNP-complete. The framework does not explore the interaction between pricing and privacy.

Liu et al., [106] developed a tool that automatically differentiates structured data from search results. A feature-type-based approach is introduced which identifies valid features and evaluates the quality of features using exact and heuristic computation methods. The method achieves local optimality and avoids dependency on random initialization. Result differentiation (whether the selected features are of interest to users or not) is not incorporated.

Liu et al., [107] proposed a matrix representation to explore collections of documents based on user interests. A multidimensional visualization is presented to overcome users' difficulty in comparing across different facet values. The approach further enables visual ordering based on facet values to support cross-facet comparison of items and to support users in exploration tasks. Intra-document details are unavailable, and visual scalability is not incorporated.

The authors of [105] proposed two methodologies for extracting user tasks when users search for relevant data in a search engine. The method identifies user query logs and aggregates similar user tasks based on supervised and unsupervised approaches. It is effective in detecting similar latent needs from a query log, but users' task-by-task search behaviour is not represented in the model.

Colini et al., [111] [112] propose a multiple-keyword mechanism for search auctions with budgets and bidders, where each bidder is bounded by multiple slots per keyword. Bidders have cumulative valuations over click-through rates and budgets, which frame the overall study of the multiple-keyword mechanism. The mechanism is incentive compatible, optimal, and rational in expectation. In the combinatorial setting, each bidder is directly involved in a subset of keywords; deterministic mechanisms with tempered marginal valuations are incompatible.

Wu et al., [113] introduced the concept of safe zones and study the moving top-k keyword query. Safe zones save time and communication cost; the approach computes the safe zone in order to optimize server-side computation and to establish client-server communication. Spatial keywords are not processed, and the safe zone does not account for the future path of the moving query.
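The core scoring idea behind such top-k spatial keyword queries, combining textual relevance with spatial proximity, can be sketched as below; the linear weighting, data, and distance handling are illustrative simplifications, not the cited algorithms.

```python
# Toy top-k spatial keyword query: rank objects by a weighted combination of
# keyword overlap and inverse spatial distance to the query location.
import math

objects = [
    {"name": "cafe alpha", "text": "coffee cake wifi", "loc": (0.0, 0.0)},
    {"name": "cafe beta",  "text": "coffee espresso",  "loc": (3.0, 4.0)},
    {"name": "bookstore",  "text": "books magazines",  "loc": (1.0, 1.0)},
]

def score(obj, query_terms, query_loc, alpha=0.5):
    terms = obj["text"].split()
    text_rel = len(set(query_terms) & set(terms)) / len(query_terms)
    dist = math.dist(obj["loc"], query_loc)
    return alpha * text_rel + (1 - alpha) / (1 + dist)

def top_k(query_terms, query_loc, k=2):
    return sorted(objects, key=lambda o: score(o, query_terms, query_loc),
                  reverse=True)[:k]

print([o["name"] for o in top_k(["coffee"], (0.5, 0.5))])
```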
Lu et al., [114] proposed the reverse spatial keyword k-nearest-neighbour query to find the objects for which the query object is one of their nearest neighbours. The search is based on spatial location as well as the text associated with it. The algorithm prunes unnecessary objects and computes the candidate lists. The method does not consider the textual descriptions of two different objects.

Cao et al., [115] demonstrate the concept of weighting a query. Spatial keyword matching considers both the location and the text, and the method focuses on finding queries for groups of objects by grouping spatial objects. Top-k spatial keyword search with query weighting improves performance and efficiency; the computational time is reduced, but partial coverage of queries is not considered.

Hon et al., [95] developed space-efficient frameworks for the top-k string retrieval problem, considering two relevance metrics: frequency and proximity. A threshold-based approach on these metrics is also used. Compact-space and succinct-space indexes are derived, which yield index space and query time with significant robustness. The framework is robust, but it does not index in the cache-oblivious model, the index takes twice the size of the text, and multiple patterns are not handled.

Zhang et al., [94] proposed SPP (Space Partition and Probing) to keep track of object positions and relevance to the query and to search the vector space. Quality is achieved using MMR, one of the important diversification algorithms. The method identifies the next top-k objects very quickly; SPP helps in reducing the object axis and increases performance. A fixed bounded region is not considered.

Zhang et al., [93] proposed the inverted linear quadtree (IL-quadtree) index structure, which combines spatial and keyword-based techniques to effectively decrease the search space. Spatial keyword queries come in two variants: top-k spatial keyword search (TOPK-SK), which fetches the closest k objects containing all keywords in the query, and batch top-k spatial keyword search (BTOPK-SK), which processes sets of top-k queries. The IL-quadtree first uses a keyword-first index to retrieve the related inverted indexes; a partition-based method is proposed to further enhance the filtering capability of the signature of the linear quadtree.

Efstathiades et al., [92] present Links of Interest (LOI) to improve the quality of users' queries. A query processing method based on k-Relevant Nearest Neighbor (k-RNN) queries is proposed to analyse LOI information and retrieve relevant location-based points of interest, as shown in Table 3. The method captures the relevance aspect of the data, but a relevance score is not computed.

Catallo et al., [108] proposed a probabilistic k-skyband to process the subset of sliding-window objects that are the most recent data objects. The algorithm outperforms for large values of the parameter k, both in memory consumption and in time reduction. Adaptive top-k processing is not incorporated in the approach.

Bast et al., [109] presented pre-processing techniques to achieve interactive query times on large text collections. Two similarity measures are considered: first, matching query terms to similar terms in the collection; second, matching query terms to terms with a similar prefix in the collection. Results are displayed quickly, and the techniques are efficient and scalable.

Termehchy et al., [110] introduced an approach for searching XML structures by keyword effectively, since traditional keyword search techniques do not support data-centric XML well. They put forth Coherency Ranking (CR), a database-design-independent ranking method for XML keyword queries that is based on extending the concepts of data dependencies and mutual information. With the concepts of CR, prior approaches to XML keyword search are analyzed, and approximate coherency ranking and efficient algorithms are presented to process queries and rank their answers. CR shows better precision and recall and provides better ranking than prior approaches.

# VI. Fine Grained Knowledge

Guan et al., [116] suggested the "tcpdump" method to capture users' web surfing activities. Web surfing activity reflects a person's fine-grained knowledge, which can be discovered by recognizing its semantic structures. A Dirichlet process infinite Gaussian mixture model is adopted for session clustering, and the D-iHMM process is employed for mining the fine-grained aspects of each session. Discovering the fine-grained knowledge reflected in people's interactions has made knowledge sharing in collaborative environments much easier, although privacy is a major issue.

Wang et al., [117] analysed users' search behaviors, considering inter-query dependencies. A semi-supervised clustering model based on the SVM framework is proposed. The model enables a more comprehensive understanding of users' search behaviors via query search logs and facilitates the development of search-engine support for long-term tasks. The performance of the model is superior in identifying cross-session searches. User modeling and long-term task-based personalization are not considered.

Kotov et al., [118] proposed a method for creating a semi-automatically labeled dataset that can be used to identify a user's queries from earlier sessions on the same task and to predict whether the user will return to the same task in a later session. Using logistic regression and MART classifiers, the method can effectively model cross-session information needs. The model is not incorporated in commercial search engines.

# VII. Duplicate Detection and Data Fusion

Duplicate detection is the methodology of identifying multiple semantic representations of the same real-world entities. Present-day detection methods need to process ever larger datasets in ever less time, which makes maintaining the overall quality of the datasets harder. Papenbrock et al., [119] proposed progressive duplicate detection methods, shown in Table 4, which find duplicates efficiently and reduce the overall processing time by reporting most of the results earlier than the existing classical approaches. Bano et al., [120] executed an innovative windows algorithm that adapts the window for duplicates and non-duplicates, so that unnecessary comparisons are avoided.

Duplicate records are a vital problem and a concern in knowledge management [124]. To extract duplicate data items, an entity resolution mechanism is employed in the cleanup procedure. The overall evaluation reveals that the clustering algorithms perform extraordinarily well, with high accuracy and F-measure.
Whang et al., [125] investigate enhancing entity resolution (ER) by focusing on the likely matching records first, using three types of hints that are compatible with different ER algorithms: (i) an ordered list of records, (ii) a sorted list of record pairs, and (iii) a hierarchy of record partitions. The underlying disadvantage of the process is that it is useful only for database contents.

Duplicate records do not share a common key, which makes duplicate matching a tedious task. Errors are induced as the result of transcription errors, incomplete information, and the lack of standard formats. Abraham et al., [126] [127] provide a survey of different techniques used for detecting duplicates in both XML and relational data, using elimination rules to detect duplicates in databases. Elmagarmid et al., [128] present an intensive analysis of the literature on duplicate record detection, covering the various similarity metrics that detect duplicate records in the available information. The strength of the survey is its analysis of the statistics and machine learning work aimed at developing more refined matching techniques based on probabilistic models.

Deduplication is an important issue in the era of huge databases [129]. Various indexing techniques have been developed to reduce the number of record pairs to be compared in the matching process. The candidates generated by these techniques have high efficiency and scalability and have been evaluated using various data sets. Training data in the form of true matches and true non-matches is often unavailable in real-world applications, and the choice of blocking keys is commonly left to domain and linkage experts.

Papadakis et al., [122] presented blocking methods for clean-clean ER over Highly Heterogeneous Information Spaces (HHIS) through an innovative framework which comprises two orthogonal layers: the effectiveness layer incorporates methods for building blockings with a small probability of missed matches, while the efficiency layer comprises a rich variety of techniques that restrict the required number of pairwise comparisons. Papadakis et al., [123] focus on boosting the overall blocking efficiency of this quadratic task for entity resolution over large, noisy, and heterogeneous information spaces.

The problem of merging many large databases is often encountered in KDD. It is usually referred to as the Merge/Purge problem and is difficult to solve at scale with accuracy. Record linkage [130] is a well-known data integration strategy that uses sets for merging, matching, and eliminating duplicate records in large and heterogeneous databases. The suffix grouping methodology facilitates the causal ordering used by the indexes for merging blocks with the least marginal extra cost, resulting in high accuracy. Efficient grouping of similar suffixes is carried out by incorporating a sliding window technique. The method is helpful for various health records in understanding patients' details, but it is not very efficient, as it concentrates only on blocking and not on the windowing technique. Additionally, the methodology detects duplicates approximately using a state-of-the-art scalable paradigm [131]; it is quite helpful in creating clusters of records.
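To make the blocking-plus-windowing idea concrete, the sketch below sorts records on a crude blocking key and compares only records inside a sliding window; the key, similarity measure, and threshold are illustrative, not those of the cited systems.

```python
# Sorted-neighborhood style duplicate detection: sort on a blocking key, then
# compare only records within a sliding window over the sorted order.
from difflib import SequenceMatcher

records = ["John Smith, Berlin", "Jon Smith, Berlin",
           "Jane Doe, Paris", "J. Smith, Berlin"]

def blocking_key(record):
    # crude key: sorted, lowercased tokens, truncated
    return "".join(sorted(record.lower().split()))[:8]

def find_duplicates(records, window=3, threshold=0.8):
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:   # only nearby records
            if SequenceMatcher(None, rec, other).ratio() >= threshold:
                pairs.append((rec, other))
    return pairs

print(find_duplicates(records))  # the three "Smith, Berlin" variants pair up
```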
Bronselaer et al., [132] focused on an information aggregation approach which combines information and rules available from independent sources into a summary. Information aggregation is investigated in the context of inferring objects from several entity relations. Complex objects are composed using merge functions for atomic and subatomic objects, in such a way that the composite function inherits the properties of the merge functions.

The Sorted Neighborhood Method (SNM) studied by Draisbach et al., [133] partitions the data set, and comparisons are performed within the jurisdiction of each partition. A window is then advanced over the data, comparing the records that appear within the range of the same window. The Duplicate Count Strategy (DCS), a variation of SNM, regulates the window size, and DCS++ is proposed, which is much better than the original SNM in terms of efficiency. The disadvantage is that the window size is fixed and is expensive to select and operate; some duplicates might be missed when large windows are used.

Tuples in the relational structure of a database that represent the same real-world entity are described as duplicates. Deleting these duplicates and replacing them with a single tuple that represents the joint information of the duplicate tuples up to a maximum level is termed fusion. The removal of the original duplicate tuples can violate referential integrity. Bronselaer et al., [121] describe a technique to maintain referential integrity: a fusion propagation algorithm based on first-order and second-order fusion derivatives resolves conflicts and clashes. Traditional referential integrity strategies, like DELETE cascading, are highly sophisticated; execution time and the recursive calls of the propagation algorithm increase as the length of the chain of linked relations increases.

Bleiholder et al., propose SQL Fuse, covering its schema and semantics, architecture, query language, and query execution. The final step of actually aggregating data from multiple heterogeneous sources into a consistent and homogeneous dataset is often neglected. Naumann et al., [134] observe that noisy data abound from several data sources; without suitable techniques for integrating and fusing noisy data with deviations, the quality of data in an integrated system remains extremely low. Tentative and declarative integration of noisy and scattered data requires incorporating schema matching, duplicate detection, and fusion. Subject to an SQL-like query against a set of table instances, instance-oriented schema matching bridges the varied tables by aligning their corresponding attributes; a duplicate detection technique is then used to find multiple representations of the same entities; finally, data fusion resolves conflicts and merges the representations. Bleiholder et al., [135] explain a conceptual classification of different data fusion operators; numerous techniques are based on standard and advanced relational algebra operators and SQL.
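The merge-function view of fusion can be sketched as below: duplicate records are replaced by one record whose attributes are resolved by per-attribute conflict-resolution functions. The functions and records are illustrative, not the cited operators.

```python
# Data fusion via per-attribute merge functions: resolve conflicting values
# among duplicate records into one fused record.
def longest(values):
    # prefer the most complete string
    return max(values, key=len)

def most_recent(values):
    # values are (timestamp, value) pairs; keep the newest value
    return max(values)[1]

duplicates = [
    {"name": "J. Smith", "city": "Berlin", "updated": (2015, "Berlin")},
    {"name": "John Smith", "city": "", "updated": (2016, "Munich")},
]

fused = {
    "name": longest([r["name"] for r in duplicates]),
    "city": most_recent([r["updated"] for r in duplicates]),
}
print(fused)  # {'name': 'John Smith', 'city': 'Munich'}
```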
Table 2 summarizes representative duplicate detection techniques.

Table 2: Duplicate detection techniques.

| Sl. No. | Authors | Algorithm | Window selection | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | Papenbrock et al. (2015) [119] | PSNM | Adaptive | Efficient with limited execution time | Delivers results moderately |
| 2 | Bano et al. (2015) [120] | Innovative Window | Adaptive | Unnecessary comparisons are avoided | Does not support multiple datasets |
| 3 | Bronselaer et al. (2015) [121] | Fusion-Propagation | - | Conflicts in relationship attributes are resolved | More expensive |
| 4 | Papadakis et al. (2013) [122] | Attribute Clustering | - | Effective on real-world datasets | Low-quality blocks; parallelization is not adopted |
| 5 | Papadakis et al. (2011) [123] | - | Adaptive | Time complexity is reduced | Process is very slow |

Tuples in the relational structure of a database that describe the same real-world entity are called duplicates. Deleting these duplicates and replacing them with other tuples that represent the joint information of the duplicates up to a maximum level is a delete-then-replace mode of operation termed fusion. The removal of the original duplicate tuples, however, can violate referential integrity. Bronselaer et al., [121] describe a technique to maintain referential integrity: a fusion propagation algorithm based on first- and second-order fusion derivatives resolves the resulting conflicts and clashes. Traditional referential integrity strategies, such as DELETE cascading, are highly sophisticated, and both execution time and the number of recursive calls to the propagation algorithm increase with the length of the chain of linked relations. Bleiholder et al., propose the SQL FUSE BY statement, defining its schema and semantics and addressing architecture, query languages and query execution; the final step of actually aggregating data from multiple heterogeneous sources into a consistent and homogeneous dataset is often not considered. Naumann et al., [134] observe that noisy data are abundant across data sources; without suitable techniques for integrating and fusing noisy data with deviations, the quality of data in an integrated system remains extremely low. Tentative and declarative integration of noisy and scattered data is therefore needed, incorporating schema matching, duplicate detection and fusion. Given an SQL-like query against a set of table instances, instance-oriented schema matching bridges the gap between the varied tables by aligning their corresponding attributes; a duplicate detection technique then finds the multiple representations of matching entities; finally, data fusion resolves the remaining conflicts and merges the duplicate representations. Bleiholder et al., [135] explain a conceptual classification of the different operators for data fusion, with numerous techniques based on standard and advanced relational algebra operators and SQL. The concept of co-clustering has been explored through several techniques for tapping the rich meta-tag information associated with multimedia web documents, including annotations, descriptions and associations. The varied co-clustering mechanisms proposed for linked data obtained from multiple sources do not tackle the representation problem of short, noisy texts, but rather improve performance through empirical measurement of the multi-modal features. The two-channel Heterogeneous Fusion ART (HF-ART) fuses multiple channels divergently; its generalization GHF-ART [136] is designed to effectively represent multimedia content and incorporates meta data to handle short and noisy texts. GHF-ART is not trained directly on the text features; instead, key tags are identified by training on the probabilistic distribution of tag occurrences. The approach also incorporates a highly adaptive method for the active and efficient fusion of multi-modal features.

1) Feature Mining for Text Mining: Li et al., [43] designed a new technique to discover positive and negative patterns in text documents. Both relevant and irrelevant documents contain useful features; in order to remove noise, the negative documents in the training set are used to improve the effectiveness of the Pattern Taxonomy Model (PTM). Two algorithms, HLF mining and N-revision, were introduced.

2) Feature Extraction for Classification: Kadhim et al., [50] [21] developed two weighting methods, TF-IDF (Term Frequency/Inverse Document Frequency) and TF-IDF global, to reduce the dimensionality of datasets, since processing the original thousands of features is very difficult. A fuzzy c-means clustering algorithm is used for feature extraction for classification.
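The TF-IDF weighting underlying such schemes can be stated compactly in code. The snippet below is a minimal, self-contained Python sketch of one common TF-IDF variant, not the exact global weighting scheme of [50]; the three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term of each tokenized document by
    tf = count / document length and idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts, length = Counter(doc), len(doc)
        weights.append({term: (c / length) * math.log(n / df[term])
                        for term, c in counts.items()})
    return weights

docs = [["text", "mining", "patterns"],
        ["text", "classification"],
        ["pattern", "mining", "mining"]]
for w in tf_idf(docs):
    print(w)
```

Terms that occur in every document receive weight zero under this variant, which is precisely the dimensionality-reducing effect the weighting is meant to achieve: ubiquitous terms carry no discriminative information.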
3) PCA and Random Projection (RP): Principal Component Analysis (PCA) is a simple technique used to explore and visualize data easily. It extracts useful information from complicated data sets with a non-parametric method and determines a lower-dimensional space statistically. The transformation matrix of PCA is calculated from the eigenvalue decomposition of the covariance matrix, so the computation cost is high and the method is not suitable for very high-dimensional data. The strength of PCA is that there are no parameters to fine-tune and no coefficients to adjust. Fradkin et al., [51] [52] reported a number of experiments evaluating random projection in supervised learning. Different datasets were tested to compare random projection and PCA using several machine learning methods. The results show that PCA outperforms RP for supervised learning; they also show that RPs are well suited for use with nearest-neighbour and SVM classifiers and are less satisfactory with decision trees.
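The trade-off described above can be seen directly in code. The following NumPy sketch contrasts PCA computed via eigendecomposition of the covariance matrix, as described above, with a data-independent Gaussian random projection; the data are synthetic and the target dimensionality k = 10 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 samples, 50 features
k = 10                                  # target dimensionality

# PCA: eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)                 # centre the data
cov = np.cov(Xc, rowvar=False)          # 50 x 50 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
X_pca = Xc @ top                        # project onto top-k components

# Random projection: a data-independent Gaussian matrix; no covariance
# computation, so it is far cheaper in high dimensions.
R = rng.normal(size=(50, k)) / np.sqrt(k)
X_rp = X @ R

print(X_pca.shape, X_rp.shape)          # (200, 10) (200, 10)
```

Because the projection matrix R never looks at the data, RP avoids the covariance computation entirely, which is why it scales to very high-dimensional data where eigendecomposition-based PCA becomes impractical.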
Table 1 compares representative feature selection algorithms.

Table 1: Feature selection algorithms.

| Sl. No. | Authors | Feature Selection (FS) | Algorithm | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | Zhao et al. (2016) [25] | Unsupervised | Gradient | Preserves similarity and discriminant information; high clustering performance is achieved | Supervised FS is not considered |
| 2 | Xu et al. (2016) [26] | - | Deep Learning | Performs better than traditional dimensionality reduction methods | Meta data information of tweets is not considered |
| 3 | Wang et al. (2015) [27] | Supervised and Unsupervised | Global Redundancy Minimization | Features are more compact and discriminant; superior performance without parameter tuning | - |

# VIII. Conclusions

The paper presents different techniques and frameworks for extracting relevant features from huge amounts of unstructured text documents, and reviews various text classification, clustering and summarization methods. Guaranteeing the quality of the relevant features extracted from a collection of text documents is a great challenge. Many text mining techniques have been proposed to date; however, how interesting and useful the discovered features are to the user is still an open issue. Our future work is to efficiently separate relevant documents from non-relevant documents. An effective filtering model is required to automatically generate facets. The security of the extracted features, and the time needed to extract useful, duplicate-free features and fine-grained knowledge that help the user reduce the time spent searching numerous web pages, also need to be addressed.

Clustering: The process of grouping similar kinds of information is called clustering and results in finding interesting knowledge. The newly discovered knowledge can be used by an industry for further development and helps it compete with its competitors.

Question Answering: For separating and combining terms, standard text searching techniques use Boolean operators. Sophisticated search in text mining executes the search at the sentence or phrase level and identifies verbal connections between the search terms, which is not possible in traditional search. The results obtained by sophisticated search can be used to provide specific information that can be leveraged by an organization.

Concept Linkage: The results obtained from sophisticated search are linked together to produce a new hypothesis; this linking of concepts is called concept linkage. New domains of knowledge can be generated by making use of concept linkage.

# References

[1] R. Agrawal and M. Batra, "A Detailed Study on Text Mining Techniques," International Journal of Soft Computing and Engineering (IJSCE), vol. 2, no. 6, 2013.
[2] V. H. Bhat, P. G. Rao, R. Abhilash, P. D. Shenoy, K. R. Venugopal, and L. M. Patnaik, "A Data Mining Approach for Data Generation and Analysis for Digital Forensic Application," International Journal of Engineering and Technology, vol. 2, no. 3, 2010.
[3] Y. Zhang, M. Chen, and L. Liu, "A Review on Text Mining," Proceedings of the 6th IEEE International Conference on Software Engineering and Service Science (ICSESS), 2015.
[4] S. Shehata, F. Karray, and M. S. Kamel, "An Efficient Concept-based Mining Model for Enhancing Text Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, 2010.
[5] V. H. Bhat, P. G. Rao, R. Abhilash, P. D. Shenoy, K. R. Venugopal, and L. M. Patnaik, "A Novel Data Generation Approach for Digital Forensic Application in Data Mining," Proceedings of the Second International Conference on Machine Learning and Computing (ICMLC), 2010.
[6] D. E. Brown, "Text Mining the Contributors to Rail Accidents," IEEE Transactions on Intelligent Transportation Systems, vol. 27, no. 5, 2015.
[7] K. R. Venugopal, K. G. Srinivasa, and L. M. Patnaik, Soft Computing for Data Mining Applications, Springer, 2009.
[8] V. K. Verma, M. Ranjan, and P. Mishra, "Text Mining and Information Professionals: Role, Issues and Challenges," Proceedings of the 4th International Symposium on Emerging Trends and Technologies in Libraries and Information Services (ETTLIS), 2015.
[9] A. Akilan, "Text Mining: Challenges and Future Directions," Proceedings of the Second International Conference on Electronics and Communication Systems (ICECS), 2015.
[10] D. Sanchez, M. J. Martin-Bautista, I. Blanco, and C. Torre, "Text Knowledge Mining: An Alternative to Text Data Mining," Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), 2008.
[11] Y. Dai, T. Kakkonen, and E. Sutinen, "MinEDec: A Decision-Support Model that Combines Text-Mining Technologies with Two Competitive Intelligence Analysis Methods," International Journal of Computer Information Systems and Industrial Management Applications, vol. 3, no. 10, 2011.
[12] Y. Hu and X. Wan, "PPSGen: Learning-Based Presentation Slides Generation for Academic Papers," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 4, 2015.
[13] E. D'Andrea, P. Ducange, B. Lazzerini, and F. Marcelloni, "Real-Time Detection of Traffic from Twitter Stream Analysis," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 4, 2015.
[14] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, "TEDAS: A Twitter-based Event Detection and Analysis System," Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE), 2012.
[15] V. H. Bhat, V. R. Malkani, P. D. Shenoy, K. R. Venugopal, and L. M. Patnaik, "Classification of Email using BeaKS: Behavior and Keyword Stemming," Proceedings of the IEEE Region 10 Conference (TENCON), 2011.
[16] A. Schulz, P. Ristoski, and H. Paulheim, "I See a Car Crash: Real-Time Detection of Small Scale Incidents in Microblogs," The Semantic Web: ESWC Satellite Events, 2013.
[17] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, "Text Detection and Recognition on Traffic Panels from Street-Level Imagery using Visual Appearance," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 1, 2014.
[18] D. P. Muni, N. R. Pal, and J. Das, "Genetic Programming for Simultaneous Feature Selection and Classifier Design," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 1, 2006.
[19] M. H. Aghdam, N. Ghasem-Aghaee, and M. E. Basiri, "Text Feature Selection Using Ant Colony Optimization," Expert Systems with Applications, vol. 36, no. 3, 2009.
[20] K. G. Srinivasa, A. Singh, A. Thomas, K. R. Venugopal, and L. M. Patnaik, "Generic Feature Extraction for Classification using Fuzzy C-means Clustering," Proceedings of the 3rd International Conference on Intelligent Sensing and Information Processing, 2005.
[21] E. Gasca, J. S. Sánchez, and R. Alonso, "Eliminating Redundancy and Irrelevance using a New MLP-based Feature Selection Method," Pattern Recognition, vol. 39, no. 2, 2006.
[22] R. Parikh and K. Karlapalem, "ET: Events from Tweets," Proceedings of the 22nd International Conference on World Wide Web Companion, 2013.
[23] R. K. Sivagaminathan and S. Ramakrishnan, "A Hybrid Approach for Feature Subset Selection using Neural Networks and Ant Colony Optimization," Expert Systems with Applications, vol. 33, no. 1, 2007.
[24] D. Cai, C. Zhang, and X. He, "Unsupervised Feature Selection for Multi-cluster Data," Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
[25] Z. Zhao, X. He, D. Cai, L. Zhang, W. Ng, and Y. Zhuang, "Graph Regularized Feature Selection with Data Reconstruction," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, 2016.
[26] L. Xu, C. Jiang, Y. Ren, and H.-H. Chen, "Microblog Dimensionality Reduction: A Deep Learning Approach," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, 2016.
[27] D. Wang, F. Nie, and H. Huang, "Feature Selection via Global Redundancy Minimization," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 10, 2015.
[28] H. Ogura, H. Amano, and M. Kondo, "Feature Selection with a Measure of Deviations from Poisson in Text Categorization," Expert Systems with Applications, vol. 36, no. 3, 2009.
[29] N. Azam and J. Yao, "Comparison of Term Frequency and Document Frequency based Feature Selection Metrics in Text Categorization," Expert Systems with Applications, vol. 39, no. 5, 2012.
[30] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, 2007.
[31] Z. Zhao, L. Wang, H. Liu, and J. Ye, "On Similarity Preserving Feature Selection," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 3, 2013.
[32] Z. Zhao, X. He, L. Zhang, W. Ng, and Y. Zhuang, "Graph Regularized Feature Selection with Data Reconstruction," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, 2016.
[33] J. Tang and H. Liu, "Unsupervised Feature Selection for Linked Social Media Data," Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
[34] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo, "Harvesting Domain Specific Ontologies from Text," Proceedings of the IEEE International Conference on Semantic Computing (ICSC), 2014.
[35] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, "ℓ2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning," Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011.
[36] D. Cai, X. He, J. Han, and T. S. Huang, "Graph Regularized Nonnegative Matrix Factorization for Data Representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, 2011.
[37] D. Sejal, K. Shailesh, V. Tejaswi, D. Anvekar, K. R. Venugopal, S. Iyengar, and L. Patnaik, "QRGQR: Query Relevance Graph for Query Recommendation," Proceedings of the IEEE Region 10 Symposium (TENSYMP), 2015.
[38] W. Fan, N. Bouguila, and D. Ziou, "Unsupervised Hybrid Feature Extraction Selection for High-Dimensional Non-Gaussian Data Clustering with Variational Inference," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 7, 2013.
[39] V. Gupta and G. S. Lehal, "A Survey of Text Summarization Extractive Techniques," Journal of Emerging Technologies in Web Intelligence, vol. 2, no. 3, 2010.
[40] P. S. Negi, M. Rauthan, and H. Dhami, "Text Summarization for Information Retrieval using Pattern Recognition Techniques," International Journal of Computer Applications, vol. 21, no. 10, 2011.
[41] F. Debole and F. Sebastiani, "An Analysis of the Relative Hardness of Reuters-21578 Subsets," Journal of the American Society for Information Science and Technology, vol. 56, no. 6, 2005.
[42] F. Xie, X. Wu, and X. Hu, "Keyphrase Extraction based on Semantic Relatedness," Proceedings of the 9th IEEE International Conference on Cognitive Informatics (ICCI), 2010.
[43] Y. Li, A. Algarni, and N. Zhong, "Mining Positive and Negative Patterns for Relevance Feature Discovery," Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
[44] N. Zhong, Y. Li, and S.-T. Wu, "Effective Pattern Discovery for Text Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, 2012.
[45] Y.-C. Chen, W.-C. Peng, and S.-Y. Lee, "Mining Temporal Patterns in Time Interval-Based Data," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 12, 2015.
[46] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao, "Inference of Regular Expressions for Text Extraction from Examples," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, 2016.
[47] Y. Li, A. Algarni, M. Albathan, Y. Shen, and M. A. Bijaksana, "Relevance Feature Discovery for Text Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 6, 2015.
[48] Q. Song, J. Ni, and G. Wang, "A Fast Clustering-based Feature Subset Selection Algorithm for High-Dimensional Data," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, 2013.
[49] T.-S. Nguyen, H. W. Lauw, and P. Tsaparas, "Review Selection Using Micro-Reviews," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 4, 2015.
[50] A. I. Kadhim, Y. Cheah, N. H. Ahamed, and L. A. Salman, "Feature Extraction for Co-occurrence-based Cosine Similarity Score of Text Documents," Proceedings of the IEEE Student Conference on Research and Development (SCOReD), 2014.
[51] D. Fradkin and D. Madigan, "Experiments with Random Projections for Machine Learning," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[52] S. Joshi, D. Shenoy, P. Rashmi, K. R. Venugopal, and L. M. Patnaik, "Classification of Alzheimer's Disease and Parkinson's Disease by using Machine Learning and Neural Network Methods," Proceedings of the Second International Conference on Machine Learning and Computing (ICMLC), 2010.
[53] H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo, "Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections," Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries, 1998.
[54] X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets," Proceedings of the SIAM International Conference on Data Mining (SDM), 2003.
[55] A. Gomariz, M. Campos, R. Marin, and B. Goethals, "ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences," Advances in Knowledge Discovery and Data Mining, 2013.
[56] J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets," ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, vol. 4, no. 2, 2000.
[57] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[58] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proceedings of the International Conference on Data Engineering (ICDE), 2001.
[59] K. R. Venugopal and R. Buyya, Mastering C++, Tata McGraw-Hill Education, 2013.
[60] M. N. Garofalakis, R. Rastogi, and K. Shim, "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints," Proceedings of the International Conference on Very Large Data Bases (VLDB), 1999.
[61] N. Zhong, Y. Li, and S.-T. Wu, "Effective Pattern Discovery for Text Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, 2012.
[62] A. Inje and U. Patil, "Operational Pattern Revealing Technique in Text Mining," Proceedings of the IEEE Students' Conference on Electrical, Electronics and Computer Science (SCEECS), 2014.
[63] R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," ACM SIGMOD Record, vol. 27, no. 2, 1998.
[64] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proceedings of the International Conference on Machine Learning (ICML), 1997.
[65] P. D. Shenoy, K. G. Srinivasa, K. R. Venugopal, and L. M. Patnaik, "Dynamic Association Rule Mining using Genetic Algorithms," Intelligent Data Analysis, vol. 9, no. 5, 2005.
[66] M. Seno and G. Karypis, "SLPMiner: An Algorithm for Finding Frequent Sequential Patterns using Length-Decreasing Support Constraint," Proceedings of the IEEE International Conference on Data Mining, 2002.
[67] N. R. Mabroukeh and C. I. Ezeife, "A Taxonomy of Sequential Pattern Mining Algorithms," ACM Computing Surveys (CSUR), vol. 43, no. 1, 2010.
[68] M. J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 42, no. 1-2, 2001.
[69] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, 2004.
[70] V. P. Raju and G. S. Varma, "Mining Closed Sequential Patterns in Large Sequence Databases," International Journal of Database Management Systems, vol. 7, no. 1, 2015.
[71] J. Zhang, X. Zhao, S. Zhang, S. Yin, and X. Qin, "Interrelation Analysis of Celestial Spectra Data using Constrained Frequent Pattern Trees," Knowledge-Based Systems, vol. 41, no. 4, 2013.
[72] F. Li, B. C. Ooi, M. T. Özsu, and S. Wu, "Distributed Data Management using MapReduce," ACM Computing Surveys (CSUR), vol. 46, no. 3, 2014.
[73] N. Tiwari, S. Sarkar, U. Bellur, and M. Indrawan, "Classification Framework of MapReduce Scheduling Algorithms," ACM Computing Surveys (CSUR), vol. 47, no. 3, 2015.
[74] Y. Xun, J. Zhang, and X. Qin, "FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 3, 2016.
[75] S. Sakr, A. Liu, and A. G. Fayoumi, "The Family of MapReduce and Large-Scale Data Processing Systems," ACM Computing Surveys (CSUR), vol. 46, no. 1, 2013.
[76] L. Wang, L. Feng, J. Zhang, and P. Liao, "An Efficient Algorithm of Frequent Itemsets Mining based on MapReduce," Journal of Information and Computational Science, vol. 11, no. 8, 2014.
[77] T. Ramakrishnudu and R. Subramanyam, "Mining Interesting Infrequent Itemsets from Very Large Data based on MapReduce Framework," International Journal of Intelligent Systems and Applications, vol. 7, no. 7, 2015.
[78] E. Ozkural, B. Ucar, and C. Aykanat, "Parallel Frequent Item Set Mining with Selective Item Replication," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 10, 2011.
[79] Z. Xu and R. Akella, "Active Relevance Feedback for Difficult Queries," Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008.
[80] S. Desai, V. Chandrasheker, V. Mathapati, K. R. Venugopal, S. S. Iyengar, and L. M. Patnaik, "User Feedback Session with Clicked and Unclicked Documents for Related Search Recommendation," IADIS International Journal on Computer Science and Information Systems, vol. 11, no. 1, 2016.
[81] G. Cao, J.-Y. Nie, J. Gao, and S. Robertson, "Selecting Good Expansion Terms for Pseudo-Relevance Feedback," Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008.
[82] J. H. Paik, D. Pal, and S. K. Parui, "Incremental Blind Feedback: An Effective Approach to Automatic Query Expansion," ACM Transactions on Asian Language Information Processing (TALIP), vol. 13, no. 3, 2014.
[83] A. Algarni, Y. Li, and Y. Xu, "Selected New Training Documents to Update User Profile," Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[84] S. Niharika, V. S. Latha, and D. Lavanya, "A Survey on Text Categorization," International Journal of Computer Trends and Technology, vol. 3, no. 1, 2012.
[85] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1, 2002.
[86] R. Irfan, C. K. King, D. Grages, S. Ewen, S. U. Khan, S. A. Madani, J. Kolodziej, L. Wang, D. Chen, and A. Rayes, "A Survey on Text Mining in Social Networks," The Knowledge Engineering Review, vol. 30, no. 2, 2015.
[87] H. N. Vu, T. A. Tran, I. S. Na, and S. H. Kim, "Automatic Extraction of Text Regions from Document Images by Multilevel Thresholding and k-means Clustering," Proceedings of the IEEE/ACIS 14th International Conference on Computer and Information Science (ICIS), 2015.
[88] Z. Dou, Z. Jiang, S. Hu, J.-R. Wen, and R. Song, "Automatically Mining Facets for Queries from Their Search Results," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 2, 2016.
[89] D. Sejal, K. Shailesh, V. Tejaswi, D. Anvekar, K. R. Venugopal, S. Iyengar, and L. Patnaik, "Query Click and Text Similarity Graph for Query Suggestions," International Workshop on Machine Learning and Data Mining in Pattern Recognition, 2015.
[90] X. Shi and C. C. Yang, "Mining Related Queries from Web Search Engine Query Logs using an Improved Association Rule Mining Model," Journal of the American Society for Information Science and Technology, vol. 58, no. 12, 2007.
[91] M. Diao, S. Mukherjea, N. Rajput, and K. Srivastava, "Faceted Search and Browsing of Audio Content on Spoken Web," Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[92] C. Efstathiades, A. Efentakis, and D. Pfoser, "Efficient Processing of Relevant Nearest-Neighbor Queries," ACM Transactions on Spatial Algorithms and Systems (TSAS), vol. 2, no. 3, 2016.
[93] C. Zhang, Y. Zhang, W. Zhang, and X. Lin, "Inverted Linear Quadtree: Efficient Top-k Spatial Keyword Search," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, 2016.
[94] K. Pripužić, I. P. Žarko, and K. Aberer, "Time- and Space-Efficient Sliding Window Top-k Query Processing," ACM Transactions on Database Systems (TODS), vol. 40, no. 1, 2015.
[95] W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter, "Space-Efficient Frameworks for Top-k String Retrieval," Journal of the ACM (JACM), vol. 61, no. 2, 2014.
[96] W. Kong and J. Allan, "Extending Faceted Search to the General Web," Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 2014.
[97] M. Bron, K. Balog, and M. de Rijke, "Ranking Related Entities: Components and Analyses," Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[98] G. Navarro, "Spaces, Trees, and Colors: The Algorithmic Landscape of Document Retrieval on Sequences," ACM Computing Surveys (CSUR), vol. 46, no. 4, 2014.
[99] S. Liu, Y. Chen, H. Wei, J. Yang, K. Zhou, and S. M. Drucker, "Exploring Topical Lead-Lag Across Corpora," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, 2015.
[100] D. Jiang, Y. Tong, and Y. Song, "Cross-Lingual Topic Discovery from Multilingual Search Engine Query Log," ACM Transactions on Information Systems (TOIS), vol. 35, no. 2, 2016.
[101] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, "WebTables: Exploring the Power of Tables on the Web," Proceedings of the VLDB Endowment, vol. 1, 2008.
[102] J. Pound, S. Paparizos, and P. Tsaparas, "Facet Discovery for Structured Web Search: A Query-Log Mining Approach," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2011.
[103] I. S. Altingovde, R. Ozcan, and Ö. Ulusoy, "Static Index Pruning in Web Search Engines: Combining Term and Document Popularities with Query Views," ACM Transactions on Information Systems (TOIS), vol. 30, no. 1, 2012.
[104] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, "Query-Based Data Pricing," Journal of the ACM (JACM), vol. 62, no. 5, 2015.
[105] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, and G. Tolomei, "Discovering Tasks from Search Engine Query Logs," ACM Transactions on Information Systems (TOIS), vol. 31, no. 3, 2013.
[106] Z. Liu and Y. Chen, "Differentiating Search Results on Structured Data," ACM Transactions on Database Systems (TODS), vol. 37, no. 1, 2012.
[107] V. Thai, P.-Y. Rouille, and S. Handschuh, "Visual Abstraction and Ordering in Faceted Browsing of Text Collections," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 2, 2012.
[108] I. Catallo, E. Ciceri, P. Fraternali, D. Martinenghi, and M. Tagliasacchi, "Top-k Diversity Queries Over Bounded Regions," ACM Transactions on Database Systems (TODS), vol. 38, no. 2, 2013.
[109] H. Bast and M. Celikik, "Efficient Fuzzy Search in Large Text Collections," ACM Transactions on Information Systems (TOIS), vol. 31, no. 2, 2013.
[110] A. Termehchy and M. Winslett, "Using Structural Information in XML Keyword Search Effectively," ACM Transactions on Database Systems (TODS), vol. 36, no. 1, 2011.
[111] R. Colini-Baldeschi, S. Leonardi, M. Henzinger, and M. Starnberger, "On Multiple Keyword Sponsored Search Auctions with Budgets," ACM Transactions on Economics and Computation, vol. 4, no. 1, 2016.
[112] J. Arguello and R. Capra, "The Effects of Aggregated Search Coherence on Search Behavior," ACM Transactions on Information Systems (TOIS), vol. 35, no. 1, 2016.
[113] D. Wu, M. L. Yiu, and C. S. Jensen, "Moving Spatial Keyword Queries: Formulation, Methods, and Analysis," ACM Transactions on Database Systems (TODS), vol. 38, no. 1, 2013.
[114] Y. Lu, J. Lu, G. Cong, W. Wu, and C. Shahabi, "Efficient Algorithms and Cost Models for Reverse Spatial-Keyword k-Nearest Neighbor Search," ACM Transactions on Database Systems (TODS), 2014.
[115] X. Cao, G. Cong, T. Guo, C. S. Jensen, and B. C. Ooi, "Efficient Processing of Spatial Group Keyword Queries," ACM Transactions on Database Systems (TODS), vol. 40, no. 2, 2015.
[116] Z. Guan, S. Yang, H. Sun, M. Srivatsa, and X. Yan, "Fine-Grained Knowledge Sharing in Collaborative Environments," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 8, 2015.
[117] H. Wang, Y. Song, M.-W. Chang, X. He, R. W. White, and W. Chu, "Learning to Extract Cross-Session Search Tasks," Proceedings of the 22nd International Conference on World Wide Web, 2013.
[118] A. Kotov, P. N. Bennett, R. W. White, S. T. Dumais, and J. Teevan, "Modeling and Analysis of Cross-Session Search Tasks," Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011.
[119] T. Papenbrock, A. Heise, and F. Naumann, "Progressive Duplicate Detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, 2015.
[120] H. Bano and F. Azam, "Innovative Windows for Duplicate Detection," International Journal of Software Engineering and Its Applications, vol. 9, no. 1, 2015.
[121] A. Bronselaer, D. Van Britsom, and G. De Tré, "Propagation of Data Fusion," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, 2015.
[122] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée, and W. Nejdl, "A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces," IEEE Transactions on Knowledge and Data Engineering, 2013.
[123] G. Papadakis and W. Nejdl, "Efficient Entity Resolution Methods for Heterogeneous Information Spaces," Proceedings of the IEEE 27th International Conference on Data Engineering Workshops (ICDEW), 2011.
[124] O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller, "Framework for Evaluating Clustering Algorithms in Duplicate Detection," Proceedings of the VLDB Endowment, vol. 2, 2009.
[125] S. E. Whang, D. Marmaros, and H. Garcia-Molina, "Pay-as-you-go Entity Resolution," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 5, 2013.
[126] A. A. Abraham and S. D. Kanmani, "A Survey on Various Methods used for Detecting Duplicates."
[127] J. J. Tamilselvi and C. B. Gifta, "Handling Duplicate Data in Data Warehouse for Data Mining," International Journal of Computer Applications, vol. 15, no. 4, 2011.
[128] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate Record Detection: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, 2007.
[129] P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, 2012.
[130] T. de Vries, H. Ke, S. Chawla, and P. Christen, "Robust Record Linkage Blocking using Suffix Arrays and Bloom Filters," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 2, 2011.
[131] O. Hassanzadeh and R. J. Miller, "Creating Probabilistic Databases from Duplicated Data," The VLDB Journal, vol. 18, 2009.
[132] A. Bronselaer and G. De Tré, "Aspects of Object Merging," Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), 2010.
[133] U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, "Adaptive Windows for Duplicate Detection," Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE), 2012.
[134] F. Naumann, A. Bilke, J. Bleiholder, and M. Weis, "Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies," IEEE Data Engineering Bulletin, vol. 29, no. 2, 2006.
[135] J. Bleiholder and F. Naumann, "Data Fusion," ACM Computing Surveys (CSUR), vol. 41, no. 1, 2009.
[136] L. Meng, A.-H. Tan, and D. Xu, "Semi-Supervised Heterogeneous Fusion for Multimedia Data Co-Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, 2014.