Abstract

Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of text data makes it difficult to focus on the most relevant information. Feature extraction is one of the key data reduction techniques for discovering the most important features. Processing massive amounts of data stored in unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from such data. This survey covers text summarization, classification, and clustering methods for discovering useful features, as well as the discovery of query facets: multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with a collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data so as to deliver fine-grained knowledge to the user.

How to Cite
Ramya R S, Venugopal K R, Iyengar S S, Patnaik L M. Feature Extraction and Duplicate Detection for Text Mining: A Survey. Global Journal of Computer Science and Technology, [S.l.], Jan. 2017. ISSN 0975-4172. Available at: <https://computerresearch.org/index.php/computer/article/view/1459>. Date accessed: 15 Aug. 2022.