Vision-Based Deep Web Data Extraction For Web Document Clustering

Authors

  • Dr. M. Lavanya

  • Dr. M. Usha Rani

Keywords:

Noise Chunk, cosine similarity, Title word Relevancy, Keyword frequency-based chunk selection, Fuzzy c-means clustering (FCM)

Abstract

The design of web information extraction systems has become increasingly complex and time-consuming. Detecting the data region is a significant problem in extracting information from a web page. This paper proposes a vision-based approach to deep web data extraction for web document clustering. The approach comprises two phases: 1) vision-based web data extraction, and 2) web document clustering. In phase 1, the web page is segmented into chunks, from which noise chunks and duplicate chunks are removed using three parameters: hyperlink percentage, noise score, and cosine similarity. Finally, the extracted keywords are used for web document clustering with Fuzzy c-means clustering (FCM).
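
The chunk-filtering step of phase 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the chunk representation (a dict with hypothetical `text` and `link_text` fields), the definition of hyperlink percentage as the share of words inside links, and both threshold values are assumptions made for illustration. The noise score and the FCM clustering phase are not covered here.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for a chunk of text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hyperlink_percentage(chunk):
    """Share of a chunk's words that sit inside hyperlinks (assumed definition)."""
    total = len(chunk["text"].split())
    linked = len(chunk["link_text"].split())
    # An empty chunk carries no content, so treat it as pure noise.
    return linked / total if total else 1.0


def filter_chunks(chunks, max_link_ratio=0.5, max_similarity=0.9):
    """Drop link-heavy chunks, then drop near-duplicates of chunks already kept.

    Both thresholds are illustrative values, not taken from the paper.
    """
    kept = []
    for chunk in chunks:
        if hyperlink_percentage(chunk) > max_link_ratio:
            continue  # mostly navigation links: treated as a noise chunk
        vec = term_vector(chunk["text"])
        if any(cosine_similarity(vec, term_vector(k["text"])) >= max_similarity
               for k in kept):
            continue  # near-duplicate of a chunk that was already kept
        kept.append(chunk)
    return kept
```

Given a navigation bar, two identical content chunks, and one distinct content chunk, `filter_chunks` would keep only the two distinct content chunks. Keywords extracted from the surviving chunks would then feed the clustering phase.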

How to Cite

Dr. M. Lavanya, & Dr. M. Usha Rani. (2012). Vision-Based Deep Web Data Extraction For Web Document Clustering. Global Journal of Computer Science and Technology, 12(5), 13–21. Retrieved from https://computerresearch.org/index.php/computer/article/view/464

Published

2012-03-15