1.

Abstract: Web Usage Mining deals with understanding user behavior during interaction with a website by analyzing various log files. The process of Web Usage Mining is completed in three phases: Data Preprocessing, Pattern Discovery, and Pattern Analysis. Data Preprocessing is important because it takes 80% of the time of the whole Web Usage Mining process. It involves Data Cleaning, User Identification, and Session Identification.

In Session Identification, also called Sessionization, we find the set of pages visited by a user during one particular visit to a website.

In paper [1], we proposed a new method for session construction. Because log files are very large, an approach to Session Identification is required that greatly reduces the processing time of our proposed method.

In this paper, we use the Map-reduce method to compute sessions, combining the time-based and user-navigation-based methods. This approach is faster than the existing approach because the whole process is performed in a distributed environment.

Keywords: web mining, web server logs, web usage mining (WUM), map reduce, session identification.

2. I. Introduction

Web Usage Mining deals with observing user behavior during interaction with a website by analyzing various log files to extract knowledge from them. This knowledge can be applied to reorganize the website's content, providing personalization and recommendations that are more effective than before through improved links and navigation, which in turn increases the advertising rate. As a result, users can access the website more comfortably, which generates more revenue for the website [2]. The process comprises three steps: data preprocessing, data mining, and pattern analysis. Data preprocessing itself contains three steps: data cleaning, user identification, and session identification. Session identification is a crucial step in the data preprocessing of web log mining. A session is defined as the multiple requests made by a user during a single navigation. A user may have one or multiple sessions during a particular period. Sessions are identified either by a time-based method or by a navigation-based method.

Author: Department of Computer Science, SGSITS Indore (M.P.), India. e-mail: [email protected]
Author: Department of Information Technology, SGSITS Indore (M.P.), India. e-mail: [email protected]

Here, we propose a unique approach to user session identification that blends the time-based method with the navigation-based method to get better results.

To speed up Sessionization, the process is performed on distributed systems using Map-reduce. Map-reduce [3] is a programming model and an associated implementation for processing and generating large data sets that supports fault tolerance, automatic parallelization, scalability, and data-locality-based optimizations. Users define a Map function that processes an input key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key.

3. II. Motivation

Map-Reduce is a programming model and an associated implementation for processing and generating large data sets. The computation takes a set of input key/value pairs and generates a list of output key/value pairs. The user of the Map-Reduce library expresses this computation as two functions: map and reduce.

The Map function takes an input pair and generates a list of intermediate key/value pairs. The Map-Reduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.

The Reduce function accepts an intermediate key and the set of values grouped under that key by the library, and merges them to produce a possibly smaller set of values, typically zero or one output value. The intermediate values are fed to the Reduce function through an iterator. This allows the user to handle lists of values that are too large to fit in memory.
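The map, group, and reduce flow described above can be illustrated with a minimal single-process sketch. It is not the distributed implementation of [3] and not the authors' code: the helper `run_mapreduce`, the functions `count_map` and `count_reduce`, and the simplified "user url" log line format are all assumptions made for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> group-by-key -> reduce in one process (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: every input pair may emit any number of intermediate pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: each key's values are handed to reduce_fn through an iterator.
    results = []
    for out_key in sorted(groups):
        results.extend(reduce_fn(out_key, iter(groups[out_key])))
    return results


def count_map(line_no, line):
    # Hypothetical log line format: "<user> <url>".
    user, _url = line.split()
    yield user, 1


def count_reduce(user, counts):
    # Merge all values for one key into a single output value.
    yield user, sum(counts)


log = [(0, "u1 /home"), (1, "u2 /home"), (2, "u1 /about")]
page_counts = run_mapreduce(log, count_map, count_reduce)
print(page_counts)
```

In a real cluster the grouping step is performed by the framework across machines; here a dictionary stands in for it so that the roles of the two user-defined functions stay visible.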

4. III. Proposed Approach

To enhance the performance of the method proposed in [1], we use the Map-Reduce method to lower the session generation time.

We applied Map-Reduce to the time-based method, the maximal forward sequence method, and our proposed method [1]. The results show that this approach substantially reduces the session generation time, because the process is sped up by the Map and Reduce functions.
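As a rough illustration of how time-based sessionization can be phrased as a map step and a reduce step, consider the sketch below. It covers only the time-based method, not the combined method of [1], whose details are not given here. The 30-minute `SESSION_TIMEOUT`, the simplified "user timestamp url" log format, and the names `parse_map`, `session_reduce`, and `sessionize` are assumptions for illustration, not values or code from the paper.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; the paper does not state the value it uses.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_map(line):
    # Hypothetical simplified log line: "<user_id> <ISO timestamp> <url>".
    user, stamp, url = line.split()
    yield user, (datetime.fromisoformat(stamp), url)


def session_reduce(user, hits):
    # Sort one user's requests by time and start a new session after a long gap.
    sessions = []
    last_time = None
    for when, url in sorted(hits):
        if last_time is None or when - last_time > SESSION_TIMEOUT:
            sessions.append([])
        sessions[-1].append(url)
        last_time = when
    for pages in sessions:
        yield user, pages


def sessionize(lines):
    # Stand-in for the framework: group mapped hits by user, then reduce each group.
    groups = defaultdict(list)
    for line in lines:
        for user, hit in parse_map(line):
            groups[user].append(hit)
    output = []
    for user in sorted(groups):
        output.extend(session_reduce(user, groups[user]))
    return output


sample_log = [
    "u1 2013-12-08T10:00:00 /home",
    "u2 2013-12-08T10:05:00 /home",
    "u1 2013-12-08T10:10:00 /products",
    "u1 2013-12-08T11:30:00 /home",
]
sessions = sessionize(sample_log)
print(sessions)
```

Keying the intermediate pairs by user means each user's requests reach a single reduce call, so different users can be sessionized on different machines independently.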

5. V. Conclusion

The information available on the web is increasing rapidly day by day, giving users a large amount of freely accessible data. Our method generates sessions in less time than the existing method. The experiment on 1 GB, 2 GB, and 4 GB of data shows that the new method proposed in [1] generates more sessions (3102) than the traditional Time Based Method (2875) and the Maximal Forward Sequence Method (2742). As the results in Table-1 show, the proposed approach takes less time to complete because of the Map-Reduce method.

Web usage Mining: Web user Session Construction using Map-Reduce, by Neha Sharma & Pawan Makhija.

Figure-1 shows the graphical representation of Table-1, comparing the time required to complete the process by the existing method and by the proposed method.

Appendix A

Appendix A.1

The experiment was performed on the log data of www.smartsync.com from 8 Dec 2013.

Appendix A.2 IV. Testing and Results

The input data for our proposed method are the access log files of the www.smartsync.com web server. Because the log files are large, we took the log dataset of only one day (8 Dec 2013), in sizes of 1 GB, 2 GB, and 4 GB. Table-1 shows the time required to complete the process on a single system and on multiple systems (ET = Execution Time).

Appendix B

A Novel Technique for Sessions Identification in Web Usage Mining Preprocessing. V. Chitraa, Dr. Antony Selvadoss Thanamani. International Journal of Computer Applications, 34(9), November 2011.
A session identification algorithm based on frame page and page threshold. Fang Yuankang, Huang Zhiqui. Computer Science and Information Technology (ICCSIT), 3rd IEEE International Conference, 2010.
Optimum algorithm for generation of user session sequences using server side web user logs. G Arumugam, S Sugana. IEEE, 2009.
Dynamic Timeout-Based A Session Identification Algorithm. He Xinhua, Wang Qiong. IEEE, 2011.
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean, Sanjay Ghemawat. OSDI, 2004.
Cut-off Time Calculation for User Session Identification by Reference Length. Jozef Kapusta, Michal Munk, Martin Drlík. IEEE, 2012.
Web Usage Mining: A novel approach for web user session construction. Neha Sharma, Pawan Makhija. GJCST, 15, 2015.
Web Usage Mining: A Review on Process. Nirali N Madhak, Trupti M Kodinariya, Jayesh N Rathod. IEEE, 2013.
Web user session reconstruction using integer programming. R F Dell. International Conference on Web Intelligence and Intelligent Agent Technology, 2008.
Web mining: Information and Pattern Discovery on the World Wide Web. Robert Cooley, Bamshad Mobasher, Jaideep Srivastava. International Conference on Tools with Artificial Intelligence (Newport Beach), 1997. IEEE.
Linear Time Algorithms for Finding Maximal Forward References. Zhixiang Chen, Richard H Fowler, Ada Wai-Chee Fu. Proc. of the Intl Conf on Info Tech: Coding and Computing (ITCC03), 2003. IEEE.
