# I. Introduction

Hadoop is an open-source big data platform designed to process large volumes of data. The data is kept in the form of files in the Hadoop Distributed File System (HDFS). A map task is spawned on a Java virtual machine (JVM) instance for each file in HDFS; the file data is copied to a memory block and the block is passed to the map task. In addition, an object instance is created for each file in the Namenode of Hadoop to facilitate processing. When the file size is greater than or equal to the block size, maximum performance gain is achieved in terms of the number of maps spawned and the metadata storage overhead at the Namenode. In IoT applications, however, the data files are small (less than 2 KB), and storing these files in HDFS for processing degrades Hadoop performance [1][2]. On one hand, it drastically increases the storage overhead at the Namenode for object bookkeeping [3]. On the other hand, it exhausts computational resources by spawning many map tasks that each run only for a short duration to process a small file; the time spent bootstrapping a map task becomes higher than the data processing time for small files.

Various solutions have been proposed to address the Hadoop small file problem. The existing solutions can be categorized as: (i) file merging solutions, (ii) file caching solutions, (iii) Hadoop cluster structure optimizations and (iv) map task optimizations. In file merging solutions, small files are pre-treated to form a big file, and this big file is stored in HDFS. In file caching solutions, files are sent to a file queue, and when the queue size crosses a threshold the files are sent for processing in a systematic manner. In Hadoop cluster structure optimization solutions, a hierarchical memory structure combining cache and HDFS storage is created to reduce the overhead caused by a single Namenode. In map task optimization solutions, the number of JVM instances spawned for map tasks is reduced and the instances are shared.

This work presents a critical analysis of solutions in the above four categories of file merging, file caching, Hadoop cluster structure optimization and map task optimization. The effectiveness of each solution in terms of storage and computation is analyzed and its open issues are identified. Based on the open issues, a prospective solution framework is designed and detailed.

# II. Survey

Ahad et al [4] proposed a dynamic merging strategy based on file type for Hadoop. Dynamic variable-size partitioning is applied to blocks and the file contents are fitted to blocks using a next-fit allocation policy. In this way a large file is created and saved to HDFS. In addition, the authors secured the blocks using the Twofish cryptographic technique. The solution reduced Namenode memory, the number of data blocks and processing time. However, merging was done only based on file types, without considering the context and semantic relations of the files.

Siddiqui et al [5] proposed a cache-based block management technique for Hadoop as a replacement for default Hadoop Archives (HAR). A logical chain of small files is built and transferred to data blocks. In addition, efficient read/write on blocks is facilitated by a block manager. Though the solution achieved more than 92% space utilization of data blocks, small files are merged only based on size, without considering their semantic relations and content characteristics.
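To make the file merging category concrete before continuing the survey, the following is a minimal sketch of the merge-and-index idea shared by several of the surveyed works (e.g., [4], [12], [13], [19]): small files are appended into a block-sized container file and an offset index is recorded for later retrieval. The paths, index format and single-block limit are illustrative assumptions, not any particular author's implementation.

```java
// Minimal merge-and-index sketch (illustrative only): small files under
// /data/small are appended into a block-sized container file, and an index
// line "<name> <offset> <length>" is written for later retrieval.
import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileMerger {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("/data/small");                  // hypothetical input directory
    Path merged = new Path("/data/merged/part-0");      // hypothetical container file
    long blockSize = fs.getDefaultBlockSize(merged);

    try (FSDataOutputStream out = fs.create(merged);
         PrintWriter index = new PrintWriter(fs.create(new Path("/data/merged/part-0.idx")))) {
      long offset = 0;
      for (FileStatus st : fs.listStatus(in)) {
        if (offset + st.getLen() > blockSize) break;    // fill only one block for simplicity
        try (FSDataInputStream src = fs.open(st.getPath())) {
          IOUtils.copyBytes(src, out, conf, false);     // append file bytes to the container
        }
        index.printf("%s %d %d%n", st.getPath().getName(), offset, st.getLen());
        offset += st.getLen();
      }
    }
    fs.close();
  }
}
```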
Zhai et al [6] built an index-based archive file to solve the small file problem in Hadoop. The small files are merged into a large file and a metadata record is created to retrieve each file's content. Metadata records are arranged into buckets, and an order-preserving hash is created over them. The hash and the metadata records are in turn written to an index file, which is used to retrieve file contents for processing. This method saves at least 11% of disk space, but access efficiency drops as the number of small files grows large. The indexing also does not support streaming inputs.

Cai et al [7] proposed a file merging algorithm based on two factors: the distribution of the files and the correlation between files. Correlation between files is built from their access history, and highly correlated files are kept in the same block. Through experiments, the authors found that placing highly correlated files in the same block improved the speedup. However, the correlation is not based on content characteristics, so performance can degrade over time.

Choi et al [8] integrated CombineFileInputFormat and JVM reuse to solve the small file problem. Small files are combined up to the block size and passed to a map task. JVM instances are reused across map tasks, so the overhead of JVM bootstrap is minimized. Though the integration reduces computational overhead, the approach combines files in arrival order without considering their semantics. Also, memory buildup due to JVM reuse can crash tasks because of inefficient memory management.

Peng et al [9] combined merging and caching techniques to solve the small file problem. User-based collaborative filtering is applied to learn the correlation between files, and files with higher correlation are merged into a single large file. Remote procedure call (RPC) requests to fetch block information about the files are reduced by caching the access requests and looking into the cache for blocks before placing RPC requests. In this way, the authors reduced file access time by 50% and increased storage utilization by 25% compared to default Hadoop. The scheme does not work well for streaming data, as the proposed correlation model is not adaptive to streaming inputs.

Niazi et al [10] proposed a technique called inode stuffing to solve the small file problem. For small files, the metadata and data block are combined, and decoupling is maintained only for large files. The approach is not scalable, as it increases the metadata storage overhead at Namenodes.

Jing et al [11] proposed a dynamic queue method to solve the small file problem. The files are first classified using the period classification algorithm, which calculates a similarity score based on sentence similarity between two documents. The similar files are then merged into a large file using multiple queues for specific file sizes. The authors also used a file pre-fetching strategy to improve the efficiency of file access. However, analyzing similarity between all file pairs is cumbersome for a large number of files.

Sharma et al [12] proposed a dual merge technique called Hash Based-Extended Hadoop Archive to solve the small file problem in Hadoop. The small files are merged using two-level compaction, which reduces the storage overhead at the Namenode and increases data block space utilization at the Datanodes. File access is made efficient using a two-level hash function. The proposed solution is at least 13% faster than default Hadoop. The files were merged without considering their content characteristics and semantics.
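The combination used by Choi et al [8] can be approximated with stock Hadoop facilities, as in the hedged sketch below: CombineTextInputFormat packs many small files into input splits of roughly one block, and the classic MRv1 JVM-reuse property keeps task JVMs alive across tasks. The mapper logic, paths and split size are placeholders, and the JVM-reuse property applies only to classic MapReduce, not to YARN clusters.

```java
// Job configuration sketch: combine many small files into large splits and
// (on MRv1 clusters) reuse task JVMs. Paths and the pass-through mapper are
// illustrative placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
  public static class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(key, value);                             // placeholder per-record logic
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.job.reuse.jvm.num.tasks", "-1");    // MRv1 JVM reuse; ignored on YARN
    Job job = Job.getInstance(conf, "combine-small-files");
    job.setJarByClass(CombineSmallFilesJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at ~128 MB so one map task processes many small files.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.addInputPath(job, new Path("/data/small"));   // hypothetical paths
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```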
Wang et al [13] combined merging and caching to solve the small file problem in Hadoop. The authors proposed an equilibrium merger queue algorithm to merge small files up to the Hadoop block size, and the merged file is saved to HDFS. An index is built to access the small files. To reduce the communication overhead between the client and the Namenode for small file access, a pre-fetched cache is used, which reduces the number of RPC calls to the Namenode. Memory consumption at the Namenode is drastically reduced compared to default Hadoop Archives. However, contents were merged without considering their content characteristics and semantic correlation.

Ali et al [14] proposed an enhanced best-fit merging algorithm to merge small files based on type and size. Merging is done until the Hadoop block size is reached, and the merged file is saved to HDFS. The authors found that merging improved Hadoop storage utilization by 64%, but file access time was higher in this work.

Prasanna et al [15] compressed many small files into a zip file of the size of a Hadoop data block and saved it to disk. This increased the disk utilization of Datanodes and Namenodes, but the computational overhead of compressing at storage time and decompressing during processing is higher.

Huang et al [16] addressed the small file problem for the case of images in Hadoop. A two-level model was proposed specific to medical images: the images were grouped first by series and then by examination. The grouped images are saved to data blocks in HDFS. Indexing and pre-fetching are used to reduce the access time for small image files, but the pre-fetching algorithm did not achieve a high cache hit ratio.

Renner et al [17] extended the Hadoop archive to an appendable file format to solve the small file problem. Small files are appended to existing archive data files whose block size is not completely used, with a first-fit algorithm used to select the data blocks. In addition, indexing based on a red-black tree structure is used for efficient lookup. Though this scheme improved data block utilization, appending is done without considering content characteristics and semantic similarity.

Liu et al [18] proposed a file merging strategy based on content similarity. Files are converted to vector space features and the correlation between features is measured using cosine similarity; when the cosine similarity is greater than a threshold, the files are merged. The authors also used pre-fetching and caching to speed up file access. Constructing a global feature space for streaming data is difficult, so this approach is not suitable for streaming data.

Lyu et al [19] proposed an optimized merging strategy in which small files are merged based on size such that the block size is fully utilized. The authors also used pre-fetching and caching to increase access speed. Block size utilization was the only criterion for merging, without considering content characteristics and semantic relations. Similarly, Mu et al [20] proposed an optimization strategy that maximally fills existing Hadoop archives by appending small files, along with a secondary index to speed up file access. Here too, merging was done without considering content characteristics and semantic relations.
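The content-similarity criterion used by merging strategies such as Liu et al [18] reduces to comparing term vectors of candidate files, as in the self-contained sketch below. The tokenization, the raw term-frequency weighting and the 0.5 threshold are assumptions for illustration rather than the authors' exact model.

```java
// Sketch of a content-similarity merge test: files are represented as
// term-frequency vectors, and two files are considered mergeable when their
// cosine similarity exceeds a threshold.
import java.util.HashMap;
import java.util.Map;

public class ContentSimilarity {
  // Build a raw term-frequency map from a file's text content.
  static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
    }
    return tf;
  }

  // Cosine similarity between two sparse term-frequency vectors.
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      normA += (double) e.getValue() * e.getValue();
      Integer other = b.get(e.getKey());
      if (other != null) dot += (double) e.getValue() * other;
    }
    for (int v : b.values()) normB += (double) v * v;
    return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Merge decision: true when the two file contents are similar enough.
  static boolean shouldMerge(String fileA, String fileB, double threshold) {
    return cosine(termFrequencies(fileA), termFrequencies(fileB)) >= threshold;
  }

  public static void main(String[] args) {
    System.out.println(shouldMerge("sensor temperature reading 21C",
                                   "temperature sensor reading 22C", 0.5));
  }
}
```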
Wang et al [21] used probabilistic latent semantic analysis to determine the user access pattern, and based on it small files are merged into a large file and placed in HDFS. The authors also improved the pre-fetching hit ratio based on user access transition patterns. Both strategies improved access speed and data block utilization. However, the scheme is not suitable for multi-user environments, as a merging order must be kept for each user, which increases the storage overhead.

He et al [22] merged small files based on the balance of data blocks, with the aim of increasing data block utilization. Merging did not consider content characteristics and semantic relations.

Fu et al [23] proposed a flat storage architecture to handle small files. In this scheme, files and metadata are collocated, with the metadata size fixed for any number of small files; this is facilitated by metadata holding only a pointer to the related information in its index. The scheme is not well suited for Hadoop, as collocation causes higher access overhead for large files.

Tao et al [24] merged small files into a large file and built a linear hash over the small files to speed up access. File size was the only criterion considered for merging.

Bok et al [25] integrated file merging and caching to solve the small file problem. The authors used two levels of cache for small files, so that access requests to the Namenode are reduced. The merging was based only on size, without considering content characteristics and semantic similarity.

The surveyed solutions and their gaps are summarized in Table 1.

Table 1: Critical analysis of solutions to the Hadoop small file problem

| Work | Solution for small file problem | Gap |
| --- | --- | --- |
| Ahad et al [4] | Dynamic merging strategy based on file type | Merging based only on file type, without considering context and semantic relations |
| Siddiqui et al [5] | Cache-based block management technique | Small files merged only based on size, without semantic relations and content characteristics |
| Zhai et al [6] | Index-based archive file with order-preserving hash for speedup | Does not support streaming |
| Cai et al [7] | File merging based on file distribution and file correlation | Correlation is not based on content characteristics |
| Choi et al [8] | Integrated CombineFileInputFormat and JVM reuse | Memory buildup due to JVM reuse can crash tasks due to inefficient memory management |
| Peng et al [9] | Combined merging and caching | Does not work well for streaming data; correlation model is not adaptive to streaming data |
| Niazi et al [10] | Coupling metadata and small file together (inode stuffing) | Not scalable; increases metadata storage overhead at Namenodes |
| Jing et al [11] | Files classified using the period classification algorithm and merged based on similarity | Analyzing similarity between pairs is cumbersome for large numbers of files |
| Sharma et al [12] | Hash Based-Extended Hadoop Archive | Files merged without considering content characteristics and semantics |
| Wang et al [13] | Combined merging and caching | Contents merged without considering content characteristics and semantic correlation |
| Ali et al [14] | Enhanced best-fit merging based on type and size | File access time was higher |
| Huang et al [16] | Two-level model specific to medical images | Pre-fetching algorithm did not achieve a high cache hit ratio |
| Renner et al [17] | Small files appended to existing archive data files | Appending done without considering content characteristics and semantic similarity |
| Liu et al [18] | File-content-based merging | Constructing a global feature space for streaming data is difficult; not suitable for streaming data |
| Lyu et al [19] | Optimized merging strategy | Block size utilization was the only merging criterion, without content characteristics and semantic relations |
| Wang et al [21] | Probabilistic latent semantic analysis of user access patterns to guide merging | Not suitable for multi-user environments; a per-user merging order increases storage overhead |
| He et al [22] | Merging based on the balance of data blocks | Merging did not consider content characteristics and semantic relations |
| Fu et al [23] | Flat storage architecture collocating metadata and file in the same object | Not suited for Hadoop; collocation causes higher access overhead for large files |
| Tao et al [24] | Small files merged into a large file with a linear hash for fast access | File size was the only merging criterion |
| Bok et al [25] | Integrated file merging and caching | Merging based only on size, without content characteristics and semantic similarity |
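Many of the works in Table 1 ([6], [12], [13], [20], [24]) pair merging with an index for random access to individual small files. The sketch below shows the retrieval side for the illustrative index layout assumed in the earlier merge example: look up the (offset, length) entry for a file name and issue a positioned read into the container file. Paths and index format are the same hypothetical assumptions introduced earlier, not any surveyed system's API.

```java
// Retrieval counterpart to the merge-and-index sketch: find the index entry
// "<name> <offset> <length>" and read the small file back with a positioned read.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileReader {
  public static byte[] readSmallFile(FileSystem fs, String name) throws Exception {
    Path container = new Path("/data/merged/part-0");      // hypothetical container
    Path indexPath = new Path("/data/merged/part-0.idx");  // hypothetical index
    try (BufferedReader idx = new BufferedReader(new InputStreamReader(fs.open(indexPath)))) {
      String line;
      while ((line = idx.readLine()) != null) {
        String[] parts = line.split(" ");
        if (parts[0].equals(name)) {
          long offset = Long.parseLong(parts[1]);
          int length = Integer.parseInt(parts[2]);
          byte[] buf = new byte[length];
          try (FSDataInputStream in = fs.open(container)) {
            in.readFully(offset, buf);                     // positioned read at recorded offset
          }
          return buf;
        }
      }
    }
    return null;                                           // not found in index
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] data = readSmallFile(fs, "sensor-001.json");    // hypothetical file name
    System.out.println(data == null ? "not found" : data.length + " bytes");
  }
}
```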
# III. Open Issues

Context-Specific Merging: Across the surveyed works, small files are merged mainly on the basis of size or file type. Content characteristics, application context and the semantic relations between files are largely ignored, so related files are not guaranteed to be placed together.

Personalized Access: Caching on a global context can provide better performance for some users and worse performance for others. To resolve this access time discrepancy among users, a personalized caching strategy must be employed.

Streaming Support: Most of the merging schemes do not handle streaming data effectively. Content similarity for streaming data cannot be computed effectively using vector space modeling, so merging can become ineffective. Merging based on streaming arrival patterns has not been considered in earlier works.

# IV. Research Directions

Based on the open issues identified, a prospective framework for further research is presented in Figure 1. The framework addresses three problem areas: context-specific merging, personalized access and streaming support.

Figure 1: Research direction framework

Context-Specific Merging: This can be facilitated and made adaptive using machine learning. Based on application contexts and inherent data characteristics, the files to be merged can be identified. Both blocks and small files can be categorized by context, and context-based merging then combines files and blocks based on context similarity. Instead of a flat context, a hierarchical context can be learnt automatically from file summarization. File summarization strategies specific to file types can be proposed to identify the context to be associated with files and blocks.

Personalized Access: Users can be clustered based on their content access patterns over a temporal duration, and a separate cache can be maintained for each user group. Cache item management can be based on multi-criteria optimization instead of LRU mechanisms, and the items to pre-fetch can be identified based on the context associated with files. In this way, access speedup can be increased and optimized for each user group; a minimal sketch of per-group caching is given after this section.

Streaming Support: To support streaming data, the context must be learnt dynamically in a lightweight manner, and the association of small files to blocks must be done based on that context. To learn context in a lightweight manner, the streaming data characteristics and their arrival patterns must be used.
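As a starting point for the personalized access direction, the sketch below maintains one small cache per user group rather than a single global cache. The group assignment, the capacity and the plain LRU eviction are placeholder choices; the direction above argues for clustering users from real access histories and for richer, multi-criteria cache management.

```java
// Minimal per-user-group cache sketch: each group gets its own small LRU cache
// of (file name -> file bytes). Group assignment is a placeholder.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PersonalizedBlockCache {
  // LRU cache built on an access-ordered LinkedHashMap.
  static class LruCache extends LinkedHashMap<String, byte[]> {
    private final int capacity;
    LruCache(int capacity) { super(16, 0.75f, true); this.capacity = capacity; }
    @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
      return size() > capacity;
    }
  }

  private final Map<String, LruCache> cachePerGroup = new ConcurrentHashMap<>();
  private final int entriesPerGroup;

  public PersonalizedBlockCache(int entriesPerGroup) { this.entriesPerGroup = entriesPerGroup; }

  // Placeholder grouping: in practice this would come from clustering access patterns.
  private String groupOf(String userId) { return "group-" + (userId.hashCode() & 3); }

  public byte[] get(String userId, String fileName) {
    LruCache cache = cachePerGroup.computeIfAbsent(groupOf(userId),
        g -> new LruCache(entriesPerGroup));
    synchronized (cache) { return cache.get(fileName); }
  }

  public void put(String userId, String fileName, byte[] data) {
    LruCache cache = cachePerGroup.computeIfAbsent(groupOf(userId),
        g -> new LruCache(entriesPerGroup));
    synchronized (cache) { cache.put(fileName, data); }
  }
}
```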
# V. Conclusion

This survey made a critical analysis of existing solutions to the small file problem in Hadoop. The solutions were analyzed in four categories: file merging solutions, file caching solutions, Hadoop cluster structure optimizations and map task optimizations. Based on the survey, three open issues of context-specific merging, personalized access and streaming support were identified. Prospective solutions to these three open issues were outlined and a roadmap for further exploration in this area was documented.

# References

[1] Small size problem in Hadoop.
[2] Solving Small size problem in Hadoop.
[3] B. Dong, Q. Zheng, F. Tian, K.-M. Chao, R. Ma and R. Anane, "An optimized approach for storing and accessing small files on cloud storage," Journal of Network and Computer Applications, vol. 35, Elsevier, 2012.
[4] M. Ahad and R. Biswas, "Dynamic Merging based Small File Storage (DM-SFS) Architecture for Efficiently Storing Small Size Files in Hadoop," Procedia Computer Science, vol. 132, pp. 1626-1635, 2018. doi: 10.1016/j.procs.2018.05.128.
[5] I. Siddiqui, N. M. F. Qureshi, B. Chowdhry and M. Uqaili, "Pseudo-Cache-Based IoT Small Files Management Framework in HDFS Cluster," Wireless Personal Communications, vol. 113, 2020. doi: 10.1007/s11277-020-07312-3.
[6] Y. Zhai, J. Tchaye-Kondi, K.-J. Lin, L. Zhu, W. Tao, X. Du and M. Guizani, "Hadoop Perfect File: A fast and memory-efficient metadata access archive file to face small files problem in HDFS," Journal of Parallel and Distributed Computing, vol. 156, 2021. doi: 10.1016/j.jpdc.2021.05.011.
[7] X. Cai, C. Cai and Y. Liang, "An optimization strategy of massive small files storage based on HDFS," 2018. doi: 10.2991/jiaet-18.2018.40.
[8] C. Choi, C. Choi and J. Choi, "Improved performance optimization for massive small files in cloud computing environment," Annals of Operations Research, vol. 265, 2018.
[9] J.-F. Peng, W.-G. Wei, H.-M. Zhao, Q.-Y. Dai, G.-Y. Xie, J. Cai and K. He, "Hadoop Massive Small File Merging Technology Based on Visiting Hot-Spot and Associated File Optimization," 9th International Conference Proceedings, Xi'an, China, July 7-8, 2018. doi: 10.1007/978-3-030-00563-4_50.
[10] S. Niazi, M. Ronström, S. Haridi and J. Dowling, "Size Matters: Improving the Performance of Small Files in Hadoop," presented at Middleware'18, Rennes, France, ACM, 2018.
[11] W. Jing, D. Tong, G. Chen, C. Zhao and L. Zhu, "An optimized method of HDFS for massive small files storage," Computer Science and Information Systems, vol. 15, 2018. doi: 10.2298/CSIS171015021J.
[12] V. S. Sharma, A. Afthanorhan, N. C. Barwar, S. Singh and H. Malik, "A Dynamic Repository Approach for Small File Management With Fast Access Time on Hadoop Cluster: Hash Based Extended Hadoop Archive," IEEE Access, vol. 10, 2022.
[13] K. Wang, Y. Yang, X. Qiu and Z. Gao, "MOSM: An approach for efficient storing massive small files on Hadoop," 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), Beijing, China, 2017.
[14] A. Ali, N. M. Mirza and M. K. Ishak, "Enhanced best fit algorithm for merging small files," Computer Systems Science and Engineering, vol. 46, no. 1, 2023.
[15] L. Prasanna Kumar, "Optimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System," International Journal on Recent and Innovation Trends in Computing and Communication, vol. 4, no. 2, Feb. 2016.
[16] X. Huang, W. Yi, J. Wang and Z. Xu, "Hadoop-Based Medical Image Storage and Access Method for Examination Series," Mathematical Problems in Engineering, vol. 2021, Article ID 5525009, 10 pages, 2021.
[17] T. Renner, J. Müller, L. Thamsen and O. Kao, "Addressing Hadoop's Small File Problem With an Appendable Archive File Format," Proceedings of the Computing Frontiers Conference (CF'17), ACM, New York, NY, USA, 2017.
[18] J. Liu, "Storage-Optimization Method for Massive Small Files of Agricultural Resources Based on Hadoop," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 23, pp. 634-640, 2019. doi: 10.20965/jaciii.2019.p0634.
[19] Y. Lyu, X. Fan and K. Liu, "An optimized strategy for small files storing and accessing in HDFS," Proc. IEEE Int. Conf. CSE / IEEE Int. Conf. EUC, Jul. 2017.
[20] Q. Mu, Y. Jia and B. Luo, "The optimization scheme research of small files storage based on HDFS," Proc. 8th Int. Symp. Comput. Intell. Design, Dec. 2015.
[21] T. Wang, S. Yao, Z. Xu, L. Xiong, X. Gu and X. Yang, "An effective strategy for improving small file problem in distributed file system," Proc. 2nd Int. Conf., Apr. 2015.
[22] H. He, Z. Du, W. Zhang and A. Chen, "Optimization strategy of Hadoop small file storage for big data in healthcare," The Journal of Supercomputing, vol. 72, no. 10, Aug. 2016.
[23] S. Fu, L. He, C. Huang, X. Liao and K. Li, "Performance optimization for managing massive numbers of small files in distributed file systems," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 12, Dec. 2015.
[24] W. Tao, Y. Zhai and J. Tchaye-Kondi, "LHF: A new archive based approach to accelerate massive small files access performance in HDFS," Proc. 5th IEEE Int. Conf. Big Data Service Appl., Apr. 2019.
[25] K. Bok, H. Oh, J. Lim, Y. Pae, H. Choi, B. Lee and J. Yoo, "An efficient distributed caching for accessing small files in HDFS," Cluster Computing, vol. 20, no. 4, Dec. 2017.