# I. # Introduction

The developing phenomenon called big data is driving significant change in businesses and other organizations. Many struggle simply to manage the massive data sets and non-traditional data structures that are typical of big data. Big data management rests on two concepts, big data and data management, and on how the two work together to accomplish business and technology objectives. According to Ray (2018), big data refers to a large volume of diverse, complex and fast-changing data derived from new data sources. The data sets are so large that they are very difficult to manage with traditional data processing software or traditional data management tools (Manyika et al., 2011; Gürsakal, 2014). Big data is first about data volume, namely large data sets measured in tens of terabytes, sometimes in hundreds of terabytes or petabytes. Big data is also so large and complex that traditional systems and traditional data warehousing tools cannot process it. Before the term big data became common parlance, we spoke of Very Large Databases (VLDBs). VLDBs usually contain exclusively structured data, managed in a database management system (DBMS). Beyond very large data sets, big data can also be a mixed blend of structured data (relational data), unstructured data (human language text), semi-structured data (RFID, XML), and streaming data (from machines, sensors, web applications, and social media). The term multi-structured data refers to data sets or data environments that include a mix of these data types and structures (Gantz and Reinsel, 2011).

With the growth in the use of big data in business, many organizations are grappling with privacy issues. Data privacy is a liability, so organizations must stay alert to security. Privacy is the claim of individuals, groups, or organizations to determine for themselves when, how, and to what extent information about them is shared with others. In contrast to security, privacy should be regarded as an asset; it therefore becomes a selling point for both customers and other stakeholders. There ought to be a balance between data privacy and national security.

# II. # Related Work

Lu et al. (2014) proposed an approach towards efficient and privacy-preserving computing in the big data era, exploring the new challenges that big data poses to privacy preservation. The work first defines the general architecture of big data analytics and identifies the privacy requirements in big data, and then presents an efficient and privacy-preserving cosine similarity computing protocol. The limitation of the work is that significant research effort is still needed to address the privacy issues specific to particular big data analytics.

Xu et al. (2016) designed a framework named "Rampart" for privacy preservation. It comprises techniques of anonymization, reconstruction, transformation, provenance, comprehension, transaction and restriction to prevent external intrusion. The framework attempted to give high priority to maintaining the balance between data utility and privacy, but suggested that further approaches should be explored to guard privacy against other threats.
Shrivastva et al. (2014) analysed how well the differential privacy approach suits big data privacy preservation and presented the various factors that play a key role in preserving privacy in big data. Among the different approaches, differential privacy is the best suited for big data because it is free of the defects of the other approaches; moreover, differential privacy seeks a balance between utility and privacy. A perturbation framework is introduced to achieve differential privacy.

Al-Aqeeli and Alinfie (2015) investigated privacy preservation issues of big data in the context of hybrid cloud computing and evaluated several frameworks, such as Airavat, Sedic, SacFRAPP and Hyper-1, based on MapReduce from the perspectives of scalability, cost and compatibility. It was noted that anonymization, encryption and differential privacy are the effective techniques for protecting data privacy. The final analysis shows that the highlighted frameworks suffer from limitations such as data distortion, and that none of them is fully capable of privacy preservation.

Mehmood et al. (2016) surveyed existing privacy-preserving mechanisms across the different phases of the big data life cycle, such as data generation (encryption and access restrictions), data storage (hybrid and private clouds) and data processing (generalization, suppression, anatomization, permutation and perturbation), along with the various challenges of preserving privacy in big data. These techniques were characterised with respect to scalability, security, time, efficiency and utility. The various threats involved in the encryption, anonymization and storage of data in the cloud were also investigated. When these techniques are applied, privacy is protected, but the data may lose its real-world meaning and hence its utility and significance. For data publishing, an algorithm must consider a proper trade-off between utility and privacy, since the data is prone to attacks. The techniques must therefore be adapted or extended to handle big data in an efficient way.

Yan et al. (2016) proposed a practical scheme to manage encrypted big data in the cloud with deduplication based on ownership challenge and Proxy Re-Encryption (PRE). As noted by Jian et al. (2016), the limitation of their work is that Convergent Encryption (CE) is subject to an inherent security restriction, namely vulnerability to offline brute-force dictionary attack.

Sedayao et al. (2014) presented a case study of anonymization in an enterprise, identifying the requirements and implementation details for preserving the privacy of big data. Anonymized data sets must be carefully analysed, measured and tested for vulnerability to attacks, since anonymization is more than masking or generalization. The authors recommended the use of Hadoop to analyse and obtain useful results from big data. The experiments were conducted with static data sets, but should be extended to real-time data sets. The work could not conclude that the anonymized data is completely free from any kind of attack.

Zakerdah and Aggarwal (2015) proposed an approach towards privacy-preserving data mining of very large data sets using MapReduce.
They studied the two most widely used privacy models for anonymization, k-anonymity and l-diversity, and presented experimental results illustrating the effectiveness of the approach. The limitation of their work is that generalization cannot handle high-dimensional data and reduces data utility, and perturbation likewise reduces the utility of the data.

Zhang et al. (2013) proposed CloudSafe to enhance the availability and confidentiality of data stored in the cloud by encrypting and encoding the data across several storage providers. CloudSafe offers a cloud-based personal electronic asset safe service which distributes the critical assets among several cloud providers using erasure coding and cryptography. According to Zhang et al. (2013), availability improves because erasure coding disperses the data over several cloud providers, so that data access can be recovered when a provider fails; AES was used for encrypting and decrypting data to keep it confidential.

Zhang et al. (2014) investigated the scalability problem of multidimensional anonymization over big data on the cloud and proposed a scalable MapReduce-based approach. The scalability issue of finding the median, which is central to multidimensional partitioning, was investigated, and a highly scalable MapReduce-based algorithm using a histogram technique was proposed for finding the median. A number of experiments were conducted on data sets extracted from real data sets, and the results show that the scalability and cost-effectiveness of the multidimensional anonymization scheme can be improved significantly over existing techniques; however, guaranteeing privacy preservation of large-scale data sets still requires extensive study.

Pramanik et al. (2016) presented a conceptual framework that integrates and improves technologies for preserving big data privacy. The proposed model supports the construction of a dependable privacy framework for a given e-government process and comprises three major modules: a) big data collection, b) information extraction, and c) anonymization. In this work, a Conditional Random Field (CRF) classifier was deployed for extracting identifying attributes, and a k-anonymization technique for de-identifying the extracted data through minimal generalization and suppression. The authors also presented a set of preliminary experimental results showing the effectiveness of the proposed framework based on several privacy evaluation metrics.

# III. # Design Methodology

The architecture, named Big Data-ARpM (Big Data Access Restriction and Privacy Mechanism), is defined by the collection of data arriving with high velocity, volume and variety, the classification of the gathered data, secure storage of the data, and restriction of access to the data from within and outside the system. Figure 1a is a physical architecture that gives an insight into the operational structure of Big Data-ARpM, Figure 1b shows the internal structure of the Access Restriction and Key Management Module, and Figure 1c shows the internal structure of the Request Management Module. The architecture is designed to run on a distributed server environment and to store and retrieve data from a parallel database system, because of the high velocity, volume and variety of the data.
Big Data-ARpM retrieves input from multiple synchronous data sources. These inputs are raw and need to be pre-processed and classified before being stored securely to await a request for delivery; because the data may contain sensitive information about many entities, people and organizations, releasing the data without anonymizing it could be disastrous. Big Data-ARpM has a well-structured set of internal components that facilitate all of these processes. The components and their respective functions are described below.

# b) Data Pre-processor Module

Data pre-processing is an important step in data gathering. Data gathering is often loosely controlled, resulting in out-of-range values (e.g. Age: -100) and impossible data combinations (e.g. Sex: Male, Pregnant: Yes). Data gathered from the source (a web crawler) is considered noisy; the Data Pre-processor Module therefore contains a data cleaning component that checks each field of the data for conformity. The output of the pre-processor is processed, filtered data.

# Data Classification Module

This module deals with the classification of data according to its sensitivity. The role assigned to a user determines which class of data that user can access. There are three basic levels of classification in this module:

* Normal Level: Users assigned to this level can only view attributes such as the quasi-identifier (QID). A QID is a set of attributes, such as zip code, gender and birth date, whose combination could potentially distinguish individuals. This level is the least sensitive of the three levels.

# Data Preservation Module

The module consists of two sub-modules whose goal is to preserve the data before release to any user or third-party application, so as to prevent violation of the data owner's privacy. The data first passes through a sub-module that builds an aggregation tree with a single sink from the data arriving from the various entry sources; this reduces the chances of tracing the data back to its original owner. Prim's algorithm was employed to build the tree. The aggregated data is then passed on to the differential privacy sub-module, which introduces a minimal distortion into the information provided by the database system.

# ALGORITHM 3: Differential Privacy Algorithm

Input: Level, dp Request
Output: DP_response
Begin
Step 1: The analyst makes a query to the database through the intermediary privacy guard.
Step 2: The privacy guard takes the query from the analyst and evaluates this query, together with earlier queries, for privacy risk.
Step 3: After evaluating the privacy risk, the privacy guard gets the answer from the database.
Step 4: The privacy guard adds some distortion to the answer according to the evaluated privacy risk and finally provides it to the analyst.
End

(A minimal code sketch of this privacy-guard mechanism is given after the module descriptions below.)

# a) Access Restriction and Key Management Module

This module consists of several sub-modules that coordinate user and third-party application registration and access to data and information across the entire system. Because of the high velocity and large volume of data passing through the system, the module is designed to handle all split processes in parallel across a cluster of servers and to store and retrieve data across distributed storage devices. It uses MapReduce, a programming model for processing large data sets with a parallel, distributed algorithm across a cluster of servers; a short sketch in this style follows below.
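The paper names MapReduce as the processing model but does not show the jobs themselves. The following is a minimal, self-contained Python sketch of the map/shuffle/reduce pattern, using a hypothetical job that counts records per classification level; in a real deployment the shuffle and the distribution across servers would be handled by the cluster framework (e.g. Hadoop), not by the in-memory loop shown here.

```python
from collections import defaultdict

# Hypothetical records; in Big Data-ARpM these would arrive from the
# pre-processor, not from an in-memory list.
records = [
    {"id": 1, "level": "normal"},
    {"id": 2, "level": "confidential"},
    {"id": 3, "level": "normal"},
]

def map_phase(record):
    """Map: emit a (classification_level, 1) pair for each record."""
    yield record["level"], 1

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one classification level."""
    return key, sum(values)

# Shuffle: group intermediate pairs by key (done by the framework on a
# real MapReduce cluster).
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results)  # e.g. {'normal': 2, 'confidential': 1}
```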
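Algorithm 3 describes the privacy guard only at the level of steps and does not state how the distortion is generated or how it depends on the requesting user's level. The sketch below assumes the standard Laplace mechanism applied to a counting query, with a hypothetical mapping from user level to privacy budget (epsilon); it illustrates the idea rather than the authors' exact design.

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0):
    """Privacy-guard step 4: distort a counting-query answer.

    A counting query changes by at most 1 when a single record is added
    or removed, so sensitivity defaults to 1; smaller epsilon means more
    noise and stronger privacy.
    """
    return true_count + laplace_noise(sensitivity / epsilon)

# Hypothetical mapping from user level to privacy budget; the paper does
# not state how the level controls the amount of distortion.
EPSILON_BY_LEVEL = {"normal": 0.1, "trusted": 1.0}

def privacy_guard(query_result, user_level):
    """Steps 2-4 of Algorithm 3 in miniature: pick a budget for the
    requesting user's level and return a noisy answer."""
    epsilon = EPSILON_BY_LEVEL.get(user_level, 0.1)
    return private_count(query_result, epsilon)

print(privacy_guard(query_result=100, user_level="normal"))
```

Smaller values of epsilon add more noise and give stronger privacy, which matches the noise-versus-utility trade-off reported in the results section.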
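The Data Classification Module above specifies only the Normal level in detail: such users may view quasi-identifier attributes alone. The sketch below shows that documented case with a hypothetical list of QID columns and a hypothetical view_for_level helper; the behaviour of the two more sensitive levels is an assumption, not taken from the paper.

```python
# Hypothetical attribute split: which columns count as quasi-identifiers
# is dataset-specific; zip code, gender and birth date follow the QID
# examples given in the text.
QID_ATTRIBUTES = {"zip_code", "gender", "birth_date"}

def view_for_level(record, level):
    """Return the attributes of `record` that a user at `level` may see.

    Only the Normal level is specified in the text: such users see the
    quasi-identifier attributes and nothing else. Higher levels are
    assumed here, for illustration only, to see the full record.
    """
    if level == "normal":
        return {k: v for k, v in record.items() if k in QID_ATTRIBUTES}
    return dict(record)  # assumption: more sensitive levels pass through

record = {"zip_code": "98101", "gender": "F", "birth_date": "1980-01-01",
          "credential_number": "MD123", "status": "active"}
print(view_for_level(record, "normal"))
# {'zip_code': '98101', 'gender': 'F', 'birth_date': '1980-01-01'}
```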
# c) Request Management Module

This module handles all incoming requests from application users or third-party applications with the aid of the Access Restriction Module, which verifies the membership of the users. It also analyses each request to determine the level of information being requested and checks whether the user's level permits access to that level of information. After successful verification of the user, the user's query or request passes through the differential privacy technique, which denies the user direct access to the database.

# Data Set

A medical data set was used in the implementation of Big Data-ARpM. The data set, named Health Care Provider Credential Data, was downloaded from the open data source "data.wa.gov". It contains more than a million instances (records) and 12 attributes (columns).

The computation time is measured in milliseconds on the big data platform. The comparison shows that DP takes less computational time than k-Anonymity to complete privacy protection, with values of 0.4 and 0.45 milliseconds respectively. This indicates that DP is preferable when privacy protection of data is needed and processing time matters in big data analytics.

The results also show that DP produced more records that are useful for the analyst as the privacy level increased, whereas as the privacy level (k) increased, k-Anonymity produced fewer records, with lower utility than DP. For instance, at privacy levels of 20 and 60, DP returned a total of 100 and 300 records respectively, against 40 and 190 records produced by k-Anonymity; as the privacy level rises, DP generates more useful records that can be used for analysis while the confidentiality of the data is preserved. Although DP and k-Anonymity are compared at the same privacy level (k), the utility of the records generated by each algorithm differs, and DP produced more useful data than k-Anonymity.

# VII. # Results and Discussions

Figure 7: Comparison between the noise, privacy and utility levels

In this work, the DP approach applied a noise variant to achieve its purpose, as depicted in Figure 7, which compares the noise, privacy and utility levels. The privacy level indicates how well the data is protected from identification, the utility level indicates the usefulness of the data after noise has been added to users' queries, and the noise level is the privacy-balancing noise added to individual records based on the attributes of the data and the level of the requesting user. For example, when the noise level is 10, the privacy and utility levels are 20 and 95 respectively, showing that as more noise is added, privacy increases and the utility of the information presented to users decreases. The privacy and utility levels can also coincide: at a noise level of 35, privacy and utility are both 60. This means that DP with added noise yields privacy preservation of data with a more reasonable amount of utility than the k-Anonymity algorithm.

# VIII. # Conclusion

In this study, a conceptual privacy and access-restriction framework for securing big data was conceived by designing a data classification scheme according to the degree of confidentiality and a privacy preservation technique that enforces data privacy based on data aggregation and differential privacy.
Conclusively, Big Data-ARpM was evaluated on its utility, scalability, accuracy, sensitivity, specificity and processing time. The results show that Big Data-ARpM has very good utility, is highly scalable, and achieves an accuracy of 95.80%, a sensitivity of 93.60%, a specificity of 98.00% and an execution time of 0.4 milliseconds, compared with other privacy preservation techniques such as k-Anonymity. Hence, the use of the differential privacy technique in Big Data-ARpM shows that the framework performs better than frameworks that use other techniques.

# IX. # Recommendation

With the techniques presented in this work, the study can readily be extended to focus on other types of data, such as semi-structured and unstructured data. Finally, the presented framework can be built upon to accept larger files of different formats.

![Figure 1a: Big Data-ARpM framework](image-2.png)

![Figure 1b: Access Restriction and Key Management Module](image-3.png)

![Figure 1c: Request Management Module](image-4.png)

Figure 1: Preprocessor Algorithm

    Procedure Preprocessor(Record D) {
      // Column screen
      While (D hasValue) {          // loop through each field of D
        If (!isValidField(D_i)) {   // check that each field is non-empty and valid
          Return false
        }
      }
      // Structure screen
      If (D.length
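The Preprocessor procedure reproduced above breaks off at the structure screen. The following Python sketch covers the recoverable part, namely the field-by-field column screen with the out-of-range and impossible-combination checks described for the Data Pre-processor Module; the field-count test standing in for the truncated structure screen is a hypothetical completion, not the authors' code.

```python
def is_valid_field(name, value):
    """Column screen: reject empty values and obviously invalid ones,
    e.g. an out-of-range age such as -100."""
    if value is None or value == "":
        return False
    if name == "age":
        return isinstance(value, (int, float)) and 0 <= value <= 130
    return True

EXPECTED_FIELDS = 12  # the evaluation data set has 12 attributes (columns)

def preprocess(record):
    """Sketch of the Preprocessor procedure from Figure 1.

    Returns True if the record passes the column screen, a hypothetical
    structure screen, and a simple impossible-combination check; rules
    beyond the examples given in the paper are assumptions.
    """
    # Column screen: loop through each field of the record.
    for name, value in record.items():
        if not is_valid_field(name, value):
            return False
    # Structure screen (hypothetical): the original pseudocode breaks off
    # at a length test, read here as "expect a fixed number of fields".
    if len(record) != EXPECTED_FIELDS:
        return False
    # Impossible combination, e.g. (Sex: Male, Pregnant: Yes).
    if record.get("sex") == "male" and record.get("pregnant") == "yes":
        return False
    return True
```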