Cost-effective internet of things privacy-aware data storage and real-time analysis

ABSTRACT


INTRODUCTION
In recent years, advances in wireless sensor networks (WSNs) have given birth to a new computing paradigm known as the internet of things (IoT) [1]. IoT is currently gaining momentum and is one of the emerging technologies of the 21st century. It is used to enhance the connection of people and things at any time, in any place, with anything and anyone, typically over some path or network and through any service. The aim of IoT is to build "a better world for human beings" [2], where the objects around us understand our wants, likes and needs. IoT also eases access to information. Today, the Internet provides a link for the exchange of data, information, opinions and news among over 100 countries [3]. The primary drivers of IoT are large organizations and industries that benefit greatly from the predictability and foresight afforded by the ability to monitor all objects through the service chains in which they are embedded. IoT has many remarkable applications in our day-to-day lives, including smart cars, home appliances and security, health-tracking wearable devices and weather monitors.
With the advent of IoT, there has been a tremendous proliferation of smart devices and applications that generate massive data, called "big data", on a daily or weekly basis depending on the application. The sources include sensors, devices, social media, temperature sensors and health care applications. They constantly generate a huge amount of structured, unstructured or semi-structured data [4], for which traditional databases are deemed insufficient in terms of storage, processing and analysis. However, the generated data is very useful to the organizations that own it, and data analysts play a critical role in improving its usefulness to support the growth of companies and to enhance decision making, day-to-day communication, and relationship and network building among customers. Currently, several organizations have benefited greatly from the development of IoT technology, and the large volumes of data generated are being assembled and transmitted from device to device, from device to business systems, and seldom from device to humans.
In spite of the benefits of IoT, handling this data has become a major challenge [5], and the technology faces several challenges, including security, privacy and scalability. These challenges arise in the storage, analysis and processing of large volumes of data emanating from numerous data sources or heterogeneous IoT devices [6]. Moreover, the generated data is highly prone to data and identity theft, manipulation of devices, falsification of data, and manipulation of servers or networks, owing to inappropriate privacy models for securing users' personal information. Therefore, to properly manage the high volume of data generated and to avoid data violation or misuse, proactive, privacy-preserving measures must be taken when storing, processing and publishing data, so as to prevent breaches of sensitive information and other privacy and security incidents. With advancements in information and communications technology (ICT) in the past few decades [7], various computing models or paradigms such as cloud computing have come to the limelight. Cloud computing facilities are centred on the "data centre" model, where networks of hundreds of thousands of servers are assembled to provide services.
In addition to the dedicated servers positioned in data centres, there are also billions of seldom-used personal computers (PCs) belonging to private owners and organizations worldwide, usually used for only a few hours per day [8]. Their massive unused compute and storage capabilities can be combined as a substitute cloud fabric for the provision of extensive cloud services, predominantly infrastructure services. Cloud computing (CC) is like a conveyor that carries information and data for various users and offers services that can be utilized at a low cost. Nowadays, the growth of large and small-scale companies depends largely on their data, and maintaining this data requires a lot of money and resources [9]. Most of these organisations cannot afford the huge cost and maintenance of in-house IT infrastructure and backup support services. Thus, cloud computing stands as a cheaper and better alternative for storing their generated data, owing to its storage efficiency and low maintenance and computational cost, which have attracted individuals, organisations and even governments in recent years. CC provides wide data accessibility: users can store information in the cloud and pay to retrieve it for further use when needed [10]. However, CC has enormous challenges of its own, and most consumers and establishments are uninformed about the third-party vulnerabilities of the data they store in the cloud.
Considering the above background and the nature of IoT-generated data, the data should be stored cost-effectively, processed and analysed in real time, and the personal information in it kept secure. Achieving these goals is the intention of this research. IoT devices generate a high volume of data on a daily or weekly basis, and handling this data has become a major challenge. The data requires huge storage space and real-time analysis for dynamic decision making. It is structured, unstructured or semi-structured [11], and traditional databases are considered insufficient for it in terms of storage, processing and analysis. Moreover, the data contains useful and meaningful hidden information whose behavioural patterns are very hard to detect. Thus, providing an appropriate storage architecture for the generated data and an algorithm for real-time data analysis is highly important to discover the hidden knowledge and aid dynamic decision making.
Moreover, the data generated contains important personal information of users, and this information is not protected. Such data can easily be collected and the personal information exploited to endanger the privacy of its owners. As data has become a valuable asset for promoting businesses and an effective source of decision making [12], security breaches, data leakage and cybercrime have also risen sharply worldwide owing to ubiquitous modes of access. For IoT-generated data, the intuition is that, although data cannot be completely secured, the privacy of the data owners should always be protected. Several privacy models and security approaches exist to protect data from unauthorized access, but each has strengths and limitations, and the limitations can easily be exploited. In particular, "k-anonymity (KA) fails to prevent the background knowledge and homogeneity attacks, suffers from attribute linkage and record linkage and long processing time [13], l-diversity is prone to skewness and similarity attacks while t-closeness (TC) loses the correlation between changed attributes since each attribute is generalised separately. In this case, the data utility is damaged when it is very small. Lastly, in differential privacy, data utility may be reduced, a data miner is only allowed to pose aggregate queries and the probability of attacking both the database by an adversary is not taken into account". Consequently, there is a need for a secure and effective privacy model to protect personal information in published data. This paper therefore uses a data privacy model combined with a cost-effective storage architecture for data management and data privacy. The proposed model helps to ensure that the voluminous data is effectively managed and ubiquitously accessible, and that personal information is well protected.
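
To make the limitations of the syntactic models concrete, the following minimal Python sketch measures the k of k-anonymity and the l of distinct l-diversity on a toy table. The records, attribute names and function names are hypothetical and are not taken from the paper's dataset or from any tool used in it.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return classes


def k_anonymity(records, quasi_identifiers):
    """Largest k for which the table is k-anonymous (smallest class size)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len(group) for group in classes.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Largest l for distinct l-diversity (fewest sensitive values in a class)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in classes.values())


# Hypothetical, already-generalised toy table (illustration only).
records = [
    {"age": "30-39", "gender": "F", "diagnosis": "diabetes"},
    {"age": "30-39", "gender": "F", "diagnosis": "diabetes"},
    {"age": "40-49", "gender": "M", "diagnosis": "diabetes"},
    {"age": "40-49", "gender": "M", "diagnosis": "asthma"},
]

k = k_anonymity(records, ["age", "gender"])
l = l_diversity(records, ["age", "gender"], "diagnosis")
```

In this toy table every record shares its quasi-identifiers with one other record, yet the first class holds a single diagnosis, so anyone who can place a person in that class learns the diagnosis. This is the homogeneity attack that k-anonymity alone does not prevent.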

RELATED WORKS
Several works in the literature address the internet of things and data privacy. Table 1 summarises existing work with remarks, solutions, models used, attacks and data utility. From our studies and previous research, it is evident that the differential privacy model has proven to be more secure. Additionally, information loss was observed across the four data privacy models used in this investigation, but the differential privacy model outperformed the others.

Table 1. Summary of related works on privacy models
Ref. | Remarks | Solutions | Models used | Attacks | Data utility
[14] | The study proposed the use of multiple differential privacy models on real-time analysis. | The approach helps to offer better and stronger data privacy protection. The results showed that CPU consumption, RAM usage and information loss were reduced. | SMR model; randomization; perturbation | N/A | Information loss
[16] | EHR systems are prone to privacy violations, especially when stored on healthcare medical servers. | The study provides a discussion of several anonymity techniques designed for preserving the privacy of microdata. | TC, LD and KA | N/A | Information loss
[17] | Data utility was low and the model cannot be recommended in many areas. | A novel model of protecting data was presented. | Slicing model, KA and Anatomy model | Skewness attack, sensitivity attack, similarity attack | Information loss
[18] | The study presented a personalized method of preserving data using the (α, ω)-anonymity model, exploring the use of QI attributes and sensitive attributes. | The core solution is that privacy is based on a measure of individuals' needs and requests, and this was fully achieved in the study. | | Similarity attack | Information loss
[19] | The study showed the various data privacy and security issues and possible solutions. | | Homomorphic | | Information loss
[22] | The proposed method helped to decrease the average re-identification risks between 100% and 2.33%. | The results show that re-identification risks are far lower, ranging from 100% to 2.33%. | δ-Presence, TC, LD and KA | Background knowledge attack and similarity attack | Information loss
[23] | The need to apply suitable privacy models to the published data becomes very necessary. | Semantic anonymization methods were proposed. | KA, LD | Background knowledge attack | Decreases the data utility
[24] | The need to use micro aggregation leading to adding and deleting some of the data and records is updated. | | | |

METHOD
Data were collected using a secondary-data approach, and the resulting datasets were then analysed. The researchers adopted qualitative research methods, which are used to understand and explain phenomena and to better interpret the data. An in-depth literature review was also carried out in order to identify the problem under study, to gain better background knowledge of the data to be analysed, and to find a better approach to solving the identified problem. Quantitative research deals with the numerical analysis of collected data for decision making. In this research, quantitative data were collected from our empirical analysis and simulations.

Tools and technologies
Three open-source software packages were used in our analysis: ARX, Orange3 and iFogSim. These tools help to present our research findings in a more meaningful way. ARX was used for the experiment; it supports transforming a dataset so that the data conforms to user-specified privacy models and risk thresholds that hinder attacks that may result in privacy breaches. ARX can be used to remove direct identifiers (e.g., names) from datasets and to put additional restrictions on indirect identifiers. Indirect identifiers (also called quasi-identifiers or keys) are attributes that do not directly identify a person but may combine with other indirect identifiers to produce an identifier that can be used for linkage attacks. The usual assumption is that indirect identifiers are accessible to a third party (as some form of background knowledge) and that they cannot simply be removed from the dataset (e.g., because they are required later for analyses). Lastly, ARX supports methods for protecting sensitive attributes against disclosure attacks using syntactic and semantic privacy models [25].
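
The two kinds of transformation described for ARX, removing direct identifiers and restricting indirect identifiers, can be illustrated with a minimal Python sketch. This is not ARX's API (ARX is a separate tool with its own interface); the record, the ten-year age bands and the function names are assumptions made for illustration only.

```python
def generalise_age(age, width=10):
    """Map an exact age to a band such as '30-39' (band width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymise(record, direct_identifiers=("name",)):
    """Drop direct identifiers and coarsen the 'age' quasi-identifier."""
    released = {k: v for k, v in record.items() if k not in direct_identifiers}
    released["age"] = generalise_age(record["age"])
    return released


# Hypothetical patient record (illustration only).
patient = {"name": "Jane Doe", "age": 34, "gender": "F", "income": 52000}
released = anonymise(patient)
```

The name is removed outright, while the age stays in the released record in a coarser form, so it remains usable for analysis but matches more individuals.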
Orange3 is open-source software implemented in the Python and C++ programming languages. It is a visual programming front-end for exploratory data analysis and visualization, and it supports files in .csv format. It is a component-based visual programming tool for data mining, ML and data analysis. Its components are called widgets, and they range from data visualization, subset selection and pre-processing to empirical evaluation of learning algorithms and predictive modelling.

iFogSim software
iFogSim is open-source software that was used to perform the simulation. iFogSim has different types of physical entities, such as devices (nodes), sensors and actuators. The logical entities used in modelling applications include the AppModule model for IoT services, the AppEdge model for data dependencies among services, and the Tuple model, which handles communication between entities. The simulations and results are presented in the results section.

Data privacy and analytics mode choice
This section is aimed at selecting the best-performing data privacy model. ARX was used to analyse the data, as it provides a platform on which data privacy models can be applied, tested and evaluated. As a proof of concept, this research used IoT data generated in healthcare as a case study. This is because about 60% of global healthcare organizations have incorporated IoT technology into their daily use to enhance the overall healthcare working environment. These IoT devices help healthcare practitioners and patients to monitor, track and trace patients' medical reports, analyse hospital details and record patients' health status consistently, which would otherwise be difficult for physicians alone to do. Accordingly, this greatly reduces the cost of healthcare and helps to minimise the chances of errors in patients' health records.
For the data privacy model, several existing models were tested: k-anonymity (KA), l-diversity (LD), t-closeness (TC) and differential privacy (DP). For cost-effective data storage, fog and cloud data centres were used, while an empirical analysis of some ML algorithms was conducted to select the best-performing algorithm, in terms of classification accuracy and time efficiency, for real-time data analysis that supports effective and reliable decision making. The idea is to automate the building of a data analytics model that uses the algorithm to learn from data interactively. By choosing the best model, decision making can be improved over time with less human intervention. The privacy models considered are shown in Table 2.
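
Of the four models named above, differential privacy is the one that relies on added noise rather than on generalising records. A minimal Python sketch of the textbook Laplace mechanism for a counting query follows. It is an illustration only: the paper does not state which mechanism its tooling applies, and the ages, the epsilon value and the function names are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Counting query (sensitivity 1) released with Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [34, 51, 47, 29, 62, 58]  # hypothetical data
noisy = private_count(ages, lambda a: a >= 50, epsilon=1.0, rng=rng)
```

A smaller epsilon means a larger noise scale, which strengthens privacy at the cost of accuracy. This is the utility trade-off attributed to differential privacy in the introduction.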
In this research implementation, the data analysis was performed qualitatively on the collected data using the metrics defined in Table 3, which are described below.

Prosecutor Attacker Method
The first stage of the re-identification risk model. It measures the thresholds of the attacker model and provides the records at risk, the highest risk level and the success rate of the anonymization process.

Journalist Attacker Method
The second stage of the re-identification risk model. It measures the thresholds of the attacker model and provides the records at risk, the highest risk level and the success rate of the anonymization.

Marketer Attacker Method
The final stage of the re-identification risk model. It measures the thresholds of the attacker model and provides the records at risk, the highest risk level and the success rate of the anonymization.

True positive (TP)
The classification model correctly classified a risky class as risky.

True negative (TN)
The classification model correctly classified a not risky class as not risky.

False positive (FP)
The classification model incorrectly classified a not risky class as risky.

False negative (FN)
The classification model incorrectly classified a risky class as not risky.
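
The four outcome counts can be tallied directly from actual and predicted labels. The following minimal Python sketch assumes that "risky" is the positive class, so that a false positive is a not risky record flagged as risky and a false negative is a risky record that was missed. The labels and the function name are illustrative.

```python
def confusion_counts(actual, predicted, positive="risky"):
    """Tally TP, TN, FP and FN, treating `positive` as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1      # risky, and flagged as risky
        elif p != positive and a != positive:
            counts["TN"] += 1      # not risky, and not flagged
        elif p == positive:
            counts["FP"] += 1      # not risky, but flagged as risky
        else:
            counts["FN"] += 1      # risky, but missed
    return counts


# Hypothetical labels (illustration only).
actual = ["risky", "risky", "safe", "safe", "risky", "safe"]
predicted = ["risky", "safe", "safe", "risky", "risky", "safe"]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
```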

RESULTS AND DISCUSSION
This section presents the results of the analysis for the four selected data privacy models. The analysis was performed using the metrics defined in Table 3. Accordingly, Table 4 shows the results for the AUC, Brier skill score (BSS) and risk analysis of the KA, LD, TC and DP data privacy models. ARX performed extensive measurements, and attacks were predicted from four attributes (Id, age, gender and income) of a diabetic dataset.

BSS
Table 4 shows the relative accuracy of the anonymization models: the BSS was 0.00037 for KA, 0.00931 for LD, -0.43760 for TC and 0.0506 for DP, so it ranges between -0.43760 and 0.0506. The indication is that all the models provided a high degree of protection for the given dataset. However, based on the results, TC is not recommended owing to its inability to handle large-scale datasets, as seen in the literature. The privacy-preserving models KA, LD and DP exhibited high protection power. Of all the values obtained, the BSS of 0.0506 for DP was the closest to 1, which suggests that the DP model performed best in terms of accuracy. To obtain a more efficient privacy model, DP can be combined with KA [13].
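
For reference, a Brier skill score compares the Brier score of a model with that of a trivial reference predictor. The following minimal Python sketch uses the common definition BSS = 1 - BS / BS_ref with the base rate as the reference. This is an assumption: the reference used by the tooling in the paper may differ, and the probabilities and outcomes below are hypothetical.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def brier_skill_score(probabilities, outcomes):
    """BSS relative to always predicting the base rate (undefined if outcomes are all equal)."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    return 1 - brier_score(probabilities, outcomes) / reference


outcomes = [1, 0, 1, 0]  # hypothetical ground truth
perfect = brier_skill_score([1.0, 0.0, 1.0, 0.0], outcomes)
baseline = brier_skill_score([0.5, 0.5, 0.5, 0.5], outcomes)
inverted = brier_skill_score([0.0, 1.0, 0.0, 1.0], outcomes)
```

Under this definition a score of 1 is a perfect prediction, 0 matches the reference, and negative values, such as the one reported for TC, are worse than the reference.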

Receiver operating characteristics curves
In the experiment conducted, the data privacy models trained on unmodified data attained a ROC AUC of about 53.61% for KA, 50.11% for LD, 46.62% for TC and 43.73% for DP. Compared to the initial performance, the relative ROC AUC was between 45.73% and 53.61%. This is shown in Figures 1-4 for each of the models considered in this study. KA appears to be the best performing, with the highest value of 53.61%. The implication is that KA offers more effective data protection in terms of anonymization than the other models considered. Thus, KA can be combined with DP to form a hybrid model that offers a high degree of protection, because DP is the most accurate in terms of the BSS and re-identification risk, while KA has a good threshold in terms of the ROC AUC. Combining the two privacy models could therefore offer stronger and more efficient privacy protection.
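
As a reminder of what the ROC AUC measures, it can be computed as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The following minimal Python sketch uses this pairwise definition; the scores and labels are hypothetical, and the function is not the routine used by the tools in the paper.

```python
def roc_auc(scores, labels):
    """ROC AUC as the share of positive-negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]  # hypothetical ground truth
auc = roc_auc([0.9, 0.4, 0.6, 0.1], labels)
chance = roc_auc([0.5, 0.5, 0.5, 0.5], labels)
```

A value of 0.5 corresponds to chance-level ranking, which is the yardstick against which the reported values of roughly 44% to 54% can be read.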

Re-identification risk
Table 5 summarizes the risks of all records in the dataset in terms of each possible risk level and the number of affected records. Based on the experiment conducted, the re-identification risk obtained was 0.52125 for KA, 0.16084 for LD, 0.10866 for TC and 0.08065 for DP. The value for DP (0.08065) is smaller than the values obtained for the other models, signifying its suitability for protecting the privacy of our data. As shown in Table 5, LD had 0.349 as the highest risk value for the three attack models, while the success rate for the models was 0.160. This success rate is lower than that of KA's attack models, suggesting that the LD privacy model can provide more efficient data privacy than KA. However, the pitfall of LD is that it is subject to both skewness and similarity attacks, cannot prevent attribute disclosure, and is susceptible to both homogeneity and background knowledge attacks. Moreover, for TC the three attack models have a value of 0 for records at risk, a highest risk of 0.338 and a success rate of 0.100. The low success rate of 0.100 suggests that using the TC privacy model on the anonymized data would provide more efficient privacy than KA and LD. However, TC is limited by the fact that, as the size and variety of the data increase, the chances of re-identification also increase.
In the same vein, the records at risk for DP are 0 for the three attack models, while the highest risks for the prosecutor attacker model and the journalist attacker model are 0.179 and 0.153 respectively, with success rates of 0.094, 0.081 and 0.081 respectively for the three attack models. The low success rates indicate that using the DP privacy model to anonymize data would provide more efficient privacy than KA, LD and TC. Thus, DP could be the most suitable model for preserving IoT data. In essence, DP does not degrade the system's speed compared to the other models. Privacy is preserved by making it cumbersome for an attacker to identify any person involved, even if the attacker knows the precise information of all the persons present in the dataset. Based on the results, a combination of DP and KA can provide a stronger data privacy model for securing the data, because together they offer more efficient privacy, as seen from the re-identification risk and BSS for DP and the ROC AUC analysis for KA [13], as shown in Table 5.
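
The three quantities reported per attacker model (records at risk, highest risk and success rate) can be illustrated from equivalence-class sizes. The following minimal Python sketch takes the sample-based view, in which a record's risk is one over the size of its class and the success rate is the average risk. It is not the computation performed by ARX: the journalist model additionally needs population data, and the threshold, records and function name here are assumptions.

```python
from collections import Counter


def reidentification_risks(records, quasi_identifiers, threshold):
    """Sample-based risks: a record's risk is 1 / (size of its equivalence class)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    per_record = [1 / sizes[key] for key in keys]
    return {
        # share of records whose risk exceeds the chosen threshold
        "records_at_risk": sum(risk > threshold for risk in per_record) / len(records),
        # risk of the record(s) in the smallest equivalence class
        "highest_risk": max(per_record),
        # average risk, i.e. number of classes divided by number of records
        "success_rate": sum(per_record) / len(records),
    }


# Hypothetical generalised records: one class of three and one class of two.
records = (
    [{"age": "30-39", "gender": "F"}] * 3
    + [{"age": "40-49", "gender": "M"}] * 2
)
risks = reidentification_risks(records, ["age", "gender"], threshold=0.4)
```

Larger equivalence classes lower all three quantities, which is consistent with reading a lower success rate as stronger protection.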

CONCLUSION
In conclusion, our results show that combining differential privacy and k-anonymity protects the data better: the hybrid privacy model proposed in this paper, designed from the two data privacy models (DP and KA), provides a stronger data privacy model and therefore enhances the protection of users' personal information. It is recommended that a novel data privacy model be developed that can both perform real-time analysis and protect the data from attack in any form. Furthermore, it is suggested that more of the currently used data privacy models be combined to see what effect this would have on dataset protection and whether information loss is reduced.

Int
internet of things privacy-aware data storage and real-time … (Femi Abiodun Elegbeleye)

The experimental metrics in Table 3 are used to show the performance of the proposed model on the collected data. The detailed experiments utilizing these metrics are in sections 4.1 to 4.3.

Table 2. Considered privacy models
Model | Motivation
KA | Implementation is easy and fewer chances of data identification.
LD | It summarizes data and prevents data attribute disclosure.
TC | It promotes sensitive value variation with a group, disclosure of attributes and skewness attacks prevention.
DP | Most effective privacy model, add noise without loss of information and minimize data utility.

Table 3. Data privacy parameters

Table 4. Summary of BSS, AUC, and risk analysis

Table 5 also records the risks for the prosecutor attacker model, the journalist attacker model and the marketer attacker model. Accordingly, KA has a value of 0 as the highest risk value for the prosecutor attacker model, while both the journalist attacker model and the marketer attacker model have 20, and the success rate is 0.521. This success rate is the highest value obtained, which indicates that using the KA privacy model to anonymize data leaves it vulnerable to the attacker. Thus, KA cannot provide efficient privacy for the data, which corroborates the finding in the literature that KA fails to prevent background knowledge attacks. KA is also vulnerable to matching, temporal, homogeneity and complementary release attacks.

Table 5. Attacker method risk analysis