Secure and Privacy Preserving Data Mining Techniques

Because data mining proceeds in several steps, different privacy-preservation schemes and paradigms apply at each level of the process. The privacy-preserving data mining (PPDM) framework used here is inspired by Zhang and Zhao [4]. The main layers considered in this chapter are the data collection layer (DCL) and the DPL; the DPL can also be considered a sublayer. The final layer is the data mining output layer. All of these layers are shown in Figure 7.1.


FIGURE 7.1 A PPDM framework.

At the first layer, data is collected from a large number of sources. This is the raw, original data, and it contains users' private information. Privacy-preserving techniques are applied at the time of collection. The data is then stored in data warehouses in the next layer before processing.

In the second layer (DPL), there are a number of data warehouse servers. Here, the raw data is stored and aggregated. It may be aggregated (for example, as a sum or an average) or pre-computed using a method that ensures confidentiality, which makes the later data aggregation and fusion process faster.

The third layer is the data mining output layer (DML). Here, the results are obtained using a data mining algorithm. The goal is to keep the data mining methods efficient while also incorporating security and privacy preservation. There can be other forms of data mining as well, such as collaborative data mining. In that setting the database is shared, so it must be ensured that the data sets owned by multiple parties do not reveal each other’s information.

Privacy Preserving Techniques at Data Collection Layer

At this layer, security and privacy are at risk if sensitive data is collected by an untrustworthy collector. To prevent this, randomization is employed, which transforms the original data. The original data in its raw form is not stored, as it will not be used further. In its additive form, randomization can be described by Equation 7.1.

$$Z = X + Y \tag{7.1}$$

Here, X is the original raw data, Y is noise drawn from a known distribution (for example, Laplace noise or Gaussian noise), and Z is the result of randomizing X with Y.

To reconstruct the distribution of the original data X, an estimate of the distribution of Z is needed. This can be obtained from the samples $Z_1, Z_2, \ldots, Z_n$, where n is the sample size. Then, with the noise distribution Y, X can be reconstructed, as shown in Equation 7.2.

$$X = Z - Y \tag{7.2}$$

Additive Noise

In this method, the data is randomized by adding noise drawn from a known statistical distribution. The noise is added separately to each captured data point. The statistical properties of the data are preserved after the original distribution is reconstructed. Its usage is limited to aggregate distributions only. When extreme values (such as outliers) are to be masked, a large amount of noise is required, which severely degrades the data utility.

Multiplicative Noise

In this scheme, privacy is obtained by publishing the product of random noise and the given data value. Reconstructing an original individual data value is harder here than with additive noise, which makes this a more secure privacy mechanism. As with additive noise, the statistical properties of the data are preserved after the original distribution is reconstructed.
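The two randomization schemes can be illustrated with a minimal sketch. It assumes Gaussian noise with mean 0 for the additive case and mean 1 for the multiplicative case; the function names, the noise parameters, and the synthetic `ages` data are hypothetical choices made for illustration and do not come from the chapter.

```python
import random
import statistics

def additive_noise(values, sigma, rng):
    """Return Z = X + Y, with Y drawn per data point from N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in values]

def multiplicative_noise(values, sigma, rng):
    """Return Z = X * Y, with Y drawn per data point from N(1, sigma^2)."""
    return [x * rng.gauss(1.0, sigma) for x in values]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic stand-in for the original data X (assumption, not real data).
ages = [rng.randint(18, 80) for _ in range(10_000)]

z_add = additive_noise(ages, sigma=10.0, rng=rng)
z_mul = multiplicative_noise(ages, sigma=0.2, rng=rng)

# Individual values are perturbed, but the sample mean stays close to the
# original because the noise has mean 0 (additive) or mean 1 (multiplicative).
print(round(statistics.mean(ages), 2))
print(round(statistics.mean(z_add), 2))
print(round(statistics.mean(z_mul), 2))
```

Only the perturbed values would be released; the aggregate is recoverable because the noise distribution is known, while a single perturbed value says little about the original one.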

Privacy Preserving at Data Publishing Layer

Entities may seek to publish their data publicly for further research or analysis. However, malicious adversaries may attempt to de-anonymize the data or target record owners for malevolent purposes. There are several methods and privacy models for data anonymization. A sanitizing operation is applied to the data with the aim of protecting the record owner’s identity. Some of these operations are described below.

Generalization

This method works for both categorical and numerical data [5-9]. Here, the actual value is replaced by a more general parent value. For example, if the total number of students in a class is 60, the value can be reported as an interval that contains 60 rather than as the exact figure. For categorical data, a hierarchy of values is defined, and a value is replaced by one of its ancestors in that hierarchy.

Suppression

This method is applied to a data set in which either column or row values are suppressed. The suppression is done by removing some attributes, which helps prevent the disclosure of important information.

Anatomization

In many databases there are quasi-identifiers (QIDs). These act as pseudo-identifiers for a user. QIDs are not as sensitive as explicit identifiers. A QID can be formed by one attribute or by a set of attributes. In anatomization, the QIDs and the sensitive attributes are divided into two separate relations. This de-associates the sensitive data from the QID without changing the values of the data.
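The three sanitizing operations can be sketched on a toy table. Everything in this example is hypothetical: the records, the attribute names, the interval width used for generalization, and the grouping rule used for anatomization are assumptions chosen only to make the operations concrete.

```python
# Toy records (invented for illustration, not real data).
records = [
    {"name": "Alice", "age": 34, "zip": "47677", "disease": "flu"},
    {"name": "Bob", "age": 38, "zip": "47602", "disease": "asthma"},
    {"name": "Carol", "age": 45, "zip": "47905", "disease": "flu"},
    {"name": "Dave", "age": 49, "zip": "47909", "disease": "diabetes"},
]

def generalize_age(age, width=10):
    """Generalization: replace an exact age with the interval containing it."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def suppress(record, attributes):
    """Suppression: drop the listed attributes from a record."""
    return {k: v for k, v in record.items() if k not in attributes}

def anatomize(rows, qid_attrs, sensitive_attr, group_size=2):
    """Anatomization: split rows into a QID table and a sensitive table
    that are linked only through a group id. Values are left unchanged."""
    qid_table, sensitive_table = [], []
    for i, row in enumerate(rows):
        gid = i // group_size
        qid_row = {a: row[a] for a in qid_attrs}
        qid_row["group"] = gid
        qid_table.append(qid_row)
        sensitive_table.append({"group": gid, sensitive_attr: row[sensitive_attr]})
    # Sort within groups so row order does not link the two tables.
    sensitive_table.sort(key=lambda r: (r["group"], r[sensitive_attr]))
    return qid_table, sensitive_table

generalized = [dict(r, age=generalize_age(r["age"])) for r in records]
suppressed = [suppress(r, {"name"}) for r in records]
qid_table, sensitive_table = anatomize(suppressed, ["age", "zip"], "disease")

print(generalized[0]["age"])
print(suppressed[0])
print(qid_table[0], sensitive_table[0])
```

Generalization and suppression lose detail in the published values, whereas the anatomized tables keep every value intact and only weaken the link between QID and sensitive attribute to the group level.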

These are the most commonly used schemes in the DPL. Based on these, the following data privacy models are defined.

K-Anonymity

K-anonymity uses generalization and suppression as its sanitization methods. It was proposed by Samarati and Sweeney [10, 11]. Anonymity is ensured by the presence of at least k-1 indistinguishable records for each record in the database. Such a set of k records is known as an equivalence class. An attacker cannot single out one record and attack it, because k-1 other records look the same. It is a simple model, and a large body of work exists on algorithms for it. It works best when the value of k is high. It assumes that each record represents a distinct individual. The sensitive attributes play no role in anonymization, which can disclose information, especially if all records in a class have the same value for the sensitive attribute. Ignoring sensitive attributes has other consequences as well, such as de-anonymization of an entry when a QID is combined with knowledge of the attribute and the database. Its application domains include wireless sensor networks [12], location-based services [13, 14], the cloud [15], and e-health [16].

L-Diversity

L-diversity builds upon the k-anonymity model by requiring that each equivalence group contain at least l “well-represented” values for the sensitive attribute; it therefore uses generalization and suppression as its sanitization methods as well. For example, entropy l-diversity requires that, for every equivalence group Id:

$$-\sum_{t \in T} R(Id, t) \log R(Id, t) \ge \log(l)$$

Here, t ranges over the possible values of the sensitive attribute T, and R(Id, t) is the fraction of records in the equivalence group Id that have the value t for attribute T.
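A small check of both models may help make the definitions concrete. The sketch below assumes the table has already been generalized, groups records by their QID values, and then tests k-anonymity and the entropy form of l-diversity; the function names, the sample rows, and the floating-point tolerance are hypothetical.

```python
import math
from collections import Counter, defaultdict

def equivalence_groups(rows, qid_attrs):
    """Group rows that share the same values on all QID attributes."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in qid_attrs)].append(row)
    return groups

def is_k_anonymous(rows, qid_attrs, k):
    """True if every equivalence group holds at least k records."""
    return all(len(g) >= k for g in equivalence_groups(rows, qid_attrs).values())

def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def is_entropy_l_diverse(rows, qid_attrs, sensitive_attr, l):
    """True if every group's sensitive-value entropy is at least log(l)."""
    return all(
        entropy([r[sensitive_attr] for r in g]) >= math.log(l) - 1e-12
        for g in equivalence_groups(rows, qid_attrs).values()
    )

# Already-generalized toy table (invented for illustration).
rows = [
    {"age": "30-39", "zip": "476**", "insured": "yes"},
    {"age": "30-39", "zip": "476**", "insured": "no"},
    {"age": "40-49", "zip": "479**", "insured": "yes"},
    {"age": "40-49", "zip": "479**", "insured": "yes"},
]
qid = ["age", "zip"]

print(is_k_anonymous(rows, qid, k=2))
print(is_entropy_l_diverse(rows, qid, "insured", l=2))
```

The table is 2-anonymous, yet the second group contains a single sensitive value, so its entropy is zero and the table is not entropy 2-diverse. This is the disclosure risk that motivates l-diversity.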

Entropy l-diversity can be extended to multiple sensitive attributes by anatomizing the data [16]. The variation in the sensitive data is also taken into account while anonymizing. This can be understood through an example similar to that proposed by Fung et al. [17]. Consider a data set in which 93% of the entries have health insurance and 7% do not. An attacker who knows this original distribution of the sensitive attribute seeks to find the individuals who do not have insurance. Suppose an l-diverse group is formed. The maximum entropy within the group is achieved when 50% of its entries have health insurance and the remaining 50% do not. Its application domains include e-health [16] and location-based services [14, 18, 19].

Personalized Privacy

Personalized privacy is achieved by building a taxonomy tree using generalization and by allowing record owners to define a guarding node. An owner’s privacy is breached if a malicious user can infer any sensitive value from the subtree of the guarding node with a probability (the breach probability) greater than a certain threshold. In this model, owners can define their own level of privacy. It retains the most utility while still respecting individual privacy. However, it is hard to implement in practice, because it may be difficult to reach record owners and obtain their preferences. There can also be a tendency to overprotect data by choosing only general guarding nodes. Its application domains are social networks and location-based services.

Differential Privacy

All of the privacy models described above rely on assumptions about the adversary's background knowledge. Other models do not follow this paradigm, for example differential privacy, which achieves data privacy through data perturbation. In these methods, a small amount of noise is added to the true data, so that the actual data is masked from adversaries. The core idea of differential privacy is that the addition or removal of one record from the database does not reveal any information to an adversary. The step-by-step working of this core idea is illustrated in Figure 7.2.

This means that an individual's presence in or absence from the database does not reveal or leak any information about that individual. This achieves a strong sense of privacy.


FIGURE 7.2 Differential privacy mechanisms.

ε-Differential Privacy

ε-differential privacy ensures that a single record does not significantly influence the outcome of an analysis of the data set, with the degree of influence adjustable through the value of ε. In this sense, an individual’s privacy is not affected by taking part in the data collection, since participation does not have a large effect on the final result. The model provides a strong privacy guarantee and a well-defined measure of privacy loss. It also ensures that the participation of a single individual does not lead to a privacy breach greater than that arising from the non-participation of the same individual. However, there is no empirical guideline for setting ε, as a suitable value depends strongly on the data set. Privacy guarantees can also require heavy perturbation of numerical data, leading to output that is not useful. Its application domains include e-health, smart meters [20], and location-based services.

Formally, a randomized mechanism A satisfies ε-differential privacy if, for any two databases $D_1$ and $D_2$ that differ in at most one element, and for all sets of outputs $S \subseteq \mathrm{Range}(A)$,

$$\Pr[A(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[A(D_2) \in S]$$

where ε is the privacy parameter, also called the privacy budget or privacy level.
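One common way to meet this definition for a counting query is the Laplace mechanism, which is named here as an illustration; the chapter does not prescribe a specific mechanism. The sketch below is a minimal, unhardened example: it assumes a count query (whose result changes by at most 1 when one record is added or removed), and the function names, the toy databases `d1` and `d2`, and the value of `epsilon` are hypothetical.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Noisy count: a count has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def laplace_pdf(x, mu, scale):
    """Density of the Laplace distribution centred at mu."""
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

def over_40(age):
    return age > 40

rng = random.Random(0)
epsilon = 0.5
d1 = [23, 35, 41, 52, 67, 70]  # toy database of ages (assumption)
d2 = d1[:-1]                   # neighbouring database: one record removed

print(private_count(d1, over_40, epsilon, rng))
print(private_count(d2, over_40, epsilon, rng))

# The true counts are 4 and 3. For every output x on a grid, the ratio of
# output densities under d1 and d2 should not exceed e^epsilon.
worst_ratio = max(
    laplace_pdf(x / 10.0, 4, 1.0 / epsilon) / laplace_pdf(x / 10.0, 3, 1.0 / epsilon)
    for x in range(-100, 200)
)
print(worst_ratio <= math.exp(epsilon) + 1e-9)
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of accuracy, which is the utility trade-off noted above. Floating-point sampling of this kind is suitable for illustration only and is not a production-grade implementation.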
