03_privacyPreservingDataScience_Differential_Privacy

%md

###### by [Christoffer Långström](https://www.linkedin.com/in/christoffer-l%C3%A5ngstr%C3%B6m-7a460514b/)

This is part 3 of the in a series of notebooks discussing different aspects on privacy preserving methods in data science. 

Differential privacy, originally proposed by Cynthia Dwork et. al (2006) treats the problem of privacy from the perspective that simply being included in a data set is a risk. The solution proposed through diffential privacy is to make it so that a data set that contains a certain individual is not meaningfully differentiable from the same data set with that person removed. 

The proposal was to make it so that a random algorithm could be considered epsilon diff. private if the distribution of the estimator does not "change much" with the inclusion/exclusion of any particular individual, with epsilon being the "privacy factor" that quantifies to what extent the distributions differ. See below for more details. 

The advantage of this approach is that we can now quantify the degree of privacy of a random algorithm, and start to develop methods that enforce a desired level of privacy in an algorithm. Further, there is research being done (Duchi et. al) on *local* differential privacy, where the true data is kept hidden from even the data scientist working on it, by applying differentially private mechanisms to the data, before any statistics is computed. The type of privacy preserving mechanism that is applied depends on the application, where min/max optimal estimators are found for several canonical statistical problems. 

### Links:

* Introduction to DP in the context of network protocols:  
[Diffential Privacy at IETF104 by Amelia Andersdotter and Christoffer Långström](https://www.youtube.com/watch?v=PlSX-o36Wus&feature=youtu.be&t=3809)

* Locally private statistical procedures:  
[Minimax Optimal Procedures for Locally Private Estimation](https://arxiv.org/abs/1604.02390)

* Original DP paper:  
[Cynthia Dwork et. al](http://people.csail.mit.edu/asmith/PS/sensitivity-tcc-final.pdf)

by Christoffer Långström

This is part 3 of the in a series of notebooks discussing different aspects on privacy preserving methods in data science.

Differential privacy, originally proposed by Cynthia Dwork et. al (2006) treats the problem of privacy from the perspective that simply being included in a data set is a risk. The solution proposed through diffential privacy is to make it so that a data set that contains a certain individual is not meaningfully differentiable from the same data set with that person removed.

The proposal was to make it so that a random algorithm could be considered epsilon diff. private if the distribution of the estimator does not "change much" with the inclusion/exclusion of any particular individual, with epsilon being the "privacy factor" that quantifies to what extent the distributions differ. See below for more details.

The advantage of this approach is that we can now quantify the degree of privacy of a random algorithm, and start to develop methods that enforce a desired level of privacy in an algorithm. Further, there is research being done (Duchi et. al) on local differential privacy, where the true data is kept hidden from even the data scientist working on it, by applying differentially private mechanisms to the data, before any statistics is computed. The type of privacy preserving mechanism that is applied depends on the application, where min/max optimal estimators are found for several canonical statistical problems.

Links:

Introduction to DP in the context of network protocols:
Diffential Privacy at IETF104 by Amelia Andersdotter and Christoffer Långström
Locally private statistical procedures:
Minimax Optimal Procedures for Locally Private Estimation
Original DP paper:
Cynthia Dwork et. al

Show code

One of the first suggested approaches to enforcing e-differential privacy was by introducing "noise" to the ouput of estimators. This approach works as thus: For a dataset X, compute the statistic T(X), generate some noise W, return T(X) + W as the result. If W is symmetric noise, and chosen in such a way that it does not greatly distort the data, the goal is to achieve reasonably good estimates that cannot be meaningfully reversed engineered to reveal information on the people included in the data set.

The Laplace Mechanism

This approach to adding noisy data involves adding samples from a standard laplace distribution, with the scale parameter chosen depending on the sensitivity of the data and the desired privacy level. The laplace distribution is a symmetric "double exponential" distribution with slim tails and a sharp center, here taken to be at the origin. The algorithm for applying laplace noise to a data set is as follows:

Let D be a data set and T a statistic to be computed, and epsilon the desired degree of privacy.

Compute the sensitivity $\Delta(T(D))$ of the statistic T to changes in the data set D
Generate noise W = $Laplace(\Delta(T(D))/ \epsilon)$
Compute & return the result $T'(D) = T(D) + W$

The statistic T' is now an $\epsilon$ - differentially private version of T, as defined above.

Some notes:

For small values of $\epsilon$ (i.e high degrees of privacy) significant levels of noise is added and so the values of T' will exhibit large variance for repeated queries and may provide unaccetably poor performance in most applications.
This approach only protects against information being divulged from the output of T by an adversiarl party, the data still needs to be stored safely by a trusted holder. This is the problem adressed by local DP, discussed below.

%r 
#  Generate a data set
n = 200
heights = rnorm(n, 180, 5)

# Compute Sensitivity: 
#-------------------------------------------------------
f = rep(0, n)
for(i in 2:n-1){
    a = 1:(i-1)
    b = (i+1):n
    heights2 = c(heights[a], heights[b]) 
    f[i] = abs(mean(heights) - mean(heights2)) 
}

sensitivity = max(f)
print(max(f))
#-------------------------------------------------------

# Add noise: 
#-------------------------------------------------------
install.packages("rmutil")
install.packages("ggplot2")
require(rmutil)
require(ggplot2)

epsilon1 = 1
epsilon2 = 0.25
mse1 = rep(0,n)
mse2 = rep(0,n)

for(i in 1:n){
  dat = heights[1:i]
  out1 = mean(dat) + rlaplace(1, 0, sensitivity/epsilon1)
  out2 = mean(dat) + rlaplace(1, 0, sensitivity/epsilon2)
  mse1[i] = (180 - out1)^2
  mse2[i] = (180 - out2)^2
}

# Plot the results
x = 1:n
df = data.frame(x, mse1, mse2)

p = ggplot(df, aes(x)) +                    # basic graphical object
  geom_line(aes(y=mse1), colour= "blue") +  # first layer
  geom_line(aes(y=mse2), colour= "orange") # second layer
p + labs(title = "Effect of Laplace Noise on MSE(Sample Size)") + xlab("Sample Size") + ylab("MSE")


#-------------------------------------------------------

%md 
#### Experiment setup: 
As an example application, [Duchi et. al](https://arxiv.org/abs/1604.02390) considers the problem of estimating means of drug use related hospitalizations. In the original problem, they have records of hospitalization of the form (drugs, year, # of hospitalizations) where the number of hospitalizations denotes how many people that year where admitted and used a certain drug. Many of the admitted individuals used several drugs, as the number of drugs were vastly larger than the number of admittances. We will use this information to generate a data set of measurements of the form  \\( X_i \in {0,1}^d \\) representing one admission; a component *j* is 1 if the patient used drug *j* and 0 else. The usage frequency will be uniformly distributed over the samples generated, with only the constraint 
that the total number of admittances using drug *j* matches a preset number. This means that if there were *n* patients and *k* of them used drug *j* then \\( X_ij = 1 \\) with probability *k/n*. 


#### Methods: 
We have N = 1000 measurements of \\( X_i \in {0,1}^d \\) for *d = 27*, with a preset vector of admittance frequency. The problem is to estimate the mean of this population, \\( \theta = E(X) \\) and we will use three different approaches: 
* Non-private estimation
* Min/Max optimal local privacy mechanism
* Local Laplace Noise 

Non-private estimation is simply the sample mean of the raw data, which is the "gold standard" of estimates that private methods are necesserally worse than. The min/max locally private mechanism consists of generating a new sample from the raw data which is sequentially dependent, in such a way that the new sample is differntially private and min/max optimal w.r.t to an arbitray loss function. This can be compared by the theoretical lower bound obtained in the paper. Lastly, one can compare this to a "naive" local privacy mechanism of adding *laplace(epsilon/d)* distributed noise to the data before estimation. Each method is analyzed through the l-infinity norm of the deviation from the full sample mean, as a function of the sample size.

Experiment setup:

As an example application, Duchi et. al considers the problem of estimating means of drug use related hospitalizations. In the original problem, they have records of hospitalization of the form (drugs, year, # of hospitalizations) where the number of hospitalizations denotes how many people that year where admitted and used a certain drug. Many of the admitted individuals used several drugs, as the number of drugs were vastly larger than the number of admittances. We will use this information to generate a data set of measurements of the form $X_i \in {0,1}^d$ representing one admission; a component j is 1 if the patient used drug j and 0 else. The usage frequency will be uniformly distributed over the samples generated, with only the constraint that the total number of admittances using drug j matches a preset number. This means that if there were n patients and k of them used drug j then $X_ij = 1$ with probability k/n.

Methods:

We have N = 1000 measurements of $X_i \in {0,1}^d$ for d = 27, with a preset vector of admittance frequency. The problem is to estimate the mean of this population, $\theta = E(X)$ and we will use three different approaches:

Non-private estimation
Min/Max optimal local privacy mechanism
Local Laplace Noise

Non-private estimation is simply the sample mean of the raw data, which is the "gold standard" of estimates that private methods are necesserally worse than. The min/max locally private mechanism consists of generating a new sample from the raw data which is sequentially dependent, in such a way that the new sample is differntially private and min/max optimal w.r.t to an arbitray loss function. This can be compared by the theoretical lower bound obtained in the paper. Lastly, one can compare this to a "naive" local privacy mechanism of adding laplace(epsilon/d) distributed noise to the data before estimation. Each method is analyzed through the l-infinity norm of the deviation from the full sample mean, as a function of the sample size.

%r

#l-infinity sampling based privacy mechanism, as suggested by
l_inf_samp = function(X, alpha){

  n = length(X[,1])
  d = length(X[1,])
  samples = matrix(c(rep(0, n*d)), ncol = d, nrow = n)
  sample = rep(0,d)
  for(j in 1:n){
    x = X[j,]
    r = max(abs(x)) 
    x_hat = rep(0,d)

    for(i in 1:d){
      p = 0.5 + x[i]/(2*(r+(r==0)))
      y = rbinom(1, 1, p)
      x_hat[i] = (y==1)*r - (y==0)*r
    }
    
    A = 2^(d-1)
    
    if(d%%2 == 1){
      C = (1/A)*choose(d-1, 0.5*(d-1))
    } else{
      C = 1/(A+0.5*choose(d, 0.5*d))*choose(d-1, 0.5*d)
    }
    
    B = r*((exp(alpha)+1)/(exp(alpha)-1))*(1/C)
    
    pi_alph = exp(alpha)/(exp(alpha) + 1)
    T_alph = rbinom(1, 1, pi_alph)
    accept = FALSE
    
    while(!accept){
      z = runif(d, -B, B)
      dot = sum(z*x_hat)
      if((dot >= 0)&&(T_alph == 1)){
        sample = z
        accept = TRUE
      }
      else if((dot <= 0)&&(T_alph == 0)){
        sample = z 
        accept = TRUE
      }
    }
    samples[j,] = sample
  }
  return(samples)
}


my_mean = function(X){
  v_sum = rep(0,d)
  N = length(X[,1])
  for(i in 1:N){
    v_sum = v_sum + X[i,]  
  }
  return(v_sum/N)
}


gen_data = function(N,d,prop){
  X = matrix(c(rep(0,d*N)), ncol=d, nrow=N)
  for(i in 1:d){
    count = 0
    while(count < prop[i]){
      idx = floor(runif(1,1,N))
      if(X[idx,i] == 0){
        X[idx,i] = 1
        count = count + 1
      }
    }
  }
  return(X)
}

%r
library(rmutil)
library(ggplot2)

# Set Parameters 
d = 27
alpha = 0.5

N_max = 1000
N_low = 100

# Initialize storage
error1 = rep(0, N_max-N_low)
error2 = rep(0, N_max-N_low)
error3 = rep(0, N_max-N_low)
low_bound = rep(0, N_max-N_low)

# Generate Data 
proportions = c(6, 2, 4, 2, 1, 1, 2, 6, 5, 4, 1, 9, 2, 6, 2, 1, 6, 2, 8, 5, 2, 8, 7, 5, 6, 4, 2)*N_max/15
X = gen_data(N_max, d, proportions)
theta_true = my_mean(X)

for(i in N_low:N_max){
  N = i
  
  X_n = X[1:i,]
  Z = l_inf_samp(X_n, alpha)
  
  W = matrix(rlaplace(N*d, 0, d/alpha), ncol = d, nrow = N)
  
  theta1 = my_mean(X_n)
  theta2 = my_mean(Z)
  theta3 = my_mean(X_n + W)
  
  error1[i-N_low+1] = max(abs(theta_true - theta1))
  error2[i-N_low+1] = max(abs(theta_true - theta2)) 
  error3[i-N_low+1] = max(abs(theta_true - theta3)) 

  a = d*log(2*d)
  b = (exp(alpha)-1)^2
  low_bound[i-N_low+1]  = sqrt(a/(N*b))
}

# Plot the results
x = N_low:(N_max)
df = data.frame(x, error2, error3)

p = ggplot(df, aes(x)) +                    # basic graphical object
  geom_line(aes(y=error1), colour= "red") +  # Non-private estimate
  geom_line(aes(y=error2), colour= "blue") + # Locally Private estimate
  geom_line(aes(y=error3), colour= "orange") + # Laplace Mechanism
  geom_line(aes(y=low_bound), colour= "black")  # Min/Max optimal bound

p +  ggtitle("L-inf Error of different estimation methods as a function of sample size") +
  xlab("Sample Size") + ylab("l-inf Error") + labs()

SDS-2.x, Scalable Data Engineering Science