Essay: Data mining

Chapter 1
INTRODUCTION
1.1 INTRODUCTION
Many techniques in data mining and machine learning follow a gradient-descent paradigm in the iterative process of discovering a target function or decision model. For instance, neural networks generally perform a series of iterations to converge the weight coefficients of edges in the network, thus settling into a decision model. Many learning problems now have distributed input data, owing to the development of distributed computing environments. In such distributed scenarios, privacy often becomes a major concern. For example, if medical researchers want to apply machine learning to study health care problems, they need to collect the raw data from hospitals and the follow-up information from patients. The privacy of the patients must then be protected according to the privacy rules of the Health Insurance Portability and Accountability Act (HIPAA) [1], which establishes the regulations for the use and disclosure of Protected Health Information. A natural question is why the researchers would want to build a learning model (e.g., a neural network) without first collecting all the training data on one computer. If there is a learner trusted by all the data holders, then the trusted learner can accumulate the data first and build a learning model. However, in many real-world cases it is rather difficult to find such a trusted learner, since some data holders will always have concerns such as ‘What will you do with my data?’ and ‘Will you discover private information beyond the scope of the research?’ On the other hand, given today’s distributed and networked computing environments, alliances will greatly benefit scientific advances [2].
Researchers are interested in obtaining the result of cooperative learning even before they see the data from other parties. As a concrete example, progress in neuroscience could be boosted by linking data from labs around the world, but some researchers are reluctant to release their data to be exploited by others because of privacy and security concerns.
1.2 REPORT ORGANIZATION
This thesis describes the design and development of the Privacy Preserving Gradient Descent Method Applied for Neural Network with Distributed Datasets, as well as the procedures used to test its effectiveness. The report is organized as follows:
Chapter 2: Literature survey
This chapter reviews related work, with emphasis on privacy preservation for distributed datasets in solving classification problems.
Chapter 3: Problem definition
This chapter describes the problem definition and the goals of the proposed system.
Chapter 4: System Design and Implementation
This chapter covers the design and architecture of the system and the privacy preserving gradient descent method used to implement it. It also covers the metrics used to evaluate the system.
Chapter 5: Experimental results and Analysis
This chapter describes the experimental results obtained from the system, analyzes them, and compares the performance with other systems.
Chapter 6: Conclusion
This chapter summarizes the method used to preserve the privacy of the owners of distributed datasets when training a neural network. It also summarizes the experimental results obtained by comparing the different methods.

Chapter 2
LITERATURE SURVEY
This chapter reviews related work, with emphasis on different techniques for preserving the privacy of distributed datasets.
2.1 Present Theory and Practices
D. Agrawal and R. Srikant posed the problem of performing data analysis on distributed data sources with privacy constraints [4]. They used cryptographic tools to build a decision tree classifier efficiently and securely. A good number of data mining tasks have been studied with privacy protection in mind, for example classification [5] and clustering [6].
In particular, privacy-preserving solutions have been proposed for the following classification algorithms (to name a few): decision trees, the naive Bayes classifier [8], and the support vector machine (SVM) [9]. Generally speaking, existing works have taken either randomization-based or cryptography-based approaches [7]. Randomization-based approaches, by perturbing data, guarantee only a limited degree of privacy.
A. C. Yao proposed a general-purpose technique called secure multiparty computation [10]. Work on secure multiparty computation originates from Yao’s solution to the millionaire problem, in which two millionaires can find out who is richer without revealing the amount of their wealth. That work presents a protocol that can privately compute any probabilistic polynomial function. Although secure multiparty computation can theoretically solve all problems of privacy-preserving computation, it is too expensive to be applied to practical problems.
Cryptography-based approaches provide a better privacy guarantee than randomization-based approaches, but most of them are difficult to apply to very large databases because they are resource demanding. For example, although Laur et al. proposed an elegant solution for privacy-preserving SVM in [9], their protocols are based on circuit evaluation, which is considered very costly in practice.
L. Wan, W. K. Ng, S. Han, and V. C. S. Lee proposed a preliminary formulation of gradient descent with data privacy preservation [13]. They present two approaches, the stochastic approach and the least square approach, under different assumptions. Four protocols are proposed for the two approaches, incorporating various secure building blocks for both horizontally and vertically partitioned data.
2.2 The Limitations of Present Practices
The surveyed literature addresses privacy preserving data mining for performing data analysis on distributed datasets. In addition, the Privacy Preserving Gradient Descent Methods give a general formulation for preserving the privacy of distributed datasets when solving optimization problems. To our knowledge, no work in the surveyed literature applies the gradient descent method to a back propagation neural network with distributed datasets, i.e. horizontally and vertically partitioned datasets. The privacy preserving gradient descent method therefore needs to be applied to the neural network for solving classification problems.
Chapter 3
PROBLEM DEFINITION
3.1 PROBLEM DEFINITION
Researchers are interested in obtaining the result of cooperative learning over multiple parties’ data for solving classification problems, but some researchers are reluctant to release their data to be exploited by others because of privacy and security concerns. Therefore, there is a strong motivation for learners to develop a cooperative learning procedure with privacy preservation. Hence the goal is to design and implement a privacy preserving gradient descent method applied to a back propagation neural network with distributed datasets, i.e. horizontally and vertically partitioned datasets, for solving classification problems. The neural network is trained on the distributed datasets; when unknown samples arrive for testing, it classifies them to the desired output.
The goals of the dissertation are:
• Apply a standard non privacy preserving algorithm on a Back Propagation Neural Network with different datasets.
• Implement a cryptography technique for preserving the privacy of the owner’s dataset.
• Implement the Privacy preserving Least Square Approach Algorithm on a horizontally partitioned dataset for Back propagation Neural Network Training.
• Implement the Privacy preserving Least Square Approach Algorithm on a vertically partitioned dataset for Back propagation Neural Network Training.
• Analyze the results.
Chapter 4
SYSTEM DESIGN AND IMPLEMENTATION
The focus of this dissertation work is to implement a privacy-preserving distributed algorithm that securely computes the piecewise linear function for the neural network training process to obtain the desired output. The neural network is trained on distributed datasets for solving classification problems; when unknown samples arrive for testing, it classifies them to the desired output.
The Gradient Descent Method is used for updating the weight coefficients of edges in the neural network. This method has two approaches: the Stochastic approach and the Least Square approach. In this project we use the Least Square approach of the Gradient Descent Method.
This chapter highlights the architecture and implementation details of the dissertation work.
4.1 ARCHITECTURE OF SYSTEM
Figure 4.1 shows the system architecture, the different components, and the structure of the proposed system. Standard datasets from the UCI Machine Learning Repository are used to train the system. Three datasets are used for the implementation of the Least Square Approach: the Iris Flower Dataset, the Tennis Dataset, and the Wine Dataset.
The modules in the proposed system are as follows:
1. Implementing a standard non privacy preserving algorithm on Neural Network.
2. Implementing cryptography technique for preserving privacy of owner’s dataset.
3. Implementation of Privacy preserving Least Square Approach Algorithm on horizontal partitioned dataset for Back propagation Neural Network Training.
4. Implementation of Securely Computing the Piecewise Linear Sigmoid Function for Vertically partitioned dataset and Implementation of Privacy preserving Least Square Approach Algorithm on Vertically Partitioned dataset for Back propagation Neural Network Training.
5. Analysis of results.
Figure 4.1 Architecture of System
4.1.1 A standard non privacy preserving algorithm on Neural Network.
A non privacy preserving algorithm for training a back propagation neural network on the dataset is implemented for analysis purposes.

4.1.2 A cryptography technique for preserving privacy of owner’s dataset.
The main idea of the algorithm is to secure each step in the non-privacy-preserving gradient descent algorithm, which has two stages, i.e. feed forward and back propagation. In each step, neither the input data from the other party nor the intermediate results can be revealed.
Possible encryption mechanisms include the ElGamal encryption scheme, the Diffie-Hellman scheme, and the Paillier cryptosystem. This work makes use of the ElGamal encryption scheme.
4.1.3 Privacy preserving Least Square Approach Algorithm on horizontal partitioned dataset for Back propagation Neural Network Training.
In this module we implement the Least Square approach algorithm for a horizontally partitioned dataset, so that two parties can effectively obtain the exact output for classification of the data.

4.1.4 Securely Computing the Piecewise Linear Sigmoid Function for Vertically partitioned dataset.
In this module we implement the sigmoid function to obtain the exact output of the dataset and to guarantee privacy-preserving product computation between two parties using cryptography techniques.
To hide intermediate results such as the values of hidden-layer nodes, the two parties randomly share each result so that neither party can infer the original data from the intermediate results.
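The random sharing of an intermediate result can be illustrated with a minimal sketch. It assumes additive sharing modulo a public modulus, which is one common way to split a value between two parties; the text does not state which sharing scheme the system uses, and the names `MODULUS`, `split`, and `combine` are hypothetical.

```python
import secrets

# Hypothetical public modulus, chosen only for illustration.
MODULUS = 2**61 - 1

def split(value):
    """Split an integer into two random-looking shares (assumed additive sharing)."""
    share_one = secrets.randbelow(MODULUS)        # uniformly random share for party one
    share_two = (value - share_one) % MODULUS     # the remainder goes to party two
    return share_one, share_two

def combine(share_one, share_two):
    """Recover the original value; this requires both shares."""
    return (share_one + share_two) % MODULUS

# Made-up intermediate result, e.g. a scaled hidden-node value.
s1, s2 = split(42)
recovered = combine(s1, s2)
```

Under this assumption each share on its own is uniformly distributed, so a party holding only one share learns nothing about the intermediate value.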

4.1.5 Privacy preserving Least Square Approach Algorithm on Vertically Partitioned dataset for Back propagation Neural Network Training.
The Gradient descent method uses Least Square Approach to determine Weight (w) to best fit training data in the Back propagation Neural Network to obtain exact output by minimizing the error rates.
After the entire process of private training, without revealing any raw data to each other, the two parties jointly establish a neural network representing the properties of the union dataset used for solving classification problems.

4.1.6 Comparison study and analysis of results.
• Comparison of non privacy preserving algorithms with privacy preserving algorithms in terms of the error rates required for obtaining the desired output.
• Comparison study between horizontally partitioned cases and vertically partitioned cases in terms of the error rates required for obtaining the desired output.
• Testing the accuracy on test datasets for horizontally partitioned and vertically partitioned datasets.

4.2 IMPLEMENTATION DETAILS
The system consists of five modules, as shown in Figure 4.2. It starts from three training datasets, the Iris Flower Dataset, the Wine Dataset, and the Tennis Dataset, which are standard datasets taken from the UCI Machine Learning Repository. A back propagation neural network model is used to solve the classification problem. In the first module we train the neural network using the non privacy preserving gradient descent method with the least square approach, for analysis purposes. In the second module we implement the ElGamal cryptography technique, which is used by the privacy preserving gradient descent method with the least square approach for a neural network on vertically partitioned datasets. In the next module we implement the privacy preserving gradient descent method applied to a neural network with horizontally partitioned datasets. Finally we implement the privacy preserving gradient descent method applied to a neural network with vertically partitioned datasets. In the analysis phase we analyze the accuracy of each technique by testing sample data values.
Figure 4.2: DFD for proposed system
4.2.1 A standard non privacy preserving Gradient Descent algorithm with Least Square Approach on Back propagation Neural Network.
The first module implements the Gradient Descent Algorithm with the Least Square Approach.
Back propagation is applied for the Least Square Approach, where the number of iterations is decided by the tolerable error at which the training process is allowed to stop. The underlying theory and a sample hand simulation are shown in Figure 4.3.
Figure 4.3: Sample hand simulation of Neural Network
The connection we’re interested in is between neuron A (a hidden layer neuron) and neuron B (an output neuron) and has the weight $W_{AB}$. Figure 4.3 also shows another connection, between neurons A and C, but we’ll return to that later. The algorithm works like this:
1. First apply the inputs to the network and work out the output. Remember that this initial output could be anything, as the initial weights were random numbers.
2. Next work out the error for neuron B. The error is what you want minus what you actually get, in other words:
$\mathrm{Error}_B = \mathrm{Output}_B\,(1 - \mathrm{Output}_B)(\mathrm{Target}_B - \mathrm{Output}_B)$
3. The ‘Output(1 - Output)’ term is necessary in the equation because of the sigmoid function; if we were only using a threshold neuron it would just be (Target - Output).
4. Change the weight. Let $W^{+}_{AB}$ be the new (trained) weight and $W_{AB}$ be the initial weight.
$W^{+}_{AB} = W_{AB} + (\mathrm{Error}_B \times \mathrm{Output}_A)$
Notice that it is the output of the connecting neuron (neuron A) we use (not B). We update all the weights in the output layer in this way.
5. Calculate the errors for the hidden layer neurons. Unlike the output layer we can’t calculate these directly (because we don’t have a target), so we back propagate them from the output layer (hence the name of the algorithm). This is done by taking the errors from the output neurons and running them back through the weights to get the hidden layer errors. For example, if neuron A is connected as shown to B and C, then we take the errors from B and C to generate an error for A.
$\mathrm{Error}_A = \mathrm{Output}_A\,(1 - \mathrm{Output}_A)(\mathrm{Error}_B W_{AB} + \mathrm{Error}_C W_{AC})$
Again, the factor ‘Output(1 - Output)’ is present because of the sigmoid squashing function.
6. Having obtained the error for the hidden layer neurons, proceed as in step 4 to change the hidden layer weights. By repeating this method we can train a network of any number of layers.
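The hand simulation above can be traced numerically with a short sketch. It is an illustration only: the function name `backprop_step` and all numeric values are made up, and it assumes the hidden-layer error is computed from the weights as they were before the update, which the text does not specify.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(output_a, w_ab, w_ac, target_b, target_c):
    """One pass of steps 1-5 for hidden neuron A feeding output neurons B and C."""
    # Step 1: forward pass from A to the two output neurons.
    output_b = sigmoid(w_ab * output_a)
    output_c = sigmoid(w_ac * output_a)

    # Step 2: output errors, Output(1 - Output)(Target - Output).
    error_b = output_b * (1 - output_b) * (target_b - output_b)
    error_c = output_c * (1 - output_c) * (target_c - output_c)

    # Step 5: back propagate the output errors through the weights.
    # Assumption: the pre-update weights are used here.
    error_a = output_a * (1 - output_a) * (error_b * w_ab + error_c * w_ac)

    # Step 4: new weight = old weight + (error of the output neuron * Output_A).
    new_w_ab = w_ab + error_b * output_a
    new_w_ac = w_ac + error_c * output_a
    return new_w_ab, new_w_ac, error_a

# Made-up illustration values: A outputs 0.6; B should fire, C should not.
new_w_ab, new_w_ac, error_a = backprop_step(0.6, 0.4, -0.3, 1.0, 0.0)
```

With these values the weight towards B increases and the weight towards C decreases, which moves both outputs towards their targets.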

Figure 4.4 shows detailed Flowchart of Gradient Descent Algorithm with Least Square Approach on Back Propagation Neural network.

Figure 4.4 Flowcharts for Least Square Approach on Neural Network Training.

4.2.2 ElGamal cryptography technique for preserving privacy of owner’s dataset.
The second module implements the ElGamal encryption technique for preserving the privacy of the owner’s dataset.
The main idea of the algorithm is to secure each step in the non-privacy-preserving gradient descent algorithm, which has two stages, i.e. feed forward and back propagation. In each step, neither the input data from the other party nor the intermediate results can be revealed. Possible encryption mechanisms include the ElGamal encryption scheme, the Diffie-Hellman scheme, and the Paillier cryptosystem. This work makes use of the ElGamal encryption scheme.
In this module, we concentrate on implementing the ElGamal encryption mechanism and studying its properties. In cryptography, the ElGamal encryption system is an asymmetric key encryption algorithm for public-key cryptography which is based on the Diffie-Hellman key exchange. It was described by Taher Elgamal in 1985.
The ElGamal cryptosystem is usually used in a hybrid cryptosystem, i.e. the message itself is encrypted using a symmetric cryptosystem and ElGamal is then used to encrypt the key used for the symmetric cryptosystem. This is because asymmetric cryptosystems like ElGamal are usually slower than symmetric ones for the same level of security, so it is faster to encrypt the symmetric key (which is usually quite small compared to the size of the message) with ElGamal and the message (which can be arbitrarily large) with a symmetric cipher.
4.2.2.1 ElGamal encryption Algorithm:
The ElGamal encryption system is an asymmetric key encryption algorithm for public-key cryptography: anyone can encrypt with the public key, while decryption requires the secret key held by the receiver.
ElGamal encryption consists of three components: the key generator, the encryption algorithm, and the decryption algorithm.

1. Key generator
i. Generate a large prime $p$ and a generator $g$ of the multiplicative group of the integers modulo $p$.
ii. Select a random integer $a$, $1 \le a \le p-2$, and compute $g^a \bmod p$.
iii. The public key is $(p, g, g^a)$; the secret key is $a$.
2. Encryption
i. Obtain the public key $(p, g, g^a)$.
ii. Represent the message as an integer $m$ in the range $\{0, 1, \ldots, p-1\}$.
iii. Select a random integer $k$, $1 \le k \le p-2$.
iv. Compute $\gamma = g^k \bmod p$ and $\delta = m \cdot (g^a)^k \bmod p$.
v. Send the ciphertext $c = (\gamma, \delta)$.
3. Decryption
i. Use the secret key $a$ to compute $\gamma^{p-1-a} \bmod p$, which equals $\gamma^{-a} \bmod p$.
ii. Recover $m$ by computing $(\gamma^{-a}) \cdot \delta \bmod p$.

Example:
1. Key generation
e.g. Public key (17, 6, 7):
Prime $p = 17$
Generator $g = 6$
Secret key $a = 5$
Public key component $g^a \bmod p = 6^5 \bmod 17 = 7$.

2. Encryption
Message for encryption is $m = 13$
Chosen random $k = 10$
Computed $\gamma = g^k \bmod p = 6^{10} \bmod 17 = 15$
Encrypted message $\delta = m \cdot (g^a)^k \bmod p = (13 \cdot 7^{10}) \bmod 17 = 9$
The sender sends $\gamma = 15$ and $\delta = 9$ for decryption

3. Decryption
The receiver gets $\gamma = 15$ and $\delta = 9$
Public key is $(p, g, g^a) = (17, 6, 7)$
Secret key is $a = 5$.
Decryption factor
$\gamma^{-a} \bmod p = 15^{-5} \bmod 17 = 15^{11} \bmod 17 = 9$
Decryption
$(\gamma^{-a} \cdot \delta) \bmod p = (9 \cdot 9) \bmod 17 = 13$
Decrypted message is: 13
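The worked example can be reproduced with a minimal sketch of the three components. The function names `keygen`, `encrypt`, and `decrypt` are hypothetical, the optional arguments `a` and `k` exist only so the fixed values of the example can be replayed, and the tiny prime is for illustration and offers no security.

```python
import secrets

def keygen(p, g, a=None):
    """Return (public_key, secret_key) for prime p and generator g."""
    if a is None:
        a = secrets.randbelow(p - 2) + 1      # random a with 1 <= a <= p-2
    return (p, g, pow(g, a, p)), a

def encrypt(public_key, m, k=None):
    """Encrypt integer m (0 <= m <= p-1) into the pair (gamma, delta)."""
    p, g, g_a = public_key
    if k is None:
        k = secrets.randbelow(p - 2) + 1      # fresh random k with 1 <= k <= p-2
    gamma = pow(g, k, p)                      # gamma = g^k mod p
    delta = (m * pow(g_a, k, p)) % p          # delta = m * (g^a)^k mod p
    return gamma, delta

def decrypt(public_key, a, ciphertext):
    """Recover m from (gamma, delta) with secret key a."""
    p, _, _ = public_key
    gamma, delta = ciphertext
    factor = pow(gamma, p - 1 - a, p)         # gamma^(p-1-a) = gamma^(-a) mod p
    return (factor * delta) % p

# Replay the worked example: p = 17, g = 6, a = 5, m = 13, k = 10.
public_key, secret_key = keygen(17, 6, a=5)
ciphertext = encrypt(public_key, 13, k=10)
message = decrypt(public_key, secret_key, ciphertext)
```

Here `public_key` is (17, 6, 7), `ciphertext` is (15, 9), and `message` is 13, matching the hand calculation. Python's built-in three-argument `pow` performs the modular exponentiation.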

4.2.2.2 Properties of ElGamal Encryption Scheme:
Homomorphic Property: For two messages m1 and m2, an encryption of their product m1·m2 (modulo p, for the scheme described in Section 4.2.2.1) can be obtained by an operation on E(m1,r) and E(m2,r) without decrypting either of the two encrypted messages.
Probabilistic Property: Besides the clear text, the encryption operation also needs a random number as input. As a consequence, there is an operation that takes one encrypted message as input and outputs another encrypted message of the same clear message. This is called the re-randomization operation.
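Both properties can be checked on the toy parameters of the worked example. The sketch below assumes the textbook scheme of Section 4.2.2.1, where multiplying two ciphertexts component-wise yields an encryption of the product of the messages; the helper names `multiply` and `rerandomize` are hypothetical, and the text does not say how the implementation realizes these operations.

```python
# Toy parameters from the worked example; far too small to be secure.
P, G, A = 17, 6, 5
H = pow(G, A, P)                 # public component g^a mod p

def encrypt(m, k):
    return pow(G, k, P), (m * pow(H, k, P)) % P

def decrypt(c):
    gamma, delta = c
    return (pow(gamma, P - 1 - A, P) * delta) % P

def multiply(c1, c2):
    """Homomorphic operation: the result decrypts to m1 * m2 mod p."""
    return (c1[0] * c2[0]) % P, (c1[1] * c2[1]) % P

def rerandomize(c, k):
    """Multiply by an encryption of 1: same clear message, new ciphertext."""
    return multiply(c, encrypt(1, k))

# Homomorphic property on made-up messages 3 and 5.
product_cipher = multiply(encrypt(3, 4), encrypt(5, 7))
product = decrypt(product_cipher)

# Re-randomization of the ciphertext from the worked example.
original = encrypt(13, 10)
refreshed = rerandomize(original, 3)
```

The product ciphertext decrypts to 15 without either input being decrypted, and `refreshed` differs from `original` while still decrypting to 13.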

4.2.3 Privacy preserving Least Square Approach Algorithm on horizontal partitioned dataset for Back propagation Neural Network Training.
In this module, we concentrate on implementing the horizontal partition case for the Least Square Approach. The horizontal partition case is tested with two parties. Party one holds some records of the horizontally partitioned dataset, whereas Party two holds the remaining records.
1. Party one applies the least square approach for the back propagation neural network to its records. This generates the output weight vectors, which are stored in .csv file format.
2. These files contain the weight vector values, which Party one transfers to Party two. The weight vector values do not directly disclose any input records to the other party, and hence the privacy of the data elements is preserved.
3. Party two continues the training by reading these partially trained weight vector values and applying the least square approach for the back propagation neural network to its own records. The final weight vector values can be used by both parties for testing.
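The weight handoff in steps 1 to 3 can be sketched as follows. The file names in2hidMat.csv and hid2outMat.csv are the ones reported in the experimental results; the file layout (one matrix row per CSV row), the helper names, and the weight values are assumptions made for illustration, and the training itself is omitted.

```python
import csv
import os
import tempfile

def save_matrix(path, matrix):
    """Write a weight matrix as CSV, one matrix row per line (assumed layout)."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(matrix)

def load_matrix(path):
    """Read a weight matrix back as a list of rows of floats."""
    with open(path, newline="") as f:
        return [[float(x) for x in row] for row in csv.reader(f)]

# Party one: made-up weights for a 4-2-3 network after training on its records.
in2hid = [[0.12, -0.40], [0.33, 0.05], [-0.27, 0.61], [0.08, -0.19]]
hid2out = [[0.50, -0.10, 0.20], [-0.30, 0.40, 0.10]]

exchange_dir = tempfile.mkdtemp()            # stands in for the transfer channel
save_matrix(os.path.join(exchange_dir, "in2hidMat.csv"), in2hid)
save_matrix(os.path.join(exchange_dir, "hid2outMat.csv"), hid2out)

# Party two: start from the received weights instead of random ones,
# then continue training on its own records (not shown here).
start_in2hid = load_matrix(os.path.join(exchange_dir, "in2hidMat.csv"))
start_hid2out = load_matrix(os.path.join(exchange_dir, "hid2outMat.csv"))
```

Only the two weight files cross the boundary between the parties; the training records never leave their owner.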
4.2.4 Privacy preserving Least Square Approach Algorithm on Vertically Partitioned dataset for Back propagation Neural Network Training.
1. Each party applies the ElGamal encryption mechanism to convert its data into encrypted form, as explained in module 2.
2. All encrypted data is collected at a single repository.
3. The records are normalized to the range -1 to +1 using the following equation:
$I_n = (I - \mathrm{Min}) \cdot \dfrac{\mathrm{newMax} - \mathrm{newMin}}{\mathrm{Max} - \mathrm{Min}} + \mathrm{newMin}$
In our case newMin = -1 and newMax = +1.
4. Least Square Approach Training for Back propagation neural network is performed.
For better understanding, the back propagation learning algorithm can be divided into two phases: propagation and weight update.

Phase 1: Propagation
Each propagation involves the following steps:
• Forward propagation of a training pattern’s input through the neural network in order to generate the propagation’s output activations.
• Backward propagation of the propagation’s output activations through the neural network using the training pattern’s target in order to generate the deltas of all output and hidden neurons.
Phase 2: Weight update
For each weight-synapse follow the following steps:
• Multiply its output delta and input activation to get the gradient of the weight.
• Move the weight in the opposite direction of the gradient by subtracting a ratio of it from the weight.

This ratio influences the speed and quality of learning; it is called the learning rate. The sign of the gradient of a weight indicates where the error is increasing; this is why the weight must be updated in the opposite direction.
Repeat phase 1 and 2 until the performance of the network is satisfactory.
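The two phases and the role of the learning rate can be illustrated on the smallest possible case, a single sigmoid neuron with a bias. This is a sketch under stated assumptions, not the system's training code: the function `train`, the stopping tolerance, the starting weights, and the two training samples are all made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(samples, learning_rate=0.25, tolerance=0.01, max_iterations=10000):
    """Repeat phases 1 and 2 until the summed squared error is tolerable."""
    weights = [0.0, 0.0]                     # [bias weight, input weight], assumed start
    total_error = float("inf")
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        total_error = 0.0
        for x, target in samples:
            inputs = [1.0, x]                # constant 1.0 feeds the bias weight
            # Phase 1: forward propagation, then the delta of the output neuron.
            output = sigmoid(sum(w * i for w, i in zip(weights, inputs)))
            delta = (output - target) * output * (1 - output)
            total_error += 0.5 * (output - target) ** 2
            # Phase 2: gradient = delta * input activation; move against it.
            for j, activation in enumerate(inputs):
                gradient = delta * activation
                weights[j] -= learning_rate * gradient
        if total_error < tolerance:
            break
    return weights, iteration, total_error

# Made-up two-sample problem: negative input -> 0, positive input -> 1.
weights, iterations, error = train([(-1.0, 0.0), (1.0, 1.0)])
```

A smaller `learning_rate` takes smaller steps and therefore needs more iterations to reach the same tolerance, which is the trade-off described above.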
4.2.5 Comparison Study and Analysis.
In this module, we analyze the accuracy of the Non Privacy Preserving Least Square Approach, the Privacy Preserving Least Square Approach on the horizontally partitioned dataset, and the Privacy Preserving Least Square Approach on the vertically partitioned dataset. We also analyze the execution time of the algorithms, the number of iterations, and the learning rates.
4.2.6 CLASS DIAGRAM
Figure 4.5 shows the class diagram of the system. The UtilClass class contains the main method. The ANNLearner class implements the least square approach of the gradient descent method. The Dataset class prints the input and output values of the dataset. The Form class provides the graphical presentation of the system. The AttributeSet class prints the attributes of the datasets.

Figure 4.5: Class Diagram of system
4.2.7 SEQUENCE DIAGRAM
Figure 4.6 shows sequence diagram of the system. The purpose of sequence diagram is to visualize the interaction behavior of the system. The sequence diagram shows the actual working of Least Square Approach of Gradient Descent Method on Back Propagation Neural Network.

Figure 4.6: Sequence diagram of system

Chapter 5
EXPERIMENTAL RESULTS
This chapter describes the experimental datasets, the results of the system, and the evaluation of these results.
5.1 Experimental Data
Several standard datasets from the UCI Machine Learning Repository are available for experimental purposes. Three datasets are used for the implementation of the Least Square Approach of the Gradient Descent Method on a Back Propagation Neural Network.
1. Iris Flower Dataset- Numerical, Continuous and Multivariate
Dataset Information:
Input:
sepal-length continuous
sepal-width continuous
petal-length continuous
petal-width continuous
Output:
Iris Iris-setosa Iris-versicolor Iris-virginica
2. Tennis Dataset- Character, Weather Character Strings and Multivariate
Input:
Outlook Sunny Overcast Rain
Temperature Hot Mild Cool
Humidity High Normal
Wind Weak Strong
Output:
PlayTennis Yes No
3. Wine Dataset- Numerical, Continuous and Multivariate
Input:
Alcohol continuous
Malicacid continuous
Ash continuous
Alcalinityofash continuous
Magnesium continuous
phenols continuous
Flavanoids continuous
Nonflavanoid continuous
Proanthocyanins continuous
Color continuous
Hue continuous
OD280 continuous
Proline continuous
Output:
Wine-class 1 2 3

5.2 Experimental Results and Analysis:
Figure 5.1 shows the graphical user interface (GUI) of the implementation.

Figure 5.1: GUI Interface

5.2.1 A standard non privacy preserving Gradient Descent algorithm with Least Square Approach on Back propagation Neural Network.
In this module we set the execution parameters for the datasets as shown in Table 5.1.
| Sr. No. | Name of Dataset | Number of Hidden Units | Layer Structure | Learning Rate | Momentum | Number of Iterations |
|---|---|---|---|---|---|---|
| 1 | Iris | 2 | 4-2-3 | 0.25 | 0.1 | 450 |
| 2 | Tennis | 2 | 4-2-1 | 0.25 | 0.1 | 450 |
| 3 | Wine | 10 | 13-10-3 | 0.25 | 0.1 | 5000 |
Table 5.1 Execution Parameters for the datasets
The testing accuracy obtained with the above parameters is shown in Table 5.2.
| Sr. No. | Name of Dataset | Number of Test Samples | Correct Predictions | Incorrect Predictions | Percentage Accuracy |
|---|---|---|---|---|---|
| 1 | Iris | 50 | 49 | 1 | 98% |
| 2 | Tennis | 4 | 2 | 2 | 50% |
| 3 | Wine | 13 | 9 | 4 | 69.23% |
Table 5.2 Testing Accuracy for the datasets
Figure 5.2 shows the graph of testing accuracy of non privacy preserving gradient descent method applied on different datasets.

Figure 5.2: Percentage Accuracy on Datasets
The percentage accuracy on a dataset can be improved by training the network for longer with a lower learning rate. This behavior can be seen in the statistics shown in Table 5.3.
Name of Dataset: Tennis
| Learning Rate | Number of Iterations | Percentage Accuracy | Remarks |
|---|---|---|---|
| 0.25 | 450 | 50 | Sample 1 |
| 0.1 | 450 | 75 | Improvement in test results due to reduction of learning rate against Sample 1 |
| 0.25 | 1000 | 75 | Improvement in test results due to increase in number of iterations against Sample 1 |

Table 5.3 Improvement of percentage accuracy of Tennis dataset
The execution time of the algorithm is also recorded. The execution time increases with the number of iterations, as shown in Table 5.4.
Name of Dataset: Wine
| Number of Iterations | Learning Rate | Execution Time (Nanoseconds) | Remarks |
|---|---|---|---|
| 250 | 0.25 | 153.34 | Sample 1 |
| 250 | 0.01 | 158.22 | Result 1: The time increases with a decrease in learning rate; slow learning requires additional time compared with Sample 1. |
| 250 | 0.001 | 164.21 | Further reduction in learning rate increases the execution time, in line with Result 1. |
| 1000 | 0.25 | 544.69 | Result 2: The time increases with the number of iterations of the algorithm compared with Sample 1. |
| 2000 | 0.25 | 1035.18 | Further increase in the number of iterations also increases the execution time, in line with Result 2. |
Table 5.4 Execution time of algorithm for Tennis dataset
5.2.2 ElGamal cryptography technique for preserving privacy of owner’s dataset.
We apply the ElGamal encryption scheme to the Iris dataset. The results are shown in Figure 5.3.

Figure 5.3: ElGamal Encryption Scheme on Iris Dataset
In this module we concentrate on generating a dataset of truncated values, which is a basic requirement for the ElGamal encryption scheme because it operates on integers. Figure 5.4 shows the execution of the program that converts the dataset values to whole numbers and writes them into another file as a new dataset.

Figure 5.4: Conversion of dataset values to whole numbers
If neural network training is performed on this updated dataset, a certain loss of accuracy is observed. Let us quantify this accuracy loss due to truncation of the dataset. Table 5.4 lists the parameters used for training on the normal and the updated dataset.
| Parameter | Value |
|---|---|
| Name of Dataset | Iris |
| Total Number of Train Samples | 100 |
| Total Number of Test Samples | 50 |
| Number of Hidden Units | 2 |
| Learning Rate | 0.25 |
| Momentum | 0.1 |
| Number of Iterations | 450 |

Table 5.4 Execution Parameters used for normal and updated dataset

Table 5.5 shows the accuracy obtained when training on the Iris dataset and on the new Iris dataset (as needed for the ElGamal scheme).
| Name of Dataset | Iris | IrisNew (Encrypted) |
|---|---|---|
| Number of Test Samples | 50 | 50 |
| Number of Test Samples Correctly Calculated | 49 | 48 |
| Percentage Accuracy | 98% | 96% |

Table 5.5 Percentage Accuracy of datasets
From Table 5.5 we observe that a small loss of accuracy has to be accepted with the ElGamal encryption scheme, which is an acceptable figure.
5.2.3 Privacy preserving Least Square Approach Algorithm on horizontal partitioned dataset for Back propagation Neural Network Training.
In this module we implement the horizontal partition case for the Least Square Approach. The horizontal partition case is tested with two parties. Party one holds some records of the horizontally partitioned dataset, whereas Party two holds the remaining records. Figure 5.5 shows a screenshot of the number of records read by Party one.

Figure 5.5: screenshot of number of records read by Party one
The training is then carried out with the parameters shown in Figure 5.5. The weight vector values are generated from the 54 records of this Iris dataset, as shown in Figure 5.6.

Figure 5.6: weight vector values generated for iris dataset
After the learning process at Party one is completed, two files are exported:
1. in2hidMat.csv
2. hid2outMat.csv
These exported files contain the weight vector values, which Party one can transfer to Party two. The weight vector values do not directly disclose any input records to the other party, and hence the privacy of the data elements is preserved, as shown in Figure 5.7.

Figure 5.7: Screenshots of exported files containing weight vector values
The exported files are then given to Party two for its own training. Hence we transfer the files from Party one to Party two as shown in Figure 5.8.

Figure 5.8: Screenshots of exported files containing weight vector values at Party two
Party two continues the training by reading these partially trained weight vector values, and the final weight vector values can be used by both parties for testing, as shown in Figure 5.9.

Figure 5.9: Screenshots of training performed at Party two

Testing Results either at Party one or Party two are shown in Figure 5.10.

Figure 5.10: Screenshots of Testing performed at Party two
We executed the horizontally partitioned case on multiple datasets and compared the results with the non privacy preserving algorithm.
The testing accuracy obtained with the parameters mentioned above (Table 5.1) is shown in Table 5.5. The table compares the non privacy preserving algorithm with the privacy preserving algorithm on horizontally partitioned datasets using the gradient descent method.

| Name of Dataset | Number of Test Samples | Non Privacy Preserving Algorithm: Correct Predictions | Incorrect Predictions | Accuracy | Privacy Preserving Horizontal Algorithm for Least Square Approach: Correct Predictions | Incorrect Predictions | Accuracy |
|---|---|---|---|---|---|---|---|
| Iris | 50 | 49 | 1 | 98% | 47 | 2 | 94% |
| Wine | 13 | 9 | 4 | 69.23% | 9 | 4 | 69.23% |
| Tennis | 4 | 2 | 2 | 50% | 2 | 2 | 50% |

Table 5.5 Testing Accuracy of datasets

Figure 5.11 shows a graphical representation of the comparison between the non privacy preserving algorithm and the privacy preserving algorithm on horizontally partitioned datasets using the gradient descent method.
Figure 5.11: Comparison of testing results between non privacy and privacy preserving algorithm
From Table 5.5 it can be seen that there is no significant loss in accuracy if training is performed for an ample number of iterations. Security of the data elements across the horizontal partitions can thus be achieved without much compromise on the accuracy of the algorithm.

5.2.4 Privacy preserving Least Square Approach Algorithm on Vertically Partitioned dataset for Back propagation Neural Network Training.
In this module, we implemented the least square approach of the gradient descent method for the vertical partition case. The vertical partition case was run on a dataset that is divided between two parties, who together want to train a neural network without revealing the actual content of their data.
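The difference between the two partitioning schemes can be sketched as follows. This is an illustration only: the helper names and the split points are assumptions, and the four rows are a few Iris-style feature vectors used as sample input. In the horizontal case each party holds complete records for different samples, whereas in the vertical case each party holds different attributes of the same samples.

```python
def horizontal_partition(rows, split_at):
    """Each party gets whole records: the first rows go to party one."""
    return rows[:split_at], rows[split_at:]


def vertical_partition(rows, party_one_columns):
    """Each party gets every record, but only some of its attributes."""
    party_one = [row[:party_one_columns] for row in rows]
    party_two = [row[party_one_columns:] for row in rows]
    return party_one, party_two


# Illustrative feature vectors (sepal length, sepal width, petal length, petal width).
rows = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.4, 3.2, 4.5, 1.5),
]

h_one, h_two = horizontal_partition(rows, split_at=2)
v_one, v_two = vertical_partition(rows, party_one_columns=2)
print(h_one, h_two)
print(v_one, v_two)
```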

We make use of the ElGamal encryption algorithm to keep the data private. In cryptography, the ElGamal encryption system is an asymmetric key encryption algorithm for public-key cryptography, which is based on the Diffie-Hellman key exchange.
First, every party individually encrypts its data using the ElGamal encryption technique and sends it to the trusted third party, where the data is collected in encrypted form. To simulate this, we executed code that converts the data values into encrypted form using ElGamal encryption.
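The encryption step can be sketched with textbook ElGamal as follows. This is a toy illustration under stated assumptions, not the program shown in the screenshots: the prime P, the base G, the scaling of decimal values to integers by a factor of 10 and all function names are choices made here for the example, and the parameters are not suitable for real security.

```python
import secrets

# Toy parameters (assumptions for illustration): a Mersenne prime modulus
# and a small base. A real deployment needs vetted group parameters.
P = 2**127 - 1
G = 3


def generate_keys():
    """Return (private_key, public_key) with public_key = G^private_key mod P."""
    private_key = secrets.randbelow(P - 2) + 1
    return private_key, pow(G, private_key, P)


def encrypt(message, public_key):
    """Encrypt an integer 0 < message < P as the pair (c1, c2)."""
    if not 0 < message < P:
        raise ValueError("message must be an integer in the range 1..P-1")
    ephemeral = secrets.randbelow(P - 2) + 1
    c1 = pow(G, ephemeral, P)
    c2 = (message * pow(public_key, ephemeral, P)) % P
    return c1, c2


def decrypt(ciphertext, private_key):
    """Recover the message by removing the shared secret c1^private_key."""
    c1, c2 = ciphertext
    # By Fermat's little theorem, c1^(P-1-x) is the inverse of c1^x mod P.
    shared_inverse = pow(c1, P - 1 - private_key, P)
    return (c2 * shared_inverse) % P


def encode(value, scale=10):
    """Map a decimal attribute value to an integer (the scale is an assumption)."""
    return int(round(value * scale))


private_key, public_key = generate_keys()
record = [5.1, 3.5, 1.4, 0.2]  # one Iris-style record
encrypted = [encrypt(encode(v), public_key) for v in record]
decrypted = [decrypt(c, private_key) / 10 for c in encrypted]
print(decrypted)
```

Each ciphertext is a pair of large integers, which is consistent with storing the encrypted records in a big integer data type. Because a fresh random exponent is drawn for every value, encrypting the same value twice gives different ciphertexts.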
The dataset used for ElGamal encryption is the Iris dataset.
Train Record Set: 100 records
Test Record Set: 50 records.
Figure 5.11 shows the screenshot of execution of ElGamal encryption algorithm on Iris Dataset.

Figure 5.11: Screenshot of execution of ElGamal encryption algorithm on Iris Dataset
This program produces the encrypted values for all the data records. This simulation can be done by the individual parties.
Once all the data records are available in encrypted form as Big Integer values, they do not reveal any information to their users. The trusted learning party is also unable to infer any information from these data records, as shown in Figure 5.12.

Figure 5.12: Screenshot of Encrypted Iris dataset
Similarly, the test records can also be encrypted. Figure 5.13 shows the screenshot of big integer data records.

Figure 5.12: Screenshot of big integer data records
Next, all the data records need to be normalized in order to obtain higher accuracy from the neural network. Normalization is carried out in the range of -1 to +1, using the following equation:
In = (I - Min) * ((newMax - newMin) / (Max - Min)) + newMin
Here I is the original value, In is the normalized value, and Min and Max are the minimum and maximum of the original values. In our case, newMin = -1 and newMax = +1.
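The normalization equation can be sketched in code as follows. This is a minimal illustration: the function name and the handling of a constant column (where Max equals Min and the equation would divide by zero) are assumptions added here.

```python
def normalize(values, new_min=-1.0, new_max=1.0):
    """Min-max normalize a list of numbers into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Assumption: map a constant column to the middle of the target range.
        return [(new_min + new_max) / 2.0 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [(value - old_min) * scale + new_min for value in values]


print(normalize([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```

The smallest value in a column maps to -1, the largest to +1, and everything else falls linearly in between.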
Figure 5.13 shows the screenshots of normalized data records.

Figure 5.13: Screenshot of normalized data records
Figure 5.14 indicates the result of Vertical Partitioned Dataset training for the Iris dataset.

Figure 5.14: Screenshot of Vertical Partitioned Dataset training for the Iris dataset
5.2.4 Analysis of Work
We analyze the accuracy of the non-privacy-preserving least square approach, the privacy-preserving least square approach on the horizontal dataset, and the privacy-preserving least square approach on the vertical dataset. Table 5.6 compares the percentage testing accuracy of the three approaches.
                         Non Privacy Preserving             Privacy Preserving Horizontal        Privacy Preserving Vertical
                         Algorithm                          Algorithm for Least Square Approach  Algorithm for Least Square Approach
Dataset   Test Samples   Correct   Incorrect   Accuracy     Correct   Incorrect   Accuracy       Correct   Incorrect   Accuracy
Iris      50             49        1           98%          47        2           94%            42        8           84%
Wine      13             9         4           69.23%       9         4           69.23%         9         4           69.23%
Tennis    4              2         2           50%          2         2           50%            N.A       N.A         N.A

Table 5.6 Testing Accuracy of datasets
The Tennis dataset is a character-based dataset, whereas the ElGamal security scheme works with numeric data types. Therefore, Table 5.6 shows no result for the vertically partitioned Tennis dataset. Extending the vertical partition case so that character data can work with the encryption mechanism is left as future work.
Figure 5.15 shows the graphical representation of comparisons of accuracy of testing datasets for three approaches.

Figure 5.15: Graphical representation of comparisons of accuracy of testing datasets for three approaches.
From Figure 5.15 it is observed that there is no significant loss of accuracy if training is performed for an ample number of iterations; the largest drop is for Iris in the vertical case (84% against 98% for the non-privacy-preserving algorithm). The number of iterations and the learning rate were modified for the privacy-preserving vertical partition case because the cryptographic implementation causes some data loss. Still, at the cost of some learning performance, the data can be kept secure in the privacy-preserving vertically partitioned case.

Chapter 6
CONCLUSION
The gradient descent method is used for solving many optimization and learning problems. In this dissertation work, we presented a secure gradient descent method for training neural networks on distributed datasets. The gradient descent method has two approaches, namely the stochastic approach and the least square approach. Our work uses the least square approach, which is well suited for training neural networks. We use a back propagation neural network for training.
We worked on privacy-preserving protocols for securely performing the gradient descent method over vertically or horizontally partitioned data, based on the least square approach, between two trusted parties. Our experimental results show that the protocols are correct and preserve the privacy of each data holder. We also conducted experiments to compare the non-privacy-preserving least square approach with the privacy-preserving least square approaches for horizontally and vertically partitioned datasets. The experimental results show that the proposed least square approach of the gradient descent method for distributed datasets preserves the privacy of the individual dataset holders.

Future Work
For future work, we will extend our dissertation work on distributed datasets, i.e. horizontally and vertically partitioned datasets using the least square approach, to multiple parties. Further, we will extend the vertical partition case so that character data can work with the encryption mechanism.
References:
[1] HIPAA, National Standards to Protect the Privacy of Personal Health Information. [Online]. Available: http://www.hhs.gov/ocr/hipaa/finalreg.html
[2] M. Chicurel, ‘Databasing the brain,’ Nature, vol. 406, pp. 822-825, Aug. 2000.
[3] D. Agrawal and R. Srikant, ‘Privacy-preserving data mining,’ in Proc. ACM SIGMOD
[4] Y. Lindell and B. Pinkas, ‘Privacy preserving data mining,’ in Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 2000, vol. 1880, pp. 36-44.
[5] N. Zhang, S. Wang, and W. Zhao, ‘A new scheme on privacy-preserving data classification,’ in Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, 2005.
[6] G. Jagannathan and R. N. Wright, ‘Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,’ in Proc. ACM
[7] O. Goldreich, Foundations of Cryptography. Cambridge Univ. Press, 2001.
[8] R. Wright and Z. Yang, ‘Privacy-preserving Bayesian network structure computation on distributed heterogeneous data,’ in Proc. 10th ACM SIGKDD.
[9] H. Yu, X. Jiang, and J. Vaidya, ‘Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data,’ in Proc. Annu. ACM Symp. Appl. Comput., 2006.
[10] A. C. Yao, ‘Protocols for secure computations,’ in Proc. 23rd Annu. Symp. Found. Comput. Sci., Chicago, IL, Nov. 1982.
[11] M. Barni, C. Orlandi, and A. Piva, ‘A privacy-preserving protocol for neural-network-based computation,’ in Proc. 8th Workshop Multimedia Security, New York, 2006.
[12] A. Yao, ‘How to generate and exchange secrets,’ in Proc. 27th IEEE Symp. Found. Comput. Sci., 1986, pp. 162-167.
[13] L. Wan, W. K. Ng, S. Han, and V. C. S. Lee, ‘Privacy-preservation for gradient descent methods,’ IEEE Transactions on Knowledge and Data Engineering, 2010.

Source: Essay UK - http://doghouse.net/essays/information-technology/essay-data-mining-2/