Implementation of Clustering Algorithms in RapidMiner

Abstract - In data mining, clustering can be considered one of the most important unsupervised learning techniques. Clustering is the process of grouping a set of physical (or abstract) objects into classes whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. In this paper, the implementation of clustering algorithms in RapidMiner is discussed.

Keywords - Clustering, k-means, DBSCAN, k-medoids, RapidMiner

INTRODUCTION
Clustering is used in data mining, information retrieval, text mining, web analysis, marketing and many other fields. Of the many clustering algorithms available, we have implemented three (k-means, DBSCAN and k-medoids) in RapidMiner. RapidMiner is a software platform that provides an integrated environment for machine learning, text and data mining, and predictive and business analytics. We applied the algorithms to the Iris dataset, which has five columns (attributes): sepal length, sepal width, petal length, petal width and species. Except for species, which is nominal (a 'polynominal' attribute in RapidMiner's terminology), all attributes are numeric.
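Although every experiment in this paper is carried out inside RapidMiner's graphical environment, the same dataset is also widely available elsewhere. Purely as a convenience for readers, the following Python sketch (using scikit-learn, which is outside the paper's toolchain) loads it:

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
X = iris.data                       # the four numeric attributes
y = iris.target_names[iris.target]  # the nominal species attribute
print(X.head())
print(pd.Series(y).value_counts())  # 50 examples of each species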

K-MEANS AND K-MEDOIDS
K-means clustering is an exclusive clustering algorithm: each object is assigned to precisely one of a set of clusters. The similarity between objects is based on a measure of the distance between them.
For the k-means algorithm we first need to introduce the notion of the center of a cluster, generally called its centroid. Assuming that we are using Euclidean distance or something similar as the measure, we can define the centroid of a cluster as the point for which each attribute value is the average of the values of the corresponding attribute over all the points in the cluster. The centroid of a cluster will sometimes be one of the points in the cluster, but frequently it will be an imaginary point, not part of the cluster itself, which we can take as marking its center. The algorithm has the following steps (a code sketch appears after the list):
1. Choose a value of k.
2. Select k objects in an arbitrary fashion.
3. Use these as the initial set of k centroids.
4. Assign each object to the cluster whose centroid it is nearest to.
5. Recalculate the centroids of the k clusters.
6. Repeat steps 4 and 5 until the centroids no longer move.
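For illustration only, since the paper itself relies on RapidMiner's built-in operator rather than hand-written code, these six steps can be sketched in plain NumPy (all names below are our own):

import numpy as np

def kmeans(X, k, rng=None, max_iter=100):
    # A minimal sketch of the six steps above, using Euclidean distance.
    rng = rng or np.random.default_rng(0)
    # Steps 1-3: k arbitrarily chosen objects serve as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 4: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recalculate the centroids (keeping the old one if a cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling kmeans(data, k=5) on a float NumPy array of the Iris measurements reproduces the flavour of the RapidMiner run described next.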
In RapidMiner, these steps can be carried out as follows:

First we retrieve the data and then apply the K-means operator. Because k-means works on numeric data, we also use the Nominal to Numerical operator.
We set the value of k to 5 and obtain the following output:

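For comparison, a rough Python equivalent of this process might look as follows; here pandas' get_dummies plays the role of the Nominal to Numerical operator, and scikit-learn's KMeans stands in for RapidMiner's, so details such as initialisation will differ:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df["species"] = iris.target_names[iris.target]

numeric = pd.get_dummies(df, columns=["species"])  # nominal -> numeric
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(numeric)
print(pd.Series(model.labels_).value_counts())     # size of each cluster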
The k-medoids algorithm works in a similar way, but the center of a cluster, here called its medoid, is always one of the actual points in the cluster. This is the major difference between the k-means and k-medoids algorithms: in k-means the centroid of a cluster will frequently be an imaginary point, not part of the cluster itself, which we take as marking its center, whereas a medoid is by definition a member of its cluster. A sketch of one simple variant follows.
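The following is again only an illustrative sketch; it implements a simple alternating ('Voronoi iteration') variant of k-medoids, which may well differ from the algorithm RapidMiner uses internally:

import numpy as np

def kmedoids(X, k, rng=None, max_iter=100):
    # Alternating k-medoids sketch: unlike a k-means centroid, each
    # cluster center (medoid) is always one of the actual data points.
    rng = rng or np.random.default_rng(0)
    # Pairwise Euclidean distances between all objects.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # Assign every object to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # Within each cluster, the point with the smallest total distance
        # to the other members becomes the new medoid.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids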
In RapidMiner this can be done as follows:


We again set the value of k to 5 and obtain the following output:

DBSCAN
DBSCAN's definition of a cluster is based on the notion of density-reachability. A point q is directly density-reachable from a point p if it is no farther away than a given distance epsilon (i.e. it is part of p's epsilon-neighbourhood) and if p is surrounded by sufficiently many points, so that one may consider p and q to be part of a cluster. q is called density-reachable (note the distinction from "directly density-reachable") from p if there is a sequence p(1), ..., p(n) of points with p(1) = p and p(n) = q, where each p(i+1) is directly density-reachable from p(i).
DBSCAN requires two parameters: epsilon and the minimum number of points required to form a cluster (minPts); in RapidMiner these are set through the epsilon and min points parameters respectively. DBSCAN starts with an arbitrary point that has not been visited. This point's epsilon-neighbourhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labelled as noise. Note that this point might later be found in a sufficiently sized epsilon-neighbourhood of a different point and hence be made part of a cluster.
If a point is found to be a dense part of a cluster, its epsilon-neighbourhood is also part of that cluster. Hence, all points found within that epsilon-neighbourhood are added, as are their own epsilon-neighbourhoods when they are also dense. This process continues until the density-connected cluster is completely found. Then a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of more noise.
If no id attribute is present, the operator creates one. The 'Cluster 0' assigned by the DBSCAN operator corresponds to the points labelled as noise, i.e. the points that have fewer than min points in their epsilon-neighbourhood.
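To make the procedure concrete, here is a textbook-style sketch (not RapidMiner's internal code); note how a point first labelled as noise can later be absorbed into a cluster as a border point, and how cluster 0 is reserved for noise to mirror the operator's output:

import numpy as np

NOISE = 0  # mirrors the 'Cluster 0' label the DBSCAN operator assigns to noise

def dbscan(X, eps, min_pts):
    # Textbook DBSCAN sketch; labels start at -1, meaning "not yet visited".
    n = len(X)
    # Pairwise Euclidean distances; adequate for small datasets such as Iris.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    labels = np.full(n, -1)
    cluster = NOISE
    for p in range(n):
        if labels[p] != -1:
            continue
        neighbours = np.where(D[p] <= eps)[0]
        if len(neighbours) < min_pts:
            labels[p] = NOISE  # provisional: may later join a cluster as a border point
            continue
        cluster += 1           # p is a core point, so start a new cluster
        labels[p] = cluster
        seeds = list(neighbours)
        i = 0
        while i < len(seeds):
            q = seeds[i]
            i += 1
            if labels[q] == NOISE:  # previously-noise point becomes a border point
                labels[q] = cluster
            if labels[q] != -1:
                continue
            labels[q] = cluster
            q_neighbours = np.where(D[q] <= eps)[0]
            if len(q_neighbours) >= min_pts:  # q is dense, so expand through it
                seeds.extend(q_neighbours)
    return labels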
In RapidMiner this can be shown as follows:

We set the value of epsilon to 10 and obtain the following output:

CONCLUSIONS
For small businesses, RapidMiner might provide the competitive edge needed to compete with larger organizations that have deeper pockets for BI/DW (Business Intelligence and Data Warehousing). For students, it is a great tool and a good opportunity to see BI/DW tools in action. Affordable tools exist, but one still has to learn the underlying concepts of BI/DW. This paper has given an implementation of clustering algorithms in RapidMiner.


