Introduction

This abstract is written with the support of the text…
The main concept of cluster analysis (clustering) is partitioning a set of objects into groups called clusters. Each group should contain objects that are related in terms of similar characteristics, and objects from different groups should be as distinct as possible. The main distinguishing feature of clustering is that the list of groups is not defined in advance but emerges during the operation of the algorithm.
There are two main requirements for a successful clustering procedure:
The intra-cluster similarity should be high, i.e. the distance between objects within a cluster should be small.
The inter-cluster similarity should be low, i.e. the distance between objects belonging to different clusters should be large.
As soon as these two conditions are satisfied, the algorithm can be considered a successful one.
-The general scheme of the clustering method application-
The application of cluster analysis can, in general, be reduced to the following steps:
Selecting a sample of objects for clustering;
Defining (and normalizing, if necessary) the set of variables by which the sample objects will be evaluated;
Determining similarities between objects by calculating the attribute values;
Applying a cluster analysis method to create groups of similar objects (clusters);
Presenting the analysis results;
Adjusting the current metric and the clustering method for better results, if necessary.
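The normalization mentioned in step 2 can be sketched as follows. This is a minimal illustration, not part of the text: it assumes z-score normalization, a common choice that makes variables measured in different units contribute comparably to the distance calculation.

```python
def zscore(values):
    """Z-score normalize one variable: centre on the mean and scale
    by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return [0.0] * len(values)   # constant variable carries no information
    return [(v - mean) / std for v in values]

# After normalization the variable has mean 0 and unit spread:
zscore([10, 20, 30])   # approximately [-1.22, 0.0, 1.22]
```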
-Clustering similarity calculation-
There are several metrics that are used to determine cluster similarity:
Squared Euclidean distance
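For two objects p and q with n attributes, the squared Euclidean distance is the sum of squared attribute differences. A one-line sketch:

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance: sum of squared per-attribute differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

squared_euclidean((0, 0), (3, 4))   # 9 + 16 = 25
```

Dropping the square root keeps the same ordering of distances while being cheaper to compute, which is why this variant is common in clustering.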
In the current section, the most common methods of cluster analysis will be briefly described.
The core idea of this group of methods builds on the principle that objects tend to be more related to their nearest neighbours than to objects located far away. Therefore, the cluster description reduces to the maximum distance required to connect each part of the cluster. To visualize the working principle of this method, a dendrogram is used: a tree diagram that shows taxonomic relationships between groups. The classic example of such a tree is the classification of animals or plants.
There are two principal types of hierarchical clustering algorithms: agglomerative and divisive. The divisive approach works on the top-down principle: at the beginning, all observations start in a single cluster, which is then split into smaller clusters recursively. The agglomerative type is much more common: at the beginning, each observation starts in its own cluster, and then pairs of clusters are merged as one moves up the hierarchy.
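The agglomerative scheme can be sketched in a few lines. This is an illustrative implementation, not from the text; it assumes single linkage (distance between clusters = distance between their closest members) with squared Euclidean distance, and merges until a requested number of clusters remains.

```python
def single_linkage(points, k):
    """Agglomerative sketch: each point starts as its own cluster;
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]

    def dist(c1, c2):
        # single linkage: closest pair of members, squared Euclidean
        return min(sum((a - b) ** 2 for a, b in zip(p, q))
                   for p in c1 for q in c2)

    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```

Recording the order and distance of each merge would yield exactly the dendrogram described above.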
In general, the cluster-finding process starts from a set of M clusters, with each object belonging to its own cluster. Each cluster is represented by a tentative centroid: a vector that contains one number for each variable, where each number is the mean of that variable over the observations in the current cluster. The general algorithm then has the following form:
A random object is chosen as the centroid of the first cluster;
The similarity between the next object and each existing cluster centroid is calculated in accordance with the metrics specified above;
The object is added to the cluster if the highest calculated similarity is greater than the predefined threshold value;
If there are more unclustered objects, repeat from step 2.
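The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: "similarity" is taken as closeness under squared Euclidean distance (so the threshold becomes a maximum allowed distance), and an object that matches no existing cluster seeds a new one.

```python
def centroid(cluster):
    """Mean of each variable over the cluster's members."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n
                 for i in range(len(cluster[0])))

def sequential_clustering(objects, max_dist):
    """Threshold-based sequential scheme from the text (sketch)."""
    clusters = [[objects[0]]]                 # step 1: first object seeds cluster 1
    for obj in objects[1:]:
        # step 2: distance from the object to each existing centroid
        dists = [sum((a - b) ** 2 for a, b in zip(obj, centroid(c)))
                 for c in clusters]
        best = min(range(len(dists)), key=dists.__getitem__)
        # step 3: join the most similar cluster if within the threshold,
        # otherwise start a new cluster (an assumed fallback)
        if dists[best] <= max_dist:
            clusters[best].append(obj)
        else:
            clusters.append([obj])
    return clusters
```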
Obviously, there are a large number of algorithms based on these clustering methods. We will consider the working principle of one of the most popular, the k-means algorithm, as most of the remaining ones have a similar interpretation but much more situational optimization.
The k-means clustering is arguably the most popular and the simplest, but at the same time a rather inaccurate, clustering method in its classical implementation. It splits the whole set of elements of the vector space into a prescribed number of clusters K. The algorithm tends to minimize the total squared deviation of the points of each cluster from that cluster's centre. Each iteration recalculates the centre of mass for each cluster obtained in the previous step, then the vectors are reassigned to clusters once again in accordance with their similarity to the recalculated centres. The algorithm ends when no cluster changes occur at any iteration.
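The description above can be sketched in code. This is a minimal, illustrative implementation: it assumes squared Euclidean distance, random initial centres drawn from the data, and the convergence criterion given in the text (assignments stop changing).

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: assign each point to its nearest centre,
    recompute each centre as the mean of its cluster, repeat until
    assignments no longer change."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)           # random initial centres
    assignment = None
    for _ in range(iters):
        new = [min(range(k),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(p, centres[j])))
               for p in points]
        if new == assignment:                 # no cluster changes: converged
            break
        assignment = new
        for j in range(k):                    # recompute centres of mass
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:
                centres[j] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return centres, assignment
```

The sensitivity to the initial centres mentioned below is visible here: a different `seed` can lead to a different final partition on less well-separated data.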
As may be seen from the description, this algorithm is the most vivid representative of the partitioning method, and thus it inherits all of its basic concepts. On the one hand, it is quite fast in comparison with other algorithms, as its time complexity per iteration is linear in the number of objects. On the other hand, its drawbacks should be mentioned: the number of clusters must be predetermined, and the result is sensitive to the choice of the initial cluster centres.
The sequence of actions is quite similar to that of the partitioning method: