Introduction
Cluster analysis
is the process of grouping objects into subsets that have meaning in the context of a particular problem. The objects are thereby organized into an efficient representation that characterizes the population being sampled. Unlike classification, clustering does not rely on predefined classes. Clustering is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. It can uncover previously undetected relationships in a complex data set. Many applications for cluster analysis exist. For example, in a business application, cluster analysis can be used to discover and characterize customer groups for marketing purposes.
Two types of clustering algorithms are nonhierarchical and hierarchical. In nonhierarchical clustering, such as the kmeans algorithm, the relationship between clusters is undetermined. Hierarchical clustering repeatedly links pairs of clusters until every data object is included in the hierarchy. With both of these approaches, an important issue is how to determine the similarity between two objects, so that clusters can be formed from objects with a high similarity to each other. Commonly, distance functions, such as the Manhattan and Euclidian distance functions, are used to determine similarity. A distance function yields a higher value for pairs of objects that are less similar to one another. Sometimes a similarity function is used instead, which yields higher values for pairs that are more similar.
Distance Functions
Given two pdimensional data objects i = (x_{i}_{1},x_{i}_{2}, ...,x_{ip}) and j = (x_{j}_{1},x_{j}_{2}, ...,x_{jp}), the following common distance functions can be defined:
Euclidian Distance Function:
Manhattan
Distance Function:
When using the Euclidian distance function to compare distances, it is not necessary to calculate the square root because distances are always positive numbers and as such, for two distances, d_{1} and d_{2},
Ö
d
_{
1
}
>
Ö
d
_{
2
}
Û
d
_{1} > d_{2}. If some of an object's attributes are measured along different scales, so when using the Euclidian distance function, attributes with larger scales of measurement may overwhelm attributes measured on a smaller scale. To prevent this problem, the attribute values are often normalized to lie between 0 and 1.
Other distance functions may be more appropriate for some data.
k
means Algorithm
The kmeans algorithm is one of a group of algorithms called partitioning methods. The problem of partitional clustering can be formally stated as follows: Given n objects in a ddimensional metric space, determine a partition of the objects into k groups, or clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters. Recall that a partition divides a set into disjoint parts that together include all members of the set. The value of k may or may not be specified and a clustering criterion, typically the squarederror criterion, must be adopted.
The solution to this problem is straightforward. Select a clustering criterion, then for each data object select the cluster that optimizes the criterion. The kmeans algorithm initializes k clusters by arbitrarily selecting one object to represent each cluster. Each of the remaining objects are assigned to a cluster and the clustering criterion is used to calculate the cluster mean. These means are used as the new cluster points and each object is reassigned to the cluster that it is most similar to. This continues until there is no longer a change when the clusters are recalculated. The algorithm is shown in Figure 1.
k
means Algorithm:

Select k clusters arbitrarily.

Initialize cluster centers with those k clusters.

Do loop
a) Partition by assigning or reassigning all data objects to their closest cluster center. b) Compute new cluster centers as mean value of the objects in each cluster. Until no change in cluster center calculation

Figure 1: kmeans Algorithm
Figure 2: Sample Data
Example:
Suppose we are given the data in Figure 2 as input and we choose k=2 and the Manhattan distance function dis = x_{2}x_{1}+y_{2}y_{1}. The detailed computation is as follows:
Step 1. Initialize k partitions. An initial partition can be formed by first specifying a set of k seed points. The seed points could be the first k objects or k objects chosen randomly from the input objects. We choose the first two objects as seed points and initialize the clusters as C_{1}={(0,0)} and C_{2}={(0,1)}.
Step 2. Since there is only one point in each cluster, that point is the cluster center.
Step 3a. Calculate the distance between each object and each cluster center, assigning the object to the closest cluster.
For example, for the third object:
dis(1,3) = 10 + 10 = 2 and
dis(2,3) = 10 + 11 = 1,
so this object is assigned to C_{2}.
The fifth object is equidistant from both clusters, so we arbitrarily assign it to C_{1}.
After calculating the distance for all points, the clusters contain the following objects:
C_{1} = {(0,0),(1,0),(0.5,0.5)} and
C_{2} = {(0,1),(1,1),(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}.
Step 3b. Compute new cluster center for each cluster.
New center for C_{1} = (0.5,0.16)
(0+1+0.5)/3 = 0.5
(0+0+0.5)/3 = 0.16
New center for C_{2} = (4.1,4.2)
(0+1+5+5+6+6+5.5)/7 = 4.1
(1+1+5+5+6+6+5.5)/7 = 4.2
Step 3a'. New centers C_{1} = (0.5,0.16) and C_{2} = (4.1,4.2) differ from old centers C_{1}={(0,0)} and C_{2}={(0,1)}, so the loop is repeated.
Reassign the ten objects to the closest cluster center, resulting in:
C_{1} = {(0,0),(0,1),(1,1),(1,0),(0.5,0.5)}
C_{2} = {(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}
Step 3b'. Compute new cluster center for each cluster
New center for C_{1} = (0.5,0.5)
New center for C_{2} = (5.5,5.5)
Step 3a''. New centers C_{1} = (0.5,0.5) and C_{2} = (5.5,5.5) differ from old centers C_{1} = (0.5,0.16) and C_{2} = (4.1,4.2), so the loop is repeated.
Reassign the ten objects to the closest cluster center. Result is same as in Step 5.
Step 3b''. Compute new cluster centers.
Algorithm is done: Centers are the same as in Step 3b', so algorithm is finished. Result is same as in 3b'.
Agglomerative Hierarchical Algorithm
Hierarchical algorithms can be either agglomerative or divisive, that is topdown or bottomup. All agglomerative hierarchical clustering algorithms begin with each object as a separate group. These groups are successively combined based on similarity until there is only one group remaining or a specified termination condition is satisfied. For n objects, n1 mergings are done. Hierarchical algorithms are rigid in that once a merge has been done, it cannot be undone. Although there are smaller computational costs with this, it can also cause problems if an erroneous merge is done. As such, merge points need to be chosen carefully. Here we describe a simple agglomerative clustering algorithm. More complex algorithms have been developed, such as BIRCH and CURE, in an attempt to improve the clustering quality of hierarchical algorithms.
Figure 3: Sample Dendogram
In the context of hierarchical clustering, the hierarchy graph is called a dendogram. Figure 3 shows a sample dendogram that could be produced from a hierarchical clustering algorithm. Unlike with the kmeans algorithm, the number of clusters (k) is not specified in hierarchical clustering. After the hierarchy is built, the user can specify the number of clusters required, from 1 to n. The top level of the hierarchy represents one cluster, or k=1. To examine more clusters, we simply need to traverse down the hierarchy.
Agglomerative Hierarchical Algorithm:
Given:
A set X of objects {x_{1},...,x_{n}} A distance function dis(c_{1},c_{2})

for
i = 1 to n
c_{i} = {x_{i}} end for

C
= {c_{1},...,c_{b}}

l
= n+1

while
C.size > 1 do
a) (c_{min}_{1},c_{min}_{2}) = minimum dis(c_{i},c_{j}) for all c_{i},c_{j} in C b) remove c_{min}_{1} and c_{min}_{2} from C c) add {c_{min}_{1},c_{min}_{2}} to C d) l = l + 1 end while

Figure 4: Agglomerative Hierarchical Algorithm
Figure 4 shows a simple hierarchical algorithm. The distance function in this algorithm can determine similarity of clusters through many methods, including single link and groupaverage. Single link calculates the distance between two clusters as the shortest distance between any two objects contained in those clusters. Groupaverage first finds the average values for all objects in the group (i.e., cluster) and the calculates the distance between clusters as the distance between the average values.
Each object in X is initially used to create a cluster containing a single object. These clusters are successively merged into new clusters, which are added to the set of clusters, C. When a pair of clusters is merged, the original clusters are removed from C. Thus, the number of clusters in C decreases until there is only one cluster remaining, containing all the objects from X. The hierarchy of clusters is implicity represented in the nested sets of C.
Example:
Suppose the input to the simple agglomerative algorithm described above is the set X, shown in Figure 5 represented in matrix and graph form. We will use the Manhattan distance function and the single link method for calculating distance between clusters. The set X contains n=10 elements, x_{1} to x_{10}, where x_{1}=(0,0).
Figure 5: Sample Data
Step 1. Initially, each element x_{i} of X is placed in a cluster c_{i}, where c_{i} is a member of the set of clusters C.
C = {{x_{1}},{x_{2}},{x_{3}}, {x_{4}},{x_{5}},{x_{6}},{x_{7}}, {x_{8}},{x_{9}},{x_{10}}}
Step 2. Set l = 11.
Step 3. (First iteration of while loop) C.size = 10
·
The minimum single link distance between two clusters is 1. This occurs in two places, between c_{2} and c_{10} and between c_{3} and c_{10}.
Depending on how our minimum function works we can choose either pair of clusters. Arbitrarily we choose the first.
(c_{min}_{1},c_{min}_{2}) = (c_{2},c_{10})
·
Since l = 10, c_{11} = c_{2}
U
c
_{10} = {{x_{2}},{x_{10}}}
·
Remove c_{2} and c_{10} from C.
·
Add c_{11} to C.
C = {{x_{1}},{x_{3}}, {x_{4}},{x_{5}},{x_{6}},{x_{7}}, {x_{8}},{x_{9}},{{x_{2}}, {x_{10}}}}
·
Set l = l + 1 = 12
Step 3. (Second iteration) C.size = 9
·
The minimum single link distance between two clusters is 1. This occurs between, between c_{3} and c_{11} because the distance between x_{3} and x_{10} is 1, where x_{10} is in c_{11}.
(c_{min}_{1},c_{min}_{2}) = (c_{3},c_{11})
·
c
_{
12
}
= c_{3}
U
c
_{11} = {{{x_{2}},{x_{10}}},{x_{3}}}
·
Remove c_{3} and c_{11} from C.
·
Add c_{12} to C.
C = {{x_{1}}, {x_{4}},{x_{5}},{x_{6}},{x_{7}}, {x_{8}},{x_{9}},{{{x_{2}}, {x_{10}}}, {x_{3}}}}
·
Set l = 13
Step 3. (Third iteration) C.size = 8
·
(c_{min}_{1},c_{min}_{2}) = (c_{1},c_{12})
·
C
= {{x_{4}},{x_{5}},{x_{6}},{x_{7}}, {x_{8}},{x_{9}},{{{{x_{2}}, {x_{10}}}, {x_{3}}},{x_{1}}}}
Step 3. (Fourth iteration) C.size = 7
·
(c_{min}_{1},c_{min}_{2}) = (c_{4},c_{8})
·
C
= {{x_{5}},{x_{6}},{x_{7}}, {x_{9}},{{{{x_{2}}, {x_{10}}}, {x_{3}}},{x_{1}}},{{x_{4}},{x_{8}}}}
Step 3. (Fifth iteration) C.size = 6
·
(c_{min}_{1},c_{min}_{2}) = (c_{5},c_{7})
·
C
= {{x_{6}}, {x_{9}},{{{{x_{2}}, {x_{10}}}, {x_{3}}},{x_{1}}},{{x_{4}},{x_{8}}}, {{x_{5}},{x_{7}}}}
Step 3. (Sixth iteration) C.size = 5
·
(c_{min}_{1},c_{min}_{2}) = (c_{9},c_{13})
·
C
= {{x_{6}}, {{x_{4}},{x_{8}}}, {{x_{5}},{x_{7}}},{{{{{x_{2}}, {x_{10}}}, {x_{3}}},{x_{1}}},{x_{9}}}}
Step 3. (Seventh iteration) C.size = 4
·
(c_{min}_{1},c_{min}_{2}) = (c_{6},c_{15})
·
C
= {{{x_{4}},{x_{8}}}, {{{{{x_{2}}, {x_{10}}}, {x_{3}}},{x_{1}}},{x_{9}}},{{x_{6}}, {{x_{5}},{x_{7}}}}}
Step 3. (Eighth iteration) C.size = 3
·
(c_{min}_{1},c_{min}_{2}) = (c_{14},c_{16})
·
C
= { {{x_{6}}, {{x_{5}},{x_{7}}}}, {{{x_{4}},{x_{8}}}, {{{{{x_{2}}, {x_{10}}}, {x_{3}}},{x_{1}}},{x_{9}}}}}
Step 3. (Ninth iteration) C.size = 2
·
(c_{min}_{1},c_{min}_{2}) = (c_{17},c_{18})
·
C
= {{{{x_{4}},{x_{8}}}, {{{{{x_{2}}, {x_{10}}}, {x_{3}}},{x_{1}}},{x_{9}}}}, {{x_{6}}, {{x_{5}},{x_{7}}}}}
Step 3. (Tenth iteration) C.size = 1. Algorithm done.
The cluster created from this algorithm can be seen in Figure 6. The corresponding dendogram formed from the hierarchy in C is shown in Figure 7. The points which appeared most closely together on the graph of input data in Figure 5 are grouped together more closely in the hierarchy.
Figure 6: Graph with Clusters
Figure 7: Sample Dendogram
凡是有该标志的文章，都是该blog博主Caoer（草儿）原创，凡是索引、收藏
、转载请注明来处和原文作者。非常感谢。