
Clustering is an important area of application for a variety of fields, including data mining [FPSU96], statistical data analysis [KR89, BR93, FHS96], compression [ZRL97], vector quantization, and other business applications [B*96]. Clustering has been formulated in various ways in the machine learning [F87], pattern recognition [DH73, F90], optimization [BMS97, SI84], and statistics literature [KR89, BR93, B95, S92, S86]. The fundamental clustering problem is that of grouping together (clustering) similar data items.

The most general approach is to view clustering as a density estimation problem [S86, S92, BR93]. We assume that in addition to the observed variables for each data item, there is a hidden, unobserved variable indicating the "cluster membership". The data is assumed to arrive from a mixture model with hidden cluster identifiers. In general, a mixture model M having K clusters Ci, i = 1, …, K, assigns a probability to a data point x:

$$\Pr(x \mid M) = \sum_{i=1}^{K} W_i \, \Pr(x \mid C_i),$$

where the Wi are the mixture weights. The problem is to estimate the parameters of the individual Ci, assuming that the number of clusters K is known. The clustering optimization problem is that of finding parameters of the individual Ci which maximize the likelihood of the database given the mixture model. For general assumptions on the distributions of the K clusters, the EM algorithm [DLR77, CS96] is a popular technique for estimating the parameters. The assumptions made by the classic K-Means algorithm are: 1) each cluster can be effectively modeled by a spherical Gaussian distribution, 2) each data item is assigned to exactly one cluster, 3) the mixture weights Wi are assumed equal. Note that K-Means [DH73, F90] is only defined over numeric (continuous-valued) data, since the ability to compute the mean is required. K-Means can be viewed as a discrete version of EM and is sometimes referred to as hard EM. The K-Means algorithm finds a locally optimal solution to the problem of minimizing the sum of squared L2 distances between each data point and its nearest cluster center ("distortion") [SI84], which is equivalent to maximizing the likelihood under the assumptions listed above.
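To make the mixture-model view concrete, here is a minimal sketch (Python with NumPy; all function names are illustrative, not from the paper) that evaluates the mixture density above for spherical Gaussian clusters, the distributional assumption behind K-Means:

```python
import numpy as np

def spherical_gaussian_pdf(x, mean, sigma):
    """Density of an isotropic Gaussian N(mean, sigma^2 * I) at point x."""
    d = x.shape[0]
    sq_dist = np.sum((x - mean) ** 2)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm

def mixture_density(x, means, sigmas, weights):
    """Pr(x | M) = sum_i W_i * Pr(x | C_i) for a K-cluster mixture."""
    return sum(w * spherical_gaussian_pdf(x, m, s)
               for m, s, w in zip(means, sigmas, weights))

# Toy example: K = 2 spherical clusters in 2-D with equal mixture weights,
# matching K-Means assumptions 1) and 3).
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
sigmas = [1.0, 1.0]
weights = [0.5, 0.5]
print(mixture_density(np.array([0.5, -0.2]), means, sigmas, weights))
```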

There are various approaches to solving the optimization problem. The iterative refinement approaches, which include EM and K-Means, are the most effective. The basic algorithm is as follows (a minimal sketch of the loop appears after the list):

1) Initialize the model parameters, producing a current model.
2) Decide memberships of the data items to clusters, assuming that the current model is correct.
3) Re-estimate the parameters of the current model assuming that the memberships obtained in 2) are correct, producing a new model.
4) If the current model and the new model are sufficiently close to each other, terminate; otherwise go to 2).
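A sketch of this generic loop, assuming the steps are supplied as callables (the names `assign_memberships`, `reestimate`, and `close_enough` are hypothetical placeholders, standing in for steps 2–4 of whichever algorithm, EM or K-Means, is being run):

```python
def iterative_refinement(data, init_model, assign_memberships, reestimate,
                         close_enough, max_iters=100):
    """Generic EM / K-Means style iterative refinement.

    assign_memberships(model, data) -> memberships   (step 2)
    reestimate(memberships, data)   -> new model     (step 3)
    close_enough(old, new)          -> bool          (step 4)
    """
    model = init_model                                  # step 1: initialize
    for _ in range(max_iters):
        memberships = assign_memberships(model, data)   # step 2
        new_model = reestimate(memberships, data)       # step 3
        if close_enough(model, new_model):              # step 4: converged
            return new_model
        model = new_model
    return model
```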

K-Means parameterizes cluster Ci by the mean of all points in that cluster, hence the model update step 3) consists of computing the mean of the points assigned to a given cluster. The membership step 2) consists of assigning data points to the cluster with the nearest mean measured in the L2 metric.
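Instantiating the two steps for K-Means, a minimal NumPy sketch (illustrative only, not the paper's implementation) might look like this:

```python
import numpy as np

def assign_step(points, means):
    """Step 2: assign each point to the cluster with the nearest mean (L2)."""
    # distances[n, k] = squared L2 distance from point n to mean k
    distances = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def update_step(points, labels, k):
    """Step 3: recompute each cluster mean from its assigned points.

    Assumes, for simplicity, that no cluster ends up empty.
    """
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])

def kmeans(points, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize means with k distinct data points.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = assign_step(points, means)
        new_means = update_step(points, labels, k)
        if np.allclose(means, new_means):   # step 4: converged
            break
        means = new_means
    return means, labels
```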

We focus on the problem of clustering very large databases: those too large to be "loaded" into RAM, so that the data scan required at each iteration is extremely costly. We focus on the K-Means algorithm, although the method can be extended to accommodate other algorithms [BFR98]. K-Means is a well-known algorithm, originally known as Forgy's method [F65, M67], and has been used extensively in pattern recognition [DH73, F90]. It is a standard technique in a wide array of applications, used even as a way to initialize the more expensive EM clustering [B95, CS96, MH98, FRB98, BF98].
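The scaled-up method of [BFR98] is not detailed in this excerpt, but one reason the per-iteration scan dominates is easy to see: the K-Means update needs only per-cluster sums and counts, which can be accumulated while streaming the data from disk in chunks rather than holding it all in RAM. A hedged sketch under that assumption (Python; the chunking scheme and names are illustrative, not the BFR method itself):

```python
import numpy as np

def kmeans_pass_out_of_core(chunks, means):
    """One K-Means iteration over data streamed chunk-by-chunk from disk.

    `chunks` is any iterable yielding NumPy arrays of shape (n_i, d);
    only per-cluster sums and counts are held in RAM, never the full data.
    """
    k, d = means.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=np.int64)
    for chunk in chunks:
        dists = ((chunk[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            sums[j] += chunk[mask].sum(axis=0)
            counts[j] += mask.sum()
    # Assumes no cluster ends up empty during this pass.
    return sums / counts[:, None]
```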




posted on 2006-06-24 13:54 by 草儿 | Filed under: BI and DM
