# 人在江湖


The book *Data Mining: Concepts and Techniques* explains the decision-tree construction process very clearly; the book can be found in an earlier post: http://www.blogjava.net/vcycyv/archive/2011/09/05/357967.html

```
create a node N;
if the tuples in D are all of the same class C then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;  // majority voting
apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then  // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute;  // remove splitting_attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j;  // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
return N;
```
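The pseudocode above can be sketched in Python. This is a minimal illustration, not the book's implementation: `select_attribute` is a hypothetical stand-in for `Attribute_selection_method`, and tuples are represented as `(features_dict, class_label)` pairs.

```python
from collections import Counter

def generate_decision_tree(D, attribute_list, select_attribute):
    """Sketch of Generate_decision_tree. D is a list of
    (features_dict, class_label) tuples; select_attribute is a hypothetical
    stand-in for Attribute_selection_method that returns the name of the
    "best" discrete-valued splitting attribute."""
    labels = [label for _, label in D]
    # all tuples in the same class: return a leaf labeled with that class
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # attribute list exhausted: return a leaf labeled with the majority class
    if not attribute_list:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # choose the "best" splitting attribute and label the node with it
    attr = select_attribute(D, attribute_list)
    node = {"attribute": attr, "branches": {}}
    # discrete-valued attribute with a multiway split: remove it from the list
    remaining = [a for a in attribute_list if a != attr]
    # partition D on each observed outcome and grow a subtree per partition
    # (iterating only over observed values, so no partition is ever empty)
    for value in {features[attr] for features, _ in D}:
        Dj = [(f, c) for f, c in D if f[attr] == value]
        node["branches"][value] = generate_decision_tree(Dj, remaining, select_attribute)
    return node
```

Because the sketch only branches on values that actually occur in D, the empty-partition case in the pseudocode never arises here.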

Information gain:

Information gain can be computed like this:

The class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (that is, m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. We first compute the expected information needed to classify a tuple in D:

Info(D) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940 bits

Next, we need to compute the expected information requirement for each attribute. Let's start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age. For the age category youth, there are two yes tuples and three no tuples. For the category middle_aged, there are four yes tuples and zero no tuples. For the category senior, there are three yes tuples and two no tuples. Writing I(a, b) for the expected information of a partition with a yes tuples and b no tuples:

Info_age(D) = (5/14)·I(2, 3) + (4/14)·I(4, 0) + (5/14)·I(3, 2) = 0.694 bits

Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
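The numbers in this example can be reproduced in a few lines of Python. This is a sketch using only the class counts quoted above (the full 14-tuple buys_computer table is in the book):

```python
from math import log2

def info(counts):
    """Expected information (entropy, in bits) of a class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# D holds 9 "yes" and 5 "no" tuples
info_D = info([9, 5])                                        # about 0.940 bits

# age partitions D into youth (2 yes, 3 no), middle_aged (4, 0), senior (3, 2)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in partitions)    # about 0.694 bits

gain_age = info_D - info_age                                 # about 0.246 bits
```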

Gain ratio in brief:

Information gain handles many-valued attributes poorly. For example, if there is an attribute like product_id, splitting on it puts every tuple into its own partition, so every partition is pure and the post-split Info(D) is 0; splitting on product_id therefore yields the largest gain, yet such a split is obviously meaningless. Gain ratio adjusts information gain by computing a ratio rather than a subtraction: the gain is divided by the split information, which grows with the number of partitions. The book works through a concrete example, so I won't go into detail here.
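As a sketch of the idea, using the age distributions quoted above (the split information formula is the standard SplitInfo_A(D), an assumption here since the post does not spell it out):

```python
from math import log2

def entropy(counts):
    """Expected information (in bits) of a distribution of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def split_info(partition_sizes):
    """SplitInfo_A(D): the entropy of the partition sizes themselves."""
    return entropy(partition_sizes)

# age splits the 14 tuples into partitions of sizes 5 (youth), 4 (middle_aged),
# and 5 (senior), with class distributions (2,3), (4,0), (3,2)
parts = [[2, 3], [4, 0], [3, 2]]
gain_age = entropy([9, 5]) - sum(sum(p) / 14 * entropy(p) for p in parts)
gain_ratio_age = gain_age / split_info([5, 4, 5])    # about 0.156

# a many-valued attribute like product_id is heavily penalized: with 14
# singleton partitions, SplitInfo = log2(14), about 3.81 bits
split_info_product_id = split_info([1] * 14)
```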

Gini index:

The Gini index measures the impurity of D. As mentioned above, this method uses binary splits. Here is an example of a binary split:

if income has three possible values, namely {low, medium, high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}. We exclude the power set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split.

there are nine tuples belonging to the class buys_computer = yes and the remaining five tuples belong to the class buys_computer = no.

To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute. Let's start with the attribute income and consider each of the possible splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in partition D1 satisfying the condition "income ∈ {low, medium}." The remaining four tuples of D would be assigned to partition D2. The Gini index value computed based on this partitioning is the size-weighted sum of the two partitions' Gini indexes:

Gini_income ∈ {low, medium}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2)
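A small sketch of the computation. The class distributions inside D1 and D2 are not given in this excerpt, so `weighted_gini` takes them as input; only the counts quoted above (9 yes / 5 no, three income values) are used directly:

```python
from itertools import combinations

def gini(counts):
    """Gini index of a class distribution: 1 minus the sum of squared
    class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# impurity of D itself: 9 "yes" vs. 5 "no" tuples
gini_D = gini([9, 5])                                # about 0.459

# candidate binary splits on income: every proper non-empty subset of its
# values (a subset and its complement describe the same split)
values = ["low", "medium", "high"]
subsets = [set(c) for r in (1, 2) for c in combinations(values, r)]  # 6 subsets

def weighted_gini(partition_counts):
    """Gini index of a binary split: the size-weighted sum of the
    partitions' Gini indexes."""
    total = sum(sum(p) for p in partition_counts)
    return sum(sum(p) / total * gini(p) for p in partition_counts)
```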

Prepruning: while the tree is being split, a statistic such as information gain or the Gini index is computed along the way; as soon as the computed value fails to reach a threshold, splitting stops and the node becomes a leaf.

Postpruning: after the tree has been fully grown, branches are pruned bottom-up. The pruning algorithm is called cost complexity, and it is based mainly on the number of leaves and the error rate: for a given node, compute the cost complexity with and without pruning it, and prune if the cost complexity is smaller after pruning. Note that cost complexity should be computed on a separate data set, using neither the training data set nor the validation data set.
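The prune-or-keep comparison can be sketched as follows. The formula below (errors plus a per-leaf penalty `alpha`) is one common formalization of cost complexity and is an assumption here, since the post does not give the exact formula:

```python
def cost_complexity(errors, num_leaves, alpha):
    """One common formalization of cost complexity (an assumption here):
    misclassification errors on the pruning set plus a per-leaf penalty."""
    return errors + alpha * num_leaves

def should_prune(subtree_errors, subtree_leaves, leaf_errors, alpha):
    """Prune a node when collapsing its subtree into a single leaf does not
    increase the cost complexity."""
    return cost_complexity(leaf_errors, 1, alpha) <= cost_complexity(
        subtree_errors, subtree_leaves, alpha)
```

With a larger `alpha`, bushier subtrees pay a higher penalty, so more nodes get collapsed into leaves.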

The book *Data Mining Techniques for Marketing, Sales, and Customer Relationship Management* raises two points worth noting:

1. The two leaf nodes produced by a split may belong to the same category; only their probabilities differ, but both pass the threshold.

2. A binary tree can classify a target with multiple categories; conversely, a multiway tree can also classify a binary target.

posted on 2011-09-18 23:27 by 人在江湖 (category: BI)
