Theory tree
1. Decision tree model and learning: the model itself, decision trees as if-then rules, decision trees as conditional probability distributions, and the learning problem
2. Feature selection: defining the feature selection problem; selection relies on information gain or the information gain ratio (see the sketch after this list)
3. Decision tree generation:
ID3 algorithm: tree generation only, with no pruning step
C4.5 algorithm: ID3 with the information gain ratio in place of information gain
4. Decision tree pruning: minimize the penalized loss C_α(T) = C(T) + α|T|
5. The CART algorithm (the culmination):
Generation splits into: regression trees, grown by least squares and partitioning of the input space
classification trees, grown ID3-style but scored by the Gini index
Pruning: build a sequence of nested subtrees, then select among them by cross-validation
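The selection criteria named in items 2, 3 and 5 are short enough to write out directly. A minimal NumPy sketch of my own (an illustration of the formulas, not code from the book):

import numpy as np

def entropy(v):
    # Empirical entropy H(D) = -sum_k p_k * log2(p_k) over value frequencies
    _, c = np.unique(v, return_counts=True)
    p = c / c.sum()
    return -(p * np.log2(p)).sum()

def info_gain(a, y):
    # Information gain g(D, A) = H(D) - H(D|A)
    cond = sum((a == v).mean() * entropy(y[a == v]) for v in np.unique(a))
    return entropy(y) - cond

def gain_ratio(a, y):
    # Gain ratio g_R(D, A) = g(D, A) / H_A(D), where H_A(D) is the entropy of A's own split
    return info_gain(a, y) / entropy(a)

def gini(v):
    # Gini index Gini(D) = 1 - sum_k p_k^2
    _, c = np.unique(v, return_counts=True)
    p = c / c.sum()
    return 1 - (p ** 2).sum()

feat = np.array(["sunny", "sunny", "rain", "rain"])
labels = np.array(["no", "yes", "yes", "yes"])
print(info_gain(feat, labels), gain_ratio(feat, labels), gini(labels))  # ~0.311, ~0.311, 0.375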
Exercises
1. Using the training data given in Table 5.1, generate a decision tree via the information gain ratio (the C4.5 algorithm).
Answer:
Imports and dataset
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import graphviz
features = ["年龄", "有工作", "有自己的房子", "信贷情况"]
x = pd.DataFrame([
["青年", "否", "否", "一般"],
["青年", "否", "否", "好"],
["青年", "是", "否", "好"],
["青年", "是", "是", "一般"],
["青年", "否", "否", "一般"],
["中年", "否", "否", "一般"],
["中年", "否", "否", "好"],
["中年", "是", "是", "好"],
["中年", "否", "是", "非常好"],
["中年", "否", "是", "非常好"],
["老年", "否", "是", "非常好"],
["老年", "否", "是", "好"],
["老年", "是", "否", "好"],
["老年", "是", "否", "非常好"],
["老年", "否", "否", "一般"]
], columns=features)
y = pd.Series(["否", "否", "是", "是", "否",
               "否", "否", "是", "是", "是",
               "是", "是", "是", "是", "否"])  # a 1-D Series avoids sklearn's column-vector warning
class_names = [str(k) for k in np.unique(y)]
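sklearn does not implement C4.5's gain-ratio criterion, so as a cross-check the gain ratio of each feature at the root can be computed by hand on the raw columns, before one-hot encoding. This reuses the gain_ratio helper from the sketch under the theory tree (my addition, not part of the original answer):

# Gain ratio of each feature on the full training set (the root split)
for f in features:
    print(f, round(gain_ratio(x[f].to_numpy(), y.to_numpy()), 3))
# 有自己的房子 scores highest, so C4.5 also splits on it first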
Preprocessing (sklearn trees need numeric input, so one-hot encode the categorical features)
x = pd.get_dummies(x)
features = list(x.columns)
Training and drawing the tree with graphviz
model_tree = DecisionTreeClassifier(criterion="entropy")  # closest to C4.5 in sklearn: CART splits scored by entropy; the gain ratio itself is not available
model_tree.fit(x, y)
dot_data = tree.export_graphviz(model_tree, out_file=None,
feature_names=x.columns,
class_names=class_names,
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph  # in a notebook, this renders the tree inline
tree_text = tree.export_text(model_tree, feature_names=features)
print(tree_text)
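Items 4 and 5 of the theory tree describe pruning through a penalized loss and a subtree sequence; in sklearn this corresponds to minimal cost-complexity pruning. A small sketch of the subtree sequence for this data (my addition, using sklearn's documented cost_complexity_pruning_path):

# Each ccp_alpha is the penalty weight at which one more subtree collapses,
# i.e. the minimizer of impurity + alpha * n_leaves, the analogue of C(T) + alpha|T|
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(x, y)
for a in path.ccp_alphas:
    pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=a).fit(x, y)
    print(round(float(a), 4), "leaves:", pruned.get_n_leaves())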
2. Given the training data shown in Table 5.2, use the squared-error loss criterion to generate a binary regression tree.
Answer:
I only know the concise implementation (using sklearn directly)
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
x = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]).T
y = np.array([4.50, 4.75, 4.91, 5.34,
5.80, 7.05, 7.90, 8.23, 8.70, 9.00])
rtree = DecisionTreeRegressor(max_depth=3)  # the default criterion is squared error, as the exercise asks
rtree.fit(x, y)
x_dot = np.arange(0, 11, 0.01).reshape(-1, 1)
y_line = rtree.predict(x_dot)
plt.figure()
plt.scatter(x, y)
plt.plot(x_dot, y_line)
plt.title('DecisionTreeRegressor');
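To connect this back to the least-squares criterion itself, here is a hand computation of the root split: scan candidate split points s and keep the one minimizing the summed squared error of the two regions, each fit by its mean. This is a sketch of mine; rtree.tree_.threshold is sklearn's documented attribute for the learned split thresholds.

# CART regression split at the root: minimize
#   sum_{x_i <= s} (y_i - c1)^2 + sum_{x_i > s} (y_i - c2)^2,
# where the optimal c1, c2 are the means of the two regions
xs = x.ravel()
best_s, best_loss = None, np.inf
for s in (xs[:-1] + xs[1:]) / 2:  # candidate splits at midpoints between samples
    left, right = y[xs <= s], y[xs > s]
    loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    if loss < best_loss:
        best_s, best_loss = s, loss
print(best_s, rtree.tree_.threshold[0])  # the hand-picked s should match the tree's root threshold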