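Can someone help me understand why my recursive function isn't working? I wrote a function that computes the information gain for a feature to be split on, and I call it from my decision tree function. However, the decision tree function only seems to iterate once.
The information gain function is: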
import pandas as pd

# create function to split dataset on maximum gain ratio
def gain_ratio_split(features, target):
    from operator import itemgetter
    import numpy as np
    target_ent = -sum((target.value_counts()/len(target)) * np.log2((target.value_counts()/len(target))))
    gainratios = []
    num_cols = features.shape[1]
    for i in range(num_cols):
        column_name = features.columns[i]
        single_column_df = pd.DataFrame(features[column_name])
        duo_column_df = single_column_df.merge(target, left_index=True, right_index=True).sort_values(by=column_name)
        for j in range(len(duo_column_df)):
            sub_df_1 = duo_column_df.head(j)
            sub_df_2 = duo_column_df.tail(len(duo_column_df)-j)
            sub_df_1_ent = -sum(sub_df_1.iloc[:,1].value_counts()/len(sub_df_1) * np.log2(sub_df_1.iloc[:,1].value_counts()/len(sub_df_1)))
            sub_df_2_ent = -sum(sub_df_2.iloc[:,1].value_counts()/len(sub_df_2) * np.log2(sub_df_2.iloc[:,1].value_counts()/len(sub_df_2)))
            gain = target_ent - ((len(sub_df_1)/len(duo_column_df)) * sub_df_1_ent) - ((len(sub_df_2)/len(duo_column_df)) * sub_df_2_ent)
            splitinfo = -( (len(sub_df_1)/len(duo_column_df)) * np.log2( (len(sub_df_1)/len(duo_column_df)) )) - \
                ( (len(sub_df_2)/len(duo_column_df)) * np.log2( (len(sub_df_2)/len(duo_column_df)) ))
            gainratio = np.nan_to_num(gain / splitinfo)
            gainratios.append((column_name, gainratio))
    split_col = max(gainratios,key=itemgetter(1))[0]
    split_val = max(gainratios,key=itemgetter(1))[1]
    return split_col, split_val
The decision tree function is:
import warnings
warnings.filterwarnings('ignore')

def dec_t(features, target, depth = 0):
    depth = 0
    edge = []
    num_feats = features.shape[1]
    if depth <= 4 and num_feats > 1:
        parent = gain_ratio_split(features, target)
        child1 = features[features[parent[0]] >= parent[1]]
        child2 = features[features[parent[0]] < parent[1]]
        # create new dataset to feed into recursive loop (removing feature previously used for splitting)
        c1 = child1.loc[:, child1.columns != parent[0]]
        c2 = child2.loc[:, child2.columns != parent[0]]
        # get appropriate target values for new dataset
        tar1 = pd.DataFrame(c1.merge(target, left_index=True, right_index=True).iloc[:,-1])
        tar2 = pd.DataFrame(c2.merge(target, left_index=True, right_index=True).iloc[:,-1])
        # begin recursion on left and right edges of node
        left = dec_t(c1, tar1, depth + 1)
        right = dec_t(c2, tar2, depth + 1)
        # append splitting values to list so I can print into a tree later
        edge.append(left)
        edge.append(right)
        depth += 1
    return edge
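When I call the decision tree function with dec_t(X_train_norm, y_train), it either returns an empty list (edge) or, when I use the depth counter (which is currently wrong, though I can't work out why), I get the following error: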
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-193-a790156c1013> in <module>()
----> 1 dec_t(X_train_norm, y_train)
5 frames
<ipython-input-9-0a75ea1fa270> in gain_ratio_split(features, target)
23 gainratios.append((column_name, gainratio))
24
---> 25 split_col = max(gainratios,key=itemgetter(1))[0]
26 split_val = max(gainratios,key=itemgetter(1))[1]
27
ValueError: max() arg is an empty sequence
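I have manually computed every possible output of the information gain function on the dataset I'm using, and none of them comes back empty. Any help would be greatly appreciated.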
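One plausible reading of the traceback: the function never fails on the full dataset, but a recursive call can receive a child frame with zero rows. In that case the inner `for j in range(len(duo_column_df))` loop never executes, `gainratios` stays empty, and `max()` raises. This is made more likely because `split_val` is the gain ratio itself rather than a feature value, so using it as a threshold can send every row to one side. The sketch below reproduces the effect with made-up data; the column names `x` and `y` and the threshold `5.0` are placeholders, not taken from the original dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real features and target.
features = pd.DataFrame({"x": [0.1, 0.4, 0.9]})
target = pd.DataFrame({"y": [0, 1, 1]})

# A threshold that no row satisfies yields an empty child frame.
empty_child = features[features["x"] >= 5.0]

gainratios = []
for column_name in empty_child.columns:
    merged = empty_child[[column_name]].merge(target, left_index=True, right_index=True)
    for j in range(len(merged)):  # len(merged) is 0, so this body never runs
        gainratios.append((column_name, 0.0))

# With nothing appended, max() has no element to return and raises ValueError.
try:
    max(gainratios, key=lambda pair: pair[1])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(len(empty_child), gainratios, error_message)
```

Printing `len(features)` at the top of `gain_ratio_split` in the real code would be a quick way to confirm whether this is what happens.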
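After a bit of work, I finally solved this. I changed the gain ratio split function so that it also returns the midpoint value on which to split the feature. I also added an exception for the case where it receives an empty DataFrame. Adding the exception handling was a good learning experience.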
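The updated function itself is not shown here, so the following is only a sketch of what a midpoint-based version might look like, not the code actually used. It assumes numeric features without missing values and a `target` Series sharing the features' index; the names `entropy` and `gain_ratio_split_midpoint` are hypothetical.

```python
import numpy as np
import pandas as pd


def entropy(labels):
    """Shannon entropy (in bits) of a pandas Series of class labels."""
    probs = labels.value_counts(normalize=True)
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())


def gain_ratio_split_midpoint(features, target):
    """Return (column, threshold) with the highest gain ratio.

    Raises ValueError for an empty frame or when no column can be split.
    """
    if len(features) == 0:
        raise ValueError("cannot split an empty DataFrame")

    target_ent = entropy(target)
    n = len(features)
    candidates = []  # tuples of (gain_ratio, column, threshold)

    for column_name in features.columns:
        values = np.sort(features[column_name].unique())
        # Candidate thresholds: midpoints between consecutive distinct values.
        midpoints = (values[:-1] + values[1:]) / 2.0
        for threshold in midpoints:
            left_mask = features[column_name] < threshold
            left, right = target[left_mask], target[~left_mask]
            if left.empty or right.empty:
                continue  # guard against degenerate splits
            w_left, w_right = len(left) / n, len(right) / n
            gain = target_ent - w_left * entropy(left) - w_right * entropy(right)
            split_info = -(w_left * np.log2(w_left) + w_right * np.log2(w_right))
            candidates.append((gain / split_info, column_name, float(threshold)))

    if not candidates:
        raise ValueError("no feature has two distinct values to split on")

    _, best_col, best_threshold = max(candidates, key=lambda c: c[0])
    return best_col, best_threshold


# Toy data: column "a" separates the classes, column "b" is constant.
demo_X = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
demo_y = pd.Series([0, 0, 1, 1])
best = gain_ratio_split_midpoint(demo_X, demo_y)
print(best)
```

Because the returned threshold is a value in the feature's own range, comparing `features[column] < threshold` splits the rows meaningfully, which is not the case when the gain ratio is used as the threshold.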
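The actual decision tree building function was updated to set depth = depth instead of resetting it, and I also changed it to split on the midpoint value as described above.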
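The revised tree function is not included either. As a rough illustration of the depth fix, the sketch below passes `depth + 1` into each recursive call and never reassigns `depth` inside the body. Everything here is an assumption for illustration: `build_tree`, `first_column_split`, the nested-dict output, and the `max_depth` parameter are hypothetical, and unlike the original `dec_t` this version does not drop the feature after splitting on it.

```python
import pandas as pd


def first_column_split(features, target):
    """Placeholder split rule for the demo: halfway between min and max
    of the first column that has more than one distinct value."""
    for column in features.columns:
        lo, hi = features[column].min(), features[column].max()
        if lo < hi:
            return column, (lo + hi) / 2.0
    raise ValueError("no column can be split")


def build_tree(features, target, choose_split, depth=0, max_depth=4):
    """Recursively build a nested-dict tree.

    `choose_split(features, target)` returns (column, threshold) or raises
    ValueError when no split is possible.
    """
    # Stop on the depth limit, or when the node is pure or empty.
    if depth >= max_depth or target.nunique() <= 1:
        return {"leaf": target.mode().iloc[0] if len(target) else None}

    try:
        column, threshold = choose_split(features, target)
    except ValueError:
        return {"leaf": target.mode().iloc[0]}

    left_mask = features[column] < threshold
    return {
        "column": column,
        "threshold": threshold,
        # depth + 1 is handed to the children; depth is never reset here.
        "left": build_tree(features[left_mask], target[left_mask],
                           choose_split, depth + 1, max_depth),
        "right": build_tree(features[~left_mask], target[~left_mask],
                            choose_split, depth + 1, max_depth),
    }


demo_X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})
demo_y = pd.Series([0, 0, 1, 1])
tree = build_tree(demo_X, demo_y, first_column_split)
print(tree)
```

Any function with the same `(features, target) -> (column, threshold)` contract, such as a gain-ratio based one, could be passed in place of `first_column_split`.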