递归树构建中的无限递归

Question

我遇到了一个有趣的问题，跟我写的一个递归的Python函数有关。我可能漏掉了一些细微的基本情况！下面的代码目的是构建一个决策树（用来判断数据趋势的结构）。这个树是通过嵌套Python列表来实现的，返回值和result列表就是在做这个。结构本身似乎不是特别重要。

 def tdidt(dataset,attr_list,class_index,class_labels):
    print attr_list
    
    #base case 1: all the same class
    if all_same_class(dataset, class_index):
        class_label = majority_vote(dataset, class_index)
        return ['c',class_label]
    #base case 2: no more attributes
    elif attr_list == []:
        print 'i see and empty list!!!'
        class_label = majority_vote(dataset, class_index)
        return ['c',class_label]
    else:
        #else select attribute and partition dataset 
        attr = select_attribute(dataset, list(attr_list), class_index, class_labels)
        partitions = partition_on_attr(dataset, attr)
    
        #base case 3: one of the partitions is empty
        if has_empty_partition(partitions):
            class_label = majority_vote(dataset, class_index)
            return ['c',class_label]
        else:
            #enable respective data attribute labels for easier to read tree
            #result = result + ['a',titanic_header[attr]]
            #result = result + ['a',auto_header[attr]]
    
            #add attribute node to tree
            result = ['a',attr]
            #remove used attribute
            removed_attr_list = [x for x in list(attr_list) if x != attr]
    
            for partition in partitions.values():
                #add value node for tree, recursive call for next attributes
                result.append(["v",partition[0][attr],tdidt(partition, list(removed_attr_list), class_index, class_labels)]) 
        return result

attr_list（或者跟attr_list有关的东西）看起来是最可能出问题的地方（请看我下面的抱怨和错误信息）。attr_list包含了一系列属性的索引，我会一个一个地从列表中选择这些属性。

这个递归函数有三个基本情况：

1（全是同一类）：

数据集中每一行（属性列表）在class_index索引的行中都有一个共同的值。

2（没有更多属性）：

参数attr_list是空的。这应该是最常见的基本情况。

3（一个分区是空的）：

数据通过partition_on_attr()进行分区，其中一个分区（行列表）是空的，无法继续。

递归调用的那一行是最长的（导致水平滚动条的那一行），在这里调用了tdidt()函数，作为附加列表的一部分。赋值给removed_attr_list的列表推导的目的是生成一个不包含刚刚从attr_list中选择的属性的属性列表。

好吧，关于我遇到的错误，情况很严重。我达到了递归的极限，而且原因很奇怪。我在函数调用时立即打印attr_list，并且每当出现空列表时也会打印。输出是这样的……

[0, 1, 2]
[0, 1]
[1]
[]
i see and empty list!!!
[] 
i see and empty list!!!
[1]
[]
i see and empty list!!!
[]
[1]
[1]
[1]
[1] 
[1]
[1]
.
.
.

重点在于省略号。 [1]的输出不断出现，直到递归达到最大限制错误。

RuntimeError: maximum recursion depth exceeded in cmp

一开始的行为正是我预期的；attr_list被用完了，从递归调用的for循环中创建了几个分支，然后基本情况出现了，提示“我看到一个空列表！”，接着[1]的输出开始流出！

我是不是漏掉了某个递归情况？attr_list是怎么又变回[1]并且一直保持这个状态的？任何帮助都非常感谢，感谢你能看到这篇长文的最后！

-----------------------------更新 2014年10月24日-----------------------------

在每次递归调用时打印dataset变量，我发现它的行为和attr_list参数类似。事情按预期进行，然后dataset参数变成了这一行，不断重复……

[['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'yes'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'yes'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'yes'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no']]

这只有一行！（还有一些数据的奇怪混乱）

这到底是怎么发生的？这个函数似乎在完成它的过程后会出现某种除零错误……我的递归一定有什么问题。

-----------------------------更新 2014年10月25日-----------------------------

好吧，我们有了新的发现。@njlarsson建议检查当函数partition_on_attr()返回与接收的相同数据集时的行为，这意味着所选属性在整个数据集中是一致的，没有进行分区。我最初的想法是下一个递归调用会有一个更小的attr_list，最终空的attr_list基本情况会捕捉到这个分区。我显然错了，这里有几个我运行的基本测试案例和它们的共同点……

第一个测试数据集有两个实例，前面三个属性是要进行分割的属性，第四个是我们要猜测的属性。由于所有的分割属性都不相等，因此partition_on_attr()函数返回整个数据集作为一个分区的情况是不会发生的。

play = [[0,0,0,1],[1,1,1,0]]

输出是一个很棒的决策树（虽然不太好看，但这不是重点）

['a', 0, ['v', 0, ['c', 1]], ['v', 1, ['c', 0]]]

现在我们试一个不同的测试数据集，跟第一个测试集类似 - 也是两个实例，结构相同。

play = [[1,1,1,1],[1,1,1,0]]

在这个数据集中，前面三个属性，即要进行分割的属性，都是相等的。因此partition_by_attr()函数必须返回一个与传入数据集相同的单一分区。最大递归错误发生了！

@njlarsson，你真是发现了关键！我只是不明白为什么会这样。在partition_on_attr函数返回整个数据集作为一个分区的情况下，算法应该怎么做？

暂无标签

递归树构建中的无限递归

1（全是同一类）：

2（没有更多属性）：

3（一个分区是空的）：

2 个回答

撰写回答