优化包含多个属性和字典查找的Python代码

3 投票

4 回答

1540 浏览

提问于 2025-04-15 21:15

我写了一个Python程序，这个程序在查找对象的属性和字典的键值时花费了很多时间。我想知道有没有办法优化这些查找时间，可能通过C扩展来减少执行时间，或者我是否需要把程序重新用编译语言实现。

这个程序使用图来实现一些算法。在我们的数据集上运行得非常慢，所以我用一个可以完成的较小数据集对代码进行了性能分析，使用了cProfile。大部分时间都花在了一个函数里，特别是在这个函数中的两个生成器表达式上：

第202行的生成器表达式是

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

第204行的生成器表达式是

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

下面提供了这个函数的源代码以供参考。

selected_nodes是一个set，包含了interaction_graph中的节点，而interaction_graph是一个NetworkX Graph实例。node_neighbors是来自Graph.neighbors_iter()的一个迭代器。

Graph本身使用字典来存储节点和边。它的Graph.node属性是一个字典，存储节点及其属性（例如，'weight'）在每个节点的字典中。

每次查找理论上应该是常数时间（也就是O(1)），但我仍然在查找上付出了很大的代价。有没有什么方法可以加速这些查找（例如，通过将部分代码写成C扩展），还是说我需要把程序转移到编译语言上？

下面是提供上下文的函数的完整源代码；大部分执行时间都花在这个函数里。

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

根据cProfiler的分析，这里是按时间排序的前几个调用。

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 {max}
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 {method 'update' of 'set' objects}

性能分析编译语言字典查找 c扩展生成器表达式优化网络算法常数时间

4 个回答

试着直接访问字典，然后捕获可能出现的KeyErrors（键错误），这样做可能会更快，具体要看你命中和未命中的比例：

# cache this object
ignode = interaction_graph.node
neighbor_z_scores = []
for neighbor in node_neighbors:
    try:
        neighbor_z_scores.append(ignode[neighbor]['weight'])
    except KeyError:
        pass

或者可以使用getdefault和列表推导式来实现：

sentinel = object()
# cache this object 
ignode = interaction_graph.node

neighbor_z_scores = (ignode[neighbor]['weight'] for neighbor in node_neighbors)
# using identity testing, it's slightly faster
neighbor_z_scores = (neighbor for neighbor in neighbor_z_scores if neighbor is not sentinel)

回答于 2025-04-15 由 Python大师

分享举报

你觉得把interaction_graph.neighbors_iter(node)的遍历顺序保持有序（或者用collections.heapq部分有序）怎么样？因为你只是想找最大值，所以可以按降序遍历node_neighbors，第一个在selected_node里的节点肯定是最大的。

其次，selected_node变化的频率有多高？如果变化不频繁，你可以通过保存一个“interaction_graph.node[neighbor] for x in selected_node”的列表来节省很多遍历时间，而不是每次都重新构建这个列表。

编辑：回复评论

排序需要O(n log n)的时间。

不一定，你可能太看重课本上的说法了。尽管课本上这么说，但你有时可以通过利用数据的某种结构来打破O(n log n)的限制。如果你一开始就把邻居列表保存在一个自然有序的数据结构中（比如heapq或二叉树），那么在每次迭代时就不需要重新排序。当然，这是一种空间和时间的权衡，因为你需要存储冗余的邻居列表，并且需要一些代码复杂度来确保当邻居变化时，邻居列表也会更新。

另外，Python的list.sort()使用的是timsort算法，对于几乎有序的数据来说非常快（在某些情况下平均可以达到O(n)）。不过，它仍然无法打破O(n log n)，这一点已经被数学证明是不可能的。

在否定一个解决方案可能不会提高性能之前，你需要先进行性能分析。在进行极端优化时，你可能会发现，在某些非常特殊的边界情况下，老旧且慢的冒泡排序可能会胜过华丽的快速排序或归并排序。

回答于 2025-04-15 由 Python大师

分享举报

我不明白为什么你的“权重”查找必须用 ["weight"] 这种形式（节点是字典？）而不是 .weight（节点是对象）。

如果你的节点是对象，并且字段不多，你可以利用这个 __slots__ 指令来优化它们的存储：

class Node(object):
    # ... class stuff goes here ...

    __slots__ = ('weight',) # tuple of member names.

编辑：我看了你提供的 NetworkX 链接，有几个地方让我感到困惑。首先，在最上面，“字典”的定义是“FIXME”。

总体来看，它似乎坚持使用字典，而不是可以被继承的类来存储属性。虽然在对象上查找属性可能本质上是字典查找，但我不明白使用对象会 更糟糕 的原因。如果有什么不同，使用对象可能会更好，因为：

对象属性查找是非常常见的，
对象属性的键空间比字典键要小得多，因此可以在搜索中使用更优化的比较算法，
对象有 __slots__ 优化，正好适用于这种情况，即你有一个只有几个字段的对象，并且需要优化访问。

例如，我经常在表示坐标的类上使用 __slots__。树节点对我来说似乎也是一个明显的用例。

所以当我看到：

节点
节点可以是任何可哈希的 Python 对象，除了 None。

我想，没问题，但紧接着就是

节点属性
节点可以通过在添加节点或为指定节点 n 分配 G.node[n] 属性字典时使用关键字/值对来分配任意 Python 对象作为属性。

我在想，如果一个节点需要属性，为什么要单独存储？为什么不直接放在节点里？写一个包含 contentString 和 weight 成员的类有什么坏处吗？边缘似乎更疯狂，因为它们被规定为元组，而不是可以被继承的对象。

所以我对 NetworkX 背后的设计决策感到相当困惑。

如果你被迫使用它，我建议把属性从那些字典中移到实际的节点里，或者如果这不行，使用整数作为属性字典的键，而不是字符串，这样搜索时可以使用更快的比较算法。

最后，如果你把生成器结合起来呢：

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in node_neighbors if neighbor in selected_nodes)

回答于 2025-04-15 由 Python大师

分享举报

优化包含多个属性和字典查找的Python代码

4 个回答

撰写回答