Python: optimizing long cumulative sums

Question

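I have a program that processes a large amount of experimental data. The data is stored as a list of objects, each an instance of a class with the following attributes:

time_point - the time of the sample
cluster - the name of the cluster of nodes the sample was taken from
node - the name of the node the sample was taken from
qty1 - the sample value for the first quantity
qty2 - the sample value for the second quantity

I need to derive some values from this data, grouped three ways: once for the whole sample, once per cluster of nodes, and once per node. The values depend on the (time-sorted) cumulative sums of qty1 and qty2: the maximum of the element-wise sum of the two cumulative sums, the time point at which that maximum occurs, and the cumulative qty1 and qty2 values at that time point.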
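
To make the target quantities concrete, here is a minimal sketch on made-up numbers. The tuples and variable names are illustrative assumptions, not part of the real data model; the running sums are built with itertools.accumulate from the standard library.

```python
from itertools import accumulate

# Hypothetical samples, already sorted by time: (time_point, qty1, qty2)
samples = [(1, 2, 1), (2, -1, 3), (3, 4, -2), (4, -5, 0)]

times = [s[0] for s in samples]
cum_qty1 = list(accumulate(s[1] for s in samples))   # [2, 1, 5, 0]
cum_qty2 = list(accumulate(s[2] for s in samples))   # [1, 4, 2, 2]
combo = [a + b for a, b in zip(cum_qty1, cum_qty2)]  # [3, 5, 7, 2]

# Index of the first occurrence of the largest combined value
peak_index = max(range(len(combo)), key=combo.__getitem__)

# (time of peak, combined maximum, cumulative qty1, cumulative qty2)
peak = (times[peak_index], combo[peak_index],
        cum_qty1[peak_index], cum_qty2[peak_index])
print(peak)  # (3, 7, 5, 2)
```

Here the combined cumulative sum peaks at 7 at time point 3, where the cumulative qty1 is 5 and the cumulative qty2 is 2.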

I came up with the following solution:

import operator
from collections import defaultdict

dataset.sort(key=operator.attrgetter('time_point'))

# For the whole set
sys_qty1 = 0
sys_qty2 = 0
sys_combo = 0
sys_max = 0
system_peak = None  # stays None if the combined sum never rises above 0

# For the cluster grouping
cluster_qty1 = defaultdict(int)
cluster_qty2 = defaultdict(int)
cluster_combo = defaultdict(int)
cluster_max = defaultdict(int)
cluster_peak = defaultdict(int)

# For the node grouping
node_qty1 = defaultdict(int)
node_qty2 = defaultdict(int)
node_combo = defaultdict(int)
node_max = defaultdict(int)
node_peak = defaultdict(int)

for t in dataset:
  # For the whole system ######################################################
  sys_qty1 += t.qty1
  sys_qty2 += t.qty2
  sys_combo = sys_qty1 + sys_qty2
  if sys_combo > sys_max:
    sys_max = sys_combo
    # The Peak class is to record the time point and the cumulative quantities
    system_peak = Peak(time_point=t.time_point,
                       qty1=sys_qty1,
                       qty2=sys_qty2)
  # For the cluster grouping ##################################################
  cluster_qty1[t.cluster] += t.qty1
  cluster_qty2[t.cluster] += t.qty2
  cluster_combo[t.cluster] = cluster_qty1[t.cluster] + cluster_qty2[t.cluster]
  if cluster_combo[t.cluster] > cluster_max[t.cluster]:
    cluster_max[t.cluster] = cluster_combo[t.cluster]
    cluster_peak[t.cluster] = Peak(time_point=t.time_point,
                                   qty1=cluster_qty1[t.cluster],
                                   qty2=cluster_qty2[t.cluster])
  # For the node grouping #####################################################
  node_qty1[t.node] += t.qty1
  node_qty2[t.node] += t.qty2
  node_combo[t.node] = node_qty1[t.node] + node_qty2[t.node]
  if node_combo[t.node] > node_max[t.node]:
    node_max[t.node] = node_combo[t.node]
    node_peak[t.node] = Peak(time_point=t.time_point,
                             qty1=node_qty1[t.node],
                             qty2=node_qty2[t.node])

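This produces the correct results, but I'm wondering whether it can be made more readable/Pythonic, or faster/more scalable.

The code above is attractive in that it loops through the (large) dataset only once, but unattractive in that I've essentially copied and pasted the same algorithm three times.

To avoid the copy/paste problem, I also tried this: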

def find_peaks(level, dataset):

  def grouping(obj, attr_name):
    # 'system' puts every sample in one group; otherwise group by the attribute
    if attr_name == 'system':
      return attr_name
    return getattr(obj, attr_name)

  cuml_qty1 = defaultdict(int)
  cuml_qty2 = defaultdict(int)
  cuml_combo = defaultdict(int)
  level_max = defaultdict(int)
  level_peak = defaultdict(int)

  for t in dataset:
    key = grouping(t, level)  # compute the group key once per sample
    cuml_qty1[key] += t.qty1
    cuml_qty2[key] += t.qty2
    cuml_combo[key] = cuml_qty1[key] + cuml_qty2[key]
    if cuml_combo[key] > level_max[key]:
      level_max[key] = cuml_combo[key]
      level_peak[key] = Peak(time_point=t.time_point,
                             qty1=cuml_qty1[key],
                             qty2=cuml_qty2[key])
  return level_peak

system_peak = find_peaks('system', dataset)
cluster_peak = find_peaks('cluster', dataset)
node_peak = find_peaks('node', dataset)
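
One possible way to keep the single pass while still avoiding the copy/paste is to drive one loop from a mapping of level names to key functions. The sketch below has only been checked on toy data; Sample, the Peak namedtuple, find_all_peaks, and the levels mapping are hypothetical names introduced for illustration. Like the code above, it starts each maximum at 0, so a group whose combined sum never rises above 0 gets no peak.

```python
from collections import defaultdict, namedtuple
from operator import attrgetter

# Hypothetical stand-ins for the real sample class and the Peak class
Sample = namedtuple('Sample', 'time_point cluster node qty1 qty2')
Peak = namedtuple('Peak', 'time_point qty1 qty2')

def find_all_peaks(dataset, levels):
  """One pass over time-sorted data; levels maps a level name to a key function."""
  # Running [qty1, qty2] totals per level and group
  totals = {name: defaultdict(lambda: [0, 0]) for name in levels}
  # Largest combined sum seen so far per level and group (starts at 0)
  best = {name: defaultdict(int) for name in levels}
  peaks = {name: {} for name in levels}
  for t in dataset:
    for name, key_of in levels.items():
      key = key_of(t)
      running = totals[name][key]
      running[0] += t.qty1
      running[1] += t.qty2
      combo = running[0] + running[1]
      if combo > best[name][key]:
        best[name][key] = combo
        peaks[name][key] = Peak(t.time_point, running[0], running[1])
  return peaks

levels = {
  'system': lambda t: 'system',      # every sample falls into one group
  'cluster': attrgetter('cluster'),
  'node': attrgetter('node'),
}

# Made-up data for illustration only
dataset = [
  Sample(1, 'c1', 'n1', 2, 1),
  Sample(2, 'c1', 'n2', 1, 1),
  Sample(3, 'c2', 'n3', -3, 0),
  Sample(4, 'c1', 'n1', 1, 0),
]
peaks = find_all_peaks(sorted(dataset, key=attrgetter('time_point')), levels)
print(peaks['system']['system'])  # Peak(time_point=2, qty1=3, qty2=2)
```

Adding another grouping level would then only mean adding one entry to levels, at the cost of an inner loop over the levels for every sample.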

For the system-level calculation, I also came up with this, which looks neat:

dataset.sort(key=operator.attrgetter('time_point'))

def cuml_sum(seq):
  rseq = []
  t = 0
  for i in seq:
    t += i
    rseq.append(t)
  return rseq

time_get = operator.attrgetter('time_point')
q1_get = operator.attrgetter('qty1')
q2_get = operator.attrgetter('qty2')

timeline = [time_get(t) for t in dataset]
cuml_qty1 = cuml_sum([q1_get(t) for t in dataset])
cuml_qty2 = cuml_sum([q2_get(t) for t in dataset])
cuml_combo = [q1 + q2 for q1, q2 in zip(cuml_qty1, cuml_qty2)]

combo_max = max(cuml_combo)
peak_index = cuml_combo.index(combo_max)  # position of the first maximum
time_max = timeline[peak_index]
q1_at_max = cuml_qty1[peak_index]
q2_at_max = cuml_qty2[peak_index]

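However, cool as this version's use of list comprehensions and zip() is, it loops through the dataset three times just for the system-level calculation, and I can't think of a good way to do the cluster-level and node-level calculations without doing something slow like: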

timeline = defaultdict(int)
cuml_qty1 = defaultdict(int)
#...etc.

for c in cluster_list:
  timeline[c] = [time_get(t) for t in dataset if t.cluster == c]
  cuml_qty1[c] = cuml_sum([q1_get(t) for t in dataset if t.cluster == c])
  #...etc.
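
A cheaper variant of this per-cluster approach might be to bucket the samples once and then build the per-cluster lists from the buckets, so the dataset is not rescanned for every cluster. This is a minimal sketch assuming time-sorted input; Sample, by_cluster, and the sample values are hypothetical and only there for illustration.

```python
from collections import defaultdict, namedtuple
from itertools import accumulate

# Hypothetical stand-in for the real sample class
Sample = namedtuple('Sample', 'time_point cluster node qty1 qty2')

# Made-up data, already sorted by time_point
dataset = [
  Sample(1, 'c1', 'n1', 2, 1),
  Sample(2, 'c2', 'n3', 5, 0),
  Sample(3, 'c1', 'n2', 1, 1),
]

# One pass: bucket the samples by cluster, preserving time order
by_cluster = defaultdict(list)
for t in dataset:
  by_cluster[t.cluster].append(t)

# Each sample is visited once more per list, instead of once per cluster
timeline = {c: [t.time_point for t in rows] for c, rows in by_cluster.items()}
cuml_qty1 = {c: list(accumulate(t.qty1 for t in rows))
             for c, rows in by_cluster.items()}
cuml_qty2 = {c: list(accumulate(t.qty2 for t in rows))
             for c, rows in by_cluster.items()}
```

The same bucketing keyed on t.node would cover the node level. The buckets do hold references to every sample, so memory use grows with the dataset.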

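Can anyone here suggest improvements? The first snippet above runs well on my initial dataset (on the order of a million records), but later datasets will have more records and more clusters/nodes, so scalability is a concern.

This is my first non-trivial use of Python, and I want to make sure I'm taking proper advantage of the language (this code replaces a very convoluted set of SQL queries, and earlier versions of the Python code were essentially inefficient, direct translations of the SQL). I don't normally do much programming, so I may be missing something elementary.

Many thanks!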
