Python, PyTables - 利用内核搜索的优势

Question

我有一些HDF5文件，这些文件里有多个组，每个组里面都有一个数据集，行数超过2500万。在每次模拟的时间步骤中，每个代理会输出他/她在那个时间点感知到的其他代理。这个场景大约有2000个代理，并且有成千上万的时间步骤；输出的数量级是O(n^2)，这就解释了为什么行数会这么多。

我想计算的是按类别统计的独特感知次数。比如，代理分为红色、蓝色或绿色。我想做一个二维表格，表格的第i行第j列表示至少有一个i类别的代理感知到的j类别的代理数量。（在这个代码示例中，我用的是Sides，但我们也可以用其他方式来分类代理，比如他们持有的武器，或者他们携带的传感器。）

下面是一个输出表的示例；注意，模拟中不输出蓝色/蓝色的感知，因为这占用太多空间，而且我们对此不感兴趣。绿色/绿色也是一样的。

      blue     green      red
blue  0      492       186
green 1075    0     186
red   451    498      26

表格的列包括：

tick - 时间步骤
sensingAgentId - 进行感知的代理的ID
sensedAgentId - 被感知的代理的ID
detRange - 两个代理之间的距离（米）
senseType - 感知类型的枚举

这是我目前用来实现这个功能的代码：

def createHeatmap():
  h5file = openFile("someFile.h5")
  run0 = h5file.root.run0.detections

  # A dictionary of dictionaries, {'blue': {'blue':0, 'red':0, ...}
  classHeat = emptyDict(sides)

  # Interested in Per Category Unique Detections
  seenClass = {}

  # Initially each side has seen no one    
  for theSide in sides:
    seenClass[theSide] = []

  # In-kernel search filtering out many rows in file; in this instance 25,789,825 rows
  # are filtered to 4,409,176  
  classifications = run0.where('senseType == 3')

  # Iterate and filter 
  for row in classifications:
    sensedId = row['sensedAgentId']
    # side is a function that returns the string representation of the side of agent
    # with that id.
    sensedSide = side(sensedId)
    sensingSide = side(row['sensingAgentId'])

    # The side has already seen this agent before; ignore it
    if sensedId in seenClass[sensingSide]:
      continue
    else:
      classHeat[sensingSide][sensedSide] += 1
      seenClass[sensingSide].append(sensedId)


  return classHeat

注意：我有Java背景，所以如果这段代码不够Python风格，我很抱歉。请指出来，并建议我如何改进这段代码，我希望能更熟练地使用Python。

现在，这个过程非常慢：进行这次迭代和成员检查大约需要50秒，而这还是在最严格的成员标准下（其他检测类型的行数更多，迭代起来会更慢）。

我的问题是，是否可以把工作移出Python，放到内核搜索查询中？如果可以的话，怎么做？有没有什么明显的加速方法我没有想到？我需要在一组运行（大约30次）中为每次运行执行这个功能，并且对于多组标准（大约5组），所以如果能加快速度就太好了。

最后一点：我尝试过使用psyco，但几乎没有什么效果。

性能优化数据处理数据集 hdf5 代理模型多维表格内核搜索感知统计

Python, PyTables - 利用内核搜索的优势

1 个回答

撰写回答