Fast sparse vector addition in pandas

Question

This is a question about how to do a large number of table joins in Pandas for the purpose of vector math.

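Through a very, very long processing chain, I have reduced a large amount of data (stored as HDF5 tables) to about 20 sparse vectors, each represented as a Pandas DataFrame with a string-based MultiIndex. The space these vectors live in is very complicated and high-dimensional (this is natural language data), but the vectors overlap to some extent. Each vector has roughly 5K to 60K dimensions, and the combined set of dimensions across them (which varies depending on which 20 vectors I load) is about 200K. (The full space is far larger than 200K dimensions!)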

Up to this point everything is fast: the tables only need to be processed into suitable vectors once.

But now I want to align and sum these vectors, and every solution I have found is rather slow. I am using Pandas 0.12.0 on Python 2.7.

Let A be the store (on disk) from which I load the vectors.

In [106]: nounlist = ["fish-n", "bird-n", "ship-n", "terror-n", "daughter-n", "harm-n", "growth-n", "reception-n", "antenna-n", "bank-n", "friend-n", "city-n", "woman-n", "weapon-n", "politician-n", "money-n", "greed-n", "law-n", "sympathy-n", "wound-n"]

In [107]: matrices = [A[x] for x in nounlist]

(I realize the name matrices is a bit misleading. Apart from the MultiIndex, each one has only a single column.)

So far so good. But now I want to join them so that I can sum them:

In [108]: %timeit matrices[0].join(matrices[1:], how="outer")
1 loops, best of 3: 18.2 s per loop

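This is on a relatively recent processor (a 2.7 GHz AMD Opteron). That is far too slow for high-dimensional data that would ideally be used in a speech processing system.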

I get somewhat better results with reduce:

In [109]: %timeit reduce(lambda x, y: x.join(y, how="outer"), matrices[1:], matrices[0])
1 loops, best of 3: 10.8 s per loop

These timings are fairly consistent across runs. Once the join has returned, the sum itself takes an acceptable amount of time:

In [112]: vec = reduce(lambda x, y: x.join(y, how="outer"), matrices[1:], matrices[0])

In [113]: %timeit vec.T.sum()
1 loops, best of 3: 262 ms per loop
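To make the join-then-sum step concrete, here is a minimal sketch on two toy vectors. The index labels and values are made up for illustration, and the code is written for a current pandas on Python 3 rather than the 0.12.0 setup above. It shows that the outer join aligns on the union of index tuples and that the row-wise sum skips the NaN entries the join introduces.

```python
import pandas as pd

# Hypothetical toy vectors: one column each, indexed by (link, word1).
idx_a = pd.MultiIndex.from_tuples(
    [("A0", "eat-v"), ("A1", "catch-v")], names=["link", "word1"])
idx_b = pd.MultiIndex.from_tuples(
    [("A0", "eat-v"), ("A1", "fly-v")], names=["link", "word1"])
a = pd.DataFrame({"fish-n": [1.0, 2.0]}, index=idx_a)
b = pd.DataFrame({"bird-n": [10.0, 20.0]}, index=idx_b)

# The outer join keeps every index tuple found in either frame;
# entries missing from one side become NaN.
joined = a.join(b, how="outer")

# Summing across columns skips NaN by default (skipna=True), so each
# row ends up holding the sum of the values that are actually present.
total = joined.sum(axis=1)
```

Here `joined.sum(axis=1)` computes the same result as `joined.T.sum()` without building the transposed frame first.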

The closest I have come to bringing the time down to something reasonable is this:

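import pandas as pd

def dictcutter(mlist):
    # Turn each single-column frame into a {(link, word1): value} dict.
    rlist = [x.to_dict()[x.columns[0]] for x in mlist]
    mdict = {}
    # Accumulate values key by key; a key not seen yet starts at 0.0.
    for r in rlist:
        for item in r:
            mdict[item] = mdict.get(item, 0.0) + r[item]
    # Rebuild a MultiIndexed frame; list() keeps this working on Python 3 as well.
    index = pd.MultiIndex.from_tuples(list(mdict.keys()))
    return pd.DataFrame(list(mdict.values()), index=index)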

This runs as follows:

In [114]: %timeit dictcutter(matrices)
1 loops, best of 3: 3.13 s per loop

But every second counts! Is there a way to cut the time down further? Is there a smarter way to add these vectors dimension by dimension?
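One direction that might be worth timing is to skip the wide join entirely: stack all the vectors end to end and let a groupby add up the entries that share an index tuple. The sketch below is only that, a sketch. The helper name `sum_sparse` and the toy data are hypothetical, the code targets a current pandas on Python 3, and it has not been timed against the data described here, so whether it beats the numbers above is an open question.

```python
import pandas as pd

def sum_sparse(frames):
    """Sum single-column, MultiIndexed frames dimension by dimension."""
    # Take the only column of each frame as a Series and stack them
    # end to end; index tuples shared by several vectors now repeat.
    stacked = pd.concat([f.iloc[:, 0] for f in frames])
    # Group on both index levels and add up the repeated entries.
    return stacked.groupby(level=[0, 1]).sum()

# Hypothetical toy vectors, indexed by (link, word1).
idx_a = pd.MultiIndex.from_tuples(
    [("A0", "eat-v"), ("A1", "catch-v")], names=["link", "word1"])
idx_b = pd.MultiIndex.from_tuples(
    [("A0", "eat-v"), ("A1", "fly-v")], names=["link", "word1"])
a = pd.DataFrame({"fish-n": [1.0, 2.0]}, index=idx_a)
b = pd.DataFrame({"bird-n": [10.0, 20.0]}, index=idx_b)

result = sum_sparse([a, b])
```

The result is a Series rather than a DataFrame, which avoids ever materializing the 20-column frame full of NaN values that the outer join produces.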

Edited to add the details Jeff requested in the comments:

Some details about the "fish-n" vector:

In [14]: vector = A['fish-n']

In [15]: vector.head()
Out[15]: 
                   fish-n
link   word1             
A2     give-v  140.954675
A4     go-v    256.313976
AM-CAU go-v      0.916041
AM-DIR go-v     29.022072
AM-MNR go-v     21.941577

In [16]: vector.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 5424 entries, (A2, give-v) to (A1, gotta-v)
Data columns (total 1 columns):
fish-n    5424  non-null values
dtypes: float64(1)

Digging deeper:

In [17]: vector.loc['A0']
Out[17]: 
<class 'pandas.core.frame.DataFrame'>
Index: 1058 entries, isolate-v to overdo-v
Data columns (total 1 columns):
fish-n    1058  non-null values
dtypes: float64(1)

In [18]: vector.loc['A0'][500:520]
Out[18]: 
                 fish-n
word1                  
whip-v         3.907307
fake-v         0.117985
sip-v          0.579624
impregnate-v   0.885079
flavor-v       5.583664
inspire-v      2.251709
pepper-v       0.967941
overrun-v      1.435597
clutch-v       0.140110
intercept-v   20.513823
refined-v      0.738980
gut-v          7.570856
ascend-v      12.686698
submerge-v     1.761342
catapult-v     0.577075
cleaning-v     1.492284
floating-v     5.318519
incline-v      2.270102
plummet-v      0.243116
propel-v       3.957041

Now multiply that by 20 and try to sum them all...
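For anyone who wants to reproduce the timings without access to the original store, the following is a rough sketch of how synthetic vectors of about this shape could be generated. Everything in it is an assumption for illustration: the function name `make_vectors`, the made-up labels, and the uniform random values only mimic the sizes quoted above (20 vectors, 5K to 60K entries each, drawn from a shared space of about 200K dimensions), not the real distribution of the data.

```python
import numpy as np
import pandas as pd

def make_vectors(n_vectors=20, space=200000, lo=5000, hi=60000, seed=0):
    """Build synthetic single-column frames that share one index space."""
    rng = np.random.RandomState(seed)
    # Hypothetical dimension labels: unique (link, word1) pairs.
    links = ["L%d" % (i % 50) for i in range(space)]
    words = ["w%d-v" % i for i in range(space)]
    full = pd.MultiIndex.from_arrays([links, words], names=["link", "word1"])
    vectors = []
    for k in range(n_vectors):
        # Each vector covers a random subset of the shared space.
        size = rng.randint(lo, hi + 1)
        pick = rng.choice(space, size=size, replace=False)
        col = "noun%d-n" % k
        vectors.append(pd.DataFrame({col: rng.rand(size)}, index=full[pick]))
    return vectors

# Small call for a quick check; the defaults give the full-size case.
small = make_vectors(n_vectors=3, space=100, lo=5, hi=10)
```

Calling `make_vectors()` with the defaults gives a list that can stand in for `matrices` when timing the different approaches.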
