Efficient way to run a function on the columns of a pandas DataFrame?

Posted 2024-03-28 10:19:27


I want to run a function on each column of a pandas DataFrame. The corpus is a pd.DataFrame:

import pandas as pd 
import numpy as np
from scipy.spatial.distance import cosine

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]],index=["stark","groß","schwach","klein", "dick"],columns=["d1", "d2", "d3","d4","d5","d6"])

I also have a query. The query is a pandas Series:

query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"])

Now I want to run the cosine function on the query and each column of the corpus:

for column in corpus:
    print("Similarity of Documents", column, " and query: \n", 1 - cosine(query, corpus[column]))

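Is there a better way to run the cosine function on the columns? Perhaps some method that takes the columns and runs a function on each of them. I would like to avoid the for loop.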


3 answers

You can also use the definition of cosine similarity and implement it yourself.

pandas

corpus.T.dot(query) / (corpus ** 2).sum() ** .5 / (query ** 2).sum() ** .5

d1    0.980581
d2    0.707107
d3    0.288675
d4    0.801784
d5    0.500000
d6    0.894315
dtype: float64
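
As a quick check, here is a minimal sketch that compares the expression above with the explicit loop from the question. It assumes the query is [1,1,0,0,0] on the same index as the corpus, as given in the apply-based answer; the names by_definition and by_loop are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

labels = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60], [2, 2, 0, 2, 0, 20], [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1], [0, 0, 0, 0, 1, 0]],
    index=labels, columns=["d1", "d2", "d3", "d4", "d5", "d6"])
query = pd.Series([1, 1, 0, 0, 0], index=labels)  # assumed query

# Dot product of every column with the query, divided by both vector norms.
by_definition = (
    corpus.T.dot(query) / (corpus ** 2).sum() ** .5 / (query ** 2).sum() ** .5
)

# Reference result: the explicit per-column loop from the question.
by_loop = pd.Series({col: 1 - cosine(query, corpus[col]) for col in corpus})

print(np.allclose(by_definition, by_loop))
```

Both versions should agree up to floating-point rounding, since 1 - cosine(u, v) is the dot product of u and v divided by the product of their norms.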

numpy

^{pr2}$
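
The NumPy snippet for this step is not preserved here. A minimal sketch, assuming it mirrors the pandas expression on the underlying arrays (this is not necessarily the answerer's exact code, and the query [1,1,0,0,0] is taken from the apply-based answer):

```python
import numpy as np
import pandas as pd

labels = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60], [2, 2, 0, 2, 0, 20], [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1], [0, 0, 0, 0, 1, 0]],
    index=labels, columns=["d1", "d2", "d3", "d4", "d5", "d6"])
query = pd.Series([1, 1, 0, 0, 0], index=labels)  # assumed query

c = corpus.values  # shape (5, 6): rows are terms, columns are documents
q = query.values   # shape (5,)

# Column-wise dot products divided by the column norms and the query norm.
r = c.T.dot(q) / np.sqrt((c ** 2).sum(axis=0)) / np.sqrt((q ** 2).sum())

result = pd.Series(r, index=corpus.columns)
print(result)
```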

Following @Divakar's suggestion
np.einsum

c = corpus.values
q = query.values

r = (
        np.einsum('ji,j->i', c, q) /
        np.einsum('ij,ij->j', c, c) ** .5 /
        np.einsum('i,i', q, q) ** .5
    )

pd.Series(r, corpus.columns)

d1    0.980581
d2    0.707107
d3    0.288675
d4    0.801784
d5    0.500000
d6    0.894315
dtype: float64
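
To make the subscripts easier to read, the following sketch checks each einsum call against a plain NumPy equivalent. The variable names and the query [1,1,0,0,0] (taken from the apply-based answer) are assumptions for illustration.

```python
import numpy as np

# Same corpus values as in the question: 5 terms (rows) by 6 documents (columns).
c = np.array([[3, 1, 1, 1, 1, 60], [2, 2, 0, 2, 0, 20], [0, 2, 1, 1, 0, 0],
              [0, 0, 2, 1, 0, 1], [0, 0, 0, 0, 1, 0]])
q = np.array([1, 1, 0, 0, 0])  # assumed query

# 'ji,j->i': sum over rows j of c[j, i] * q[j], one dot product per column.
dots = np.einsum('ji,j->i', c, q)
assert np.array_equal(dots, c.T @ q)

# 'ij,ij->j': sum over rows i of c[i, j] ** 2, the squared norm of each column.
col_sq_norms = np.einsum('ij,ij->j', c, c)
assert np.array_equal(col_sq_norms, (c ** 2).sum(axis=0))

# 'i,i': the squared norm of the query, a scalar.
q_sq_norm = np.einsum('i,i', q, q)
assert q_sq_norm == (q ** 2).sum()

r = dots / col_sq_norms ** .5 / q_sq_norm ** .5
print(r)
```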

You can use cdist with its 'cosine' metric for a vectorized solution, like so:

from scipy.spatial.distance import cdist

out = 1-cdist(query.values[None], corpus.values.T, 'cosine')

Sample run:

^{pr2}$
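
The sample output is not preserved here. As a sketch on the question's data (assuming the query [1,1,0,0,0] from the apply-based answer), note that cdist returns a 2-D array with one row per query vector, so it can be flattened and relabeled with the corpus columns:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

labels = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60], [2, 2, 0, 2, 0, 20], [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1], [0, 0, 0, 0, 1, 0]],
    index=labels, columns=["d1", "d2", "d3", "d4", "d5", "d6"])
query = pd.Series([1, 1, 0, 0, 0], index=labels)  # assumed query

# query.values[None] has shape (1, 5); corpus.values.T has shape (6, 5),
# so every document (column of corpus) becomes one row to compare against.
out = 1 - cdist(query.values[None], corpus.values.T, 'cosine')
print(out.shape)  # one row per query, one column per document

# Flatten the single row and restore the document labels.
similarities = pd.Series(out.ravel(), index=corpus.columns)
print(similarities)
```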

Runtime test:

In [225]: corpus = pd.DataFrame(np.random.rand(100,10000))

In [226]: query = pd.Series(np.random.rand(100))

# @C.Square's apply based soln
In [227]: %timeit corpus.apply(lambda x:1-cosine(query, x), axis=0)
1 loop, best of 3: 352 ms per loop

# Proposed in this post using cdist()
In [228]: %timeit 1-cdist(query.values[None], corpus.values.T, 'cosine')
100 loops, best of 3: 3.2 ms per loop
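
Outside IPython, a comparable measurement can be sketched with the standard timeit module. This is only an illustration: the array size is reduced to 100 x 1000 to keep it quick, the helper names with_apply and with_cdist are hypothetical, and absolute timings depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, cosine

rng = np.random.default_rng(0)
corpus = pd.DataFrame(rng.random((100, 1000)))  # smaller than the original test
query = pd.Series(rng.random(100))

def with_apply():
    # One Python-level call to cosine() per column.
    return corpus.apply(lambda x: 1 - cosine(query, x), axis=0)

def with_cdist():
    # A single vectorized call covering all columns.
    return 1 - cdist(query.values[None], corpus.values.T, 'cosine')

# Both approaches should produce the same similarities.
assert np.allclose(with_apply().values, with_cdist().ravel())

runs = 3
t_apply = timeit.timeit(with_apply, number=runs)
t_cdist = timeit.timeit(with_cdist, number=runs)
print(f"apply: {t_apply / runs:.4f} s per call")
print(f"cdist: {t_cdist / runs:.4f} s per call")
```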

Applying the function with apply is a concise, readable, and fast approach:

import pandas as pd
from scipy.spatial.distance import cosine

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]], index=["stark","groß","schwach","klein", "dick"], columns=["d1", "d2", "d3","d4","d5","d6"])
query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"])

corpus.apply(lambda x:1-cosine(query, x),  # Apply your function
             axis=0)                       # For each column

# d1    0.980581
# d2    0.707107
# d3    0.288675
# d4    0.801784
# d5    0.500000
# d6    0.894315
# dtype: float64
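
One detail worth keeping in mind: cosine compares values by position, not by index label, so the query is assumed to be in the same row order as the corpus. As a hedged variation (not part of the answer above), DataFrame.apply also accepts raw=True, which passes each column to the function as a plain NumPy array instead of a Series; the sketch below only checks that both calls give the same result.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

labels = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60], [2, 2, 0, 2, 0, 20], [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1], [0, 0, 0, 0, 1, 0]],
    index=labels, columns=["d1", "d2", "d3", "d4", "d5", "d6"])
query = pd.Series([1, 1, 0, 0, 0], index=labels)

# Default behaviour: each column arrives as a pandas Series.
result = corpus.apply(lambda x: 1 - cosine(query, x), axis=0)

# raw=True: each column arrives as a NumPy array, so use the query's array too.
q = query.to_numpy()
result_raw = corpus.apply(lambda x: 1 - cosine(q, x), axis=0, raw=True)

print(np.allclose(result, result_raw))
```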
