Self join on all the different values and apply an aggregate function

I have a DataFrame in Python whose columns are paper, author, col_1, col_2, ..., col_100

Dtypes:
Paper type: string (unique)
Author type: string
col_x: float

I know that what I am trying to do is complex and heavy in terms of performance, but my solution takes far too long to complete.

For each row of the DataFrame, I want to do a self-join with all the authors that are different from the author of that row. Then apply a function between the col_x values of that row and the col_x values of each joined row, and get some aggregated results.

My solution uses iterrows, which I know is the slowest option, but I cannot think of any other way.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from statistics import mean 

papers = ... #is my dataframe
cols = ['min_col', 'avg_col', 'max_col', 'label']
all_cols = ['col_1', 'col_2', ..., 'col_100']

df_result = pd.DataFrame({}, columns = cols)

for ind, paper in papers.iterrows():

    col_vector = paper[all_cols].values.reshape(1,-1) #bring the columns in the correct format   
    temp = papers[papers.author != paper.author].author.unique() #get all authors that are not the same with the one in the row
    for auth in temp:
        temp_papers = papers[papers.author == auth]  #get all papers of that author 
        if temp_papers.shape[0] > 1: #if I have more than 1 paper find the cosine_similarity of the row and the joined rows
            res = []
            for t_ind, t_paper in temp_papers.iterrows():
                res.append(cosine_similarity(col_vector, t_paper[all_cols].values.reshape(1,-1))[0][0])

            df_result = df_result.append(pd.DataFrame([[min(res), mean(res), max(res), 0]], columns = cols), ignore_index = True)

Edit 2:

I also tried doing a cross join of the DataFrame with itself and then excluding the rows that have the same author. However, when I do that, I get the following error after just a few rows:

papers['key'] = 0
cross = papers.merge(papers, on = 'key', how = 'outer')
>> [IPKernelApp] WARNING | No such comm: 3a1ea2fa71f711ea847aacde48001122
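
For reference, here is a minimal sketch of that cross-join idea on a tiny dummy frame (the demo data and the _other suffix are only illustrative). Note that with 45k rows the cross join alone produces roughly 45k × 45k ≈ 2 billion rows, which is almost certainly why the kernel gives up:

import pandas as pd

# tiny dummy frame just to illustrate the cross-join-and-filter idea
papers_demo = pd.DataFrame({'author': ['a', 'a', 'b', 'c'],
                            'col_1': [1.0, 2.0, 3.0, 4.0]})

papers_demo['key'] = 0  # constant join key
cross = papers_demo.merge(papers_demo, on='key', how='outer',
                          suffixes=('', '_other'))
# keep only the pairs whose rows belong to different authors
cross = cross[cross['author'] != cross['author_other']]

# with pandas >= 1.2 the helper key is not needed:
# cross = papers_demo.merge(papers_demo, how='cross', suffixes=('', '_other'))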

Additional info

  • The DataFrame has 45k rows

  • There are about 5k distinct authors


1 Answer

First of all, if the dataframe is not too big (in your case it seems it is), you can do it with a vectorized use of cosine_similarity. To do so, you first need a mask of all the authors that have more than one row, then create a dataframe with enough information in the index and the columns to be able to group by, and finally query the rows you want:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# here are some dummy variables
np.random.seed(10)
papers = pd.DataFrame({'author': list('aabbcdddae'), 
                       'col_1': np.random.randint(30, size=10), 
                       'col_2': np.random.randint(20, size=10), 
                       'col_3': np.random.randint(10, size=10),})
all_cols = ['col_1', 'col_2','col_3']

First solution:

#mask author with more than 1 row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)

# use cosine_similarity with all the rows at a time
# compared to all the rows with authors with more than a row
df_f = (pd.DataFrame(cosine_similarity(papers.loc[:,all_cols],papers.loc[mask_author,all_cols]), 
                     # create index and columns to keep some info about authors
                     index=pd.MultiIndex.from_frame(papers['author'].reset_index(), 
                                                    names=['index_ori', 'author_ori']), 
                     columns=papers.loc[mask_author,'author'])
          # put all columns as rows to be able to perform a groupby all index levels and agg
          .stack()
          .groupby(level=[0,1,2], axis=0).agg([min, 'mean', max])
          # remove rows where authors were compared with themselves
          .query('author_ori != author')
          # add label column with 0, not sure why
          .assign(label=0)
          # reset index as you don't seem to care
          .reset_index(drop=True))
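
On the dummy data above, the shape of the result follows directly from the construction (the exact similarity values of course depend on the random seed):

# each of the 10 papers is compared against the authors that have more than one
# paper (a, b and d), and the pairs where a paper meets its own author are dropped:
# 10 rows * 3 author groups - 8 same-author pairs = 22 result rows
print(df_f.shape)             # (22, 4)
print(df_f.columns.tolist())  # ['min', 'mean', 'max', 'label']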

Now the problem is that with 45K rows and 5K authors, I doubt a regular computer can handle the previous method. The idea then is to do the same operation, but per group of authors:

# mask for authors with more than a row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)
# instead of doing it for each iteration, save the df with authors with more than a row
papers_gt1 = papers.loc[mask_author, :]

# compared to your method, it is more efficient to save the dataframes in a list
# and concat them at the end than to append to a dataframe at each iteration
res = []
# iterate over each author
for auth, dfg in papers[all_cols].groupby(papers['author']):
    # mask to remove the current author from the comparison df
    mask_auth = papers_gt1['author'].ne(auth)
    # append the dataframe built on the same idea as the first solution,
    # with a small difference in the operations since dfg and
    # papers_gt1.loc[mask_auth, all_cols] never contain the same author here
    res.append(pd.DataFrame(cosine_similarity(dfg, papers_gt1.loc[mask_auth, all_cols]), 
                            columns=papers_gt1.loc[mask_auth, 'author'])
                 .stack()
                 .groupby(level=[0, 1]).agg([min, 'mean', max]))
# outside of the loop, concat everything and add the label column
df_f = pd.concat(res, ignore_index=True).assign(label=0)
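
As a quick sanity check on the dummy data (df_f_first is a hypothetical name for the output of the first solution, which the answer also stores in df_f), both approaches should contain the same aggregates, only in a different row order:

import numpy as np

# df_f_first: hypothetical copy of the first, fully vectorized solution's output
a = df_f_first[['min', 'mean', 'max']].sort_values(['min', 'mean', 'max']).to_numpy()
b = df_f[['min', 'mean', 'max']].sort_values(['min', 'mean', 'max']).to_numpy()
assert np.allclose(a, b)  # same values, different row order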

Note: the whole operation is still long, but in your code you were losing efficiency at several levels. If you want to keep iterrows, here are a few points that would make it more efficient:

  • As you mentioned yourself, iterrows is not recommended, but two iterrows plus another loop is really slow
  • The second iterrows defeats the advantage of cosine_similarity, which is vectorized for input arrays with several rows
  • Doing temp = papers[papers.author != paper.author].author.unique() at each iteration is a huge loss of time; you can create the list of unique authors before the loop and then, inside the loop, just check that the current paper.author is different from auth (using your notation)
  • Same idea for if temp_papers.shape[0] > 1: assuming the number of papers does not change, if you create the list of unique auth outside the loop (previous point), it can already exclude the authors that have only one paper
  • Finally, using append on a dataframe at every loop is a huge loss of time, see this answer for a timing comparison; it is better to create another list res_agg, do res_agg.append([min(res), mean(res), max(res), 0]) inside the loop, and after all the loops build df_result = pd.DataFrame(res_agg, columns=cols). A sketch combining these points is shown below.
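
Putting these points together, here is a minimal sketch of what the loop could look like while keeping the outer iterrows (it reuses papers, all_cols and cols from the question and assumes the col_x columns are numeric):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

cols = ['min_col', 'avg_col', 'max_col', 'label']

# build the comparison frame once: keep only the authors with more than one paper
mask_gt1 = papers.groupby('author')['author'].transform('count').gt(1)
papers_gt1 = papers.loc[mask_gt1]

res_agg = []  # collect plain lists and build the DataFrame only once at the end
for ind, paper in papers.iterrows():
    col_vector = paper[all_cols].values.astype(float).reshape(1, -1)
    # one vectorized cosine_similarity call per other author
    # instead of a second iterrows over that author's papers
    for auth, group in papers_gt1[papers_gt1['author'] != paper['author']].groupby('author'):
        sims = cosine_similarity(col_vector, group[all_cols].values)[0]
        res_agg.append([sims.min(), sims.mean(), sims.max(), 0])

df_result = pd.DataFrame(res_agg, columns=cols)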
