Self join on all the different values and apply an aggregate function

I have a DataFrame in Python whose columns are paper, author, col_1, col_2, ..., col_100

Dtypes:
Paper type: string (unique)
Author type: string
col_x: float

I know that what I am trying to do is complex and heavy in terms of performance, but my solution takes far too long to complete.

For each row of the DataFrame, I want to do a self-join with all the authors that are different from the author of that row. Then apply a function between the col_x values of that row and the col_x values of each joined row, and get some aggregated results.

My solution uses iterrows, which I know is the slowest option, but I cannot think of any other way.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from statistics import mean 

papers = ... #is my dataframe
cols = ['min_col', 'avg_col', 'max_col', 'label']
all_cols = ['col_1', 'col_2', ..., 'col_100']

df_result = pd.DataFrame({}, columns = cols)

for ind, paper in papers.iterrows():

    col_vector = paper[all_cols].values.reshape(1,-1) #bring the columns in the correct format   
    temp = papers[papers.author != paper.author].author.unique() #get all authors that are not the same with the one in the row
    for auth in temp:
        temp_papers = papers[papers.author == auth]  #get all papers of that author 
        if temp_papers.shape[0] > 1: #if I have more than 1 paper find the cosine_similarity of the row and the joined rows
            res = []
            for t_ind, t_paper in temp_papers.iterrows():
                res.append(cosine_similarity(col_vector, t_paper[all_cols].values.reshape(1,-1))[0][0])

            df_result = df_result.append(pd.DataFrame([[min(res), mean(res), max(res), 0]], columns = cols), ignore_index = True)

Edit 2:

I also tried doing a cross join of the DataFrame with itself and then excluding the rows that have the same author. However, when I do that, I get the following error after just a few rows:

papers['key'] = 0
cross = papers.merge(papers, on = 'key', how = 'outer')
>> [IPKernelApp] WARNING | No such comm: 3a1ea2fa71f711ea847aacde48001122
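
For reference, here is a minimal sketch of that cross-join idea on a tiny dummy frame (the demo data and the _other suffix are only illustrative). Note that with 45k rows the cross join alone produces roughly 45k × 45k ≈ 2 billion rows, which is almost certainly why the kernel gives up:

import pandas as pd

# tiny dummy frame just to illustrate the cross-join-and-filter idea
papers_demo = pd.DataFrame({'author': ['a', 'a', 'b', 'c'],
                            'col_1': [1.0, 2.0, 3.0, 4.0]})

papers_demo['key'] = 0  # constant join key
cross = papers_demo.merge(papers_demo, on='key', how='outer',
                          suffixes=('', '_other'))
# keep only the pairs whose rows belong to different authors
cross = cross[cross['author'] != cross['author_other']]

# with pandas >= 1.2 the helper key is not needed:
# cross = papers_demo.merge(papers_demo, how='cross', suffixes=('', '_other'))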

Additional info

  • The DataFrame has 45k rows

  • There are about 5k distinct authors


1 Answer

First of all, if the dataframe is not too big (in your case it seems it is), you can do it with a vectorized use of cosine_similarity. To do so, you first need a mask of all the authors that have more than one row, then create a dataframe with enough information in the index and the columns to be able to group by, and finally query the rows you want:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# here are some dummy variables
np.random.seed(10)
papers = pd.DataFrame({'author': list('aabbcdddae'), 
                       'col_1': np.random.randint(30, size=10), 
                       'col_2': np.random.randint(20, size=10), 
                       'col_3': np.random.randint(10, size=10),})
all_cols = ['col_1', 'col_2','col_3']

First solution:

#mask author with more than 1 row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)

# use cosine_similarity with all the rows at a time
# compared to all the rows with authors with more than a row
df_f = (pd.DataFrame(cosine_similarity(papers.loc[:,all_cols],papers.loc[mask_author,all_cols]), 
                     # create index and columns to keep some info about authors
                     index=pd.MultiIndex.from_frame(papers['author'].reset_index(), 
                                                    names=['index_ori', 'author_ori']), 
                     columns=papers.loc[mask_author,'author'])
          # put all columns as rows to be able to perform a groupby all index levels and agg
          .stack()
          .groupby(level=[0,1,2], axis=0).agg([min, 'mean', max])
          # remove rows where authors were compared with themselves
          .query('author_ori != author')
          # add label column with 0, not sure why
          .assign(label=0)
          # reset index as you don't seem to care
          .reset_index(drop=True))
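
On the dummy data above, the shape of the result follows directly from the construction (the exact similarity values of course depend on the random seed):

# each of the 10 papers is compared against the authors that have more than one
# paper (a, b and d), and the pairs where a paper meets its own author are dropped:
# 10 rows * 3 author groups - 8 same-author pairs = 22 result rows
print(df_f.shape)             # (22, 4)
print(df_f.columns.tolist())  # ['min', 'mean', 'max', 'label']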

Now the problem is that with 45K rows and 5K authors, I doubt a regular computer can handle the previous method. The idea then is to do the same operation, but per group of authors:

# mask for authors with more than a row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)
# instead of doing it for each iteration, save the df with authors with more than a row
papers_gt1 = papers.loc[mask_author, :]

# compared to your method, it is more efficient to save the dataframes in a list
# and concat them at the end than to append to a dataframe at each iteration
res = []
# iterate over each author
for auth, dfg in papers[all_cols].groupby(papers['author']):
    # mask to remove the current author from the comparison df
    mask_auth = papers_gt1['author'].ne(auth)
    # append the dataframe built on the same idea as the first solution,
    # with a small difference in the operations since dfg and
    # papers_gt1.loc[mask_auth, all_cols] never contain the same author here
    res.append(pd.DataFrame(cosine_similarity(dfg, papers_gt1.loc[mask_auth, all_cols]), 
                            columns=papers_gt1.loc[mask_auth, 'author'])
                 .stack()
                 .groupby(level=[0, 1]).agg([min, 'mean', max]))
# outside of the loop, concat everything and add the label column
df_f = pd.concat(res, ignore_index=True).assign(label=0)
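
As a quick sanity check on the dummy data (df_f_first is a hypothetical name for the output of the first solution, which the answer also stores in df_f), both approaches should contain the same aggregates, only in a different row order:

import numpy as np

# df_f_first: hypothetical copy of the first, fully vectorized solution's output
a = df_f_first[['min', 'mean', 'max']].sort_values(['min', 'mean', 'max']).to_numpy()
b = df_f[['min', 'mean', 'max']].sort_values(['min', 'mean', 'max']).to_numpy()
assert np.allclose(a, b)  # same values, different row order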

Note: the whole operation is still long, but in your code you were losing efficiency at several levels. If you want to keep iterrows, here are a few points that would make it more efficient:

  • As you mentioned yourself, iterrows is not recommended, but two iterrows plus another loop is really slow
  • The second iterrows defeats the advantage of cosine_similarity, which is vectorized for input arrays with several rows
  • Doing temp = papers[papers.author != paper.author].author.unique() at each iteration is a huge loss of time; you can create the list of unique authors before the loop and then, inside the loop, just check that the current paper.author is different from auth (using your notation)
  • Same idea for if temp_papers.shape[0] > 1: assuming the number of papers does not change, if you create the list of unique auth outside the loop (previous point), it can already exclude the authors that have only one paper
  • Finally, using append on a dataframe at every loop is a huge loss of time, see this answer for a timing comparison; it is better to create another list res_agg, do res_agg.append([min(res), mean(res), max(res), 0]) inside the loop, and after all the loops build df_result = pd.DataFrame(res_agg, columns=cols). A sketch combining these points is shown below.
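
Putting these points together, here is a minimal sketch of what the loop could look like while keeping the outer iterrows (it reuses papers, all_cols and cols from the question and assumes the col_x columns are numeric):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

cols = ['min_col', 'avg_col', 'max_col', 'label']

# build the comparison frame once: keep only the authors with more than one paper
mask_gt1 = papers.groupby('author')['author'].transform('count').gt(1)
papers_gt1 = papers.loc[mask_gt1]

res_agg = []  # collect plain lists and build the DataFrame only once at the end
for ind, paper in papers.iterrows():
    col_vector = paper[all_cols].values.astype(float).reshape(1, -1)
    # one vectorized cosine_similarity call per other author
    # instead of a second iterrows over that author's papers
    for auth, group in papers_gt1[papers_gt1['author'] != paper['author']].groupby('author'):
        sims = cosine_similarity(col_vector, group[all_cols].values)[0]
        res_agg.append([sims.min(), sims.mean(), sims.max(), 0])

df_result = pd.DataFrame(res_agg, columns=cols)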
