Pandas groupby and file writing problem

This function gets applied to each item in the dataframe:

import numpy as np

# `out` is an output file that has already been opened for writing elsewhere.

def item_grouper(df):
    # Get the frequency of each tag applied to the item
    tag_counts = df['tag'].value_counts()
    # Get the most frequent tag (or tags, assuming a tie)
    max_tags = tag_counts[tag_counts==tag_counts.max()]
    # Get the total number of annotations for the item
    total_anno = len(df)
    # Now, process each user who tagged the item
    return df.groupby('uid').apply(user_grouper,total_anno,max_tags,tag_counts)

# This function gets applied to each user who tagged an item
def user_grouper(df,total_anno,max_tags,tag_counts):
    # subtract user's annotations from total annotations for the item
    total_anno = total_anno - len(df)
    # calculate weight
    weight = np.log10(total_anno)
    # check if user has used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values,df['iid']))>0:
        max_tag_count = float(max_tags[0]-1)
    else:
        max_tag_count = float(max_tags[0])
    # for each annotation...
    for i,row in df.iterrows():
        # calculate raw score
        raw_score = (tag_counts[row['tag']]-1) / max_tag_count
        # write to file
        out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
    return df

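So, one grouping function groups the data by iid (item id), does some processing, then groups each sub-dataframe by uid (user_id), does some calculations, and writes to an output file. The output file should have one line per row of the original dataframe, but it doesn't! I keep getting the same data written to the file multiple times. For instance, if I run it on just the first 1000 rows of the dataframe, the output should have 1000 lines (the code writes exactly one line per row), but the resulting file has 1997 lines. Looking at the file shows the exact same lines written multiple (2-4) times, seemingly at random (that is, not every line is duplicated). Any idea what I'm doing wrong?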

2 Answers

User

Answer 1 · edited 2024-04-25 22:55:56

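I agree with chrisb's diagnosis of the problem. As a cleaner approach, consider having your user_grouper() function not write out any values, but return them instead. With a structure like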

def user_grouper(df, ...):
    (...)
    df['max_tag_count'] = some_calculation
    return df

results = df.groupby(...).apply(user_grouper, ...)
for i,row in results.iterrows():
    # calculate raw score
    raw_score = (tag_counts[row['tag']]-1) / row['max_tag_count']
    # write to file
    out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
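To make the outline above concrete, here is a minimal runnable sketch of the "return the values, write afterwards" idea. It is only one possible reading, not the answerer's actual code: the names user_scores, item_scores and write_scores and the sample data are hypothetical. It also assumes that the top-tag check was meant to look at the user's tags (the question's code compares against df['iid']), that the weight should be NaN when no other user annotated the item, and that the dataframe has a unique index so the scores can be joined back by index.

```python
import io

import numpy as np
import pandas as pd


def user_scores(user_df, total_anno, max_tags, tag_counts):
    """Return per-row scores for one user's annotations of one item (no I/O)."""
    # Annotations on this item made by everyone except this user.
    others = total_anno - len(user_df)
    # Assumption: NaN weight when there are no other annotations (avoids log10(0)).
    weight = np.log10(others) if others > 0 else np.nan
    # Assumption: the check is on the user's tags, not on the item id.
    used_top = user_df['tag'].isin(max_tags.index).any()
    max_tag_count = float(max_tags.iloc[0]) - (1.0 if used_top else 0.0)
    raw_score = (user_df['tag'].map(tag_counts) - 1) / max_tag_count
    # Keep the original index so the caller can join the scores back.
    return pd.DataFrame({'raw_score': raw_score, 'weight': weight},
                        index=user_df.index)


def item_scores(item_df):
    """Compute scores for every annotation of one item (no I/O)."""
    tag_counts = item_df['tag'].value_counts()
    max_tags = tag_counts[tag_counts == tag_counts.max()]
    return item_df.groupby('uid', group_keys=False)[['tag']].apply(
        user_scores, len(item_df), max_tags, tag_counts)


def write_scores(results, out):
    """Write exactly one tab-separated line per row of `results`."""
    for row in results.itertuples(index=False):
        fields = [row.uid, row.iid, row.tag, row.raw_score, row.weight]
        out.write('\t'.join(map(str, fields)) + '\n')


# Hypothetical sample data: three users tagging two items.
df = pd.DataFrame({
    'uid': [1, 1, 2, 3, 2, 3],
    'iid': ['a', 'a', 'a', 'a', 'b', 'b'],
    'tag': ['x', 'y', 'x', 'x', 'z', 'z'],
})

scores = df.groupby('iid', group_keys=False)[['uid', 'tag']].apply(item_scores)
results = df.join(scores)   # align the scores with the original rows by index

buffer = io.StringIO()      # stands in for the real output file
write_scores(results, buffer)
```

Because the grouped functions no longer touch the file, it should not matter how many times pandas chooses to call them; the only writes happen in the single pass over results.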

User

Answer 2 · edited 2024-04-25 22:55:56

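See the docs on apply. Pandas will call the function twice on the first group (to decide between a fast and a slow code path), so any side effects of the function (here, the file I/O) will happen twice for the first group.

Your best bet is to iterate over the groups directly, like this: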

for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
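Note that item_grouper itself still calls groupby('uid').apply(user_grouper, ...), so on affected pandas versions the inner level may repeat the write for the first user of each item. A hedged sketch that iterates over both levels explicitly is shown below; the function names write_user_rows and write_all and the sample data are hypothetical, and, as an assumption, the top-tag check looks at the user's tags rather than at df['iid'].

```python
import io

import numpy as np
import pandas as pd


def write_user_rows(uid, iid, user_df, total_anno, max_tags, tag_counts, out):
    """Write one line per annotation made by user `uid` on item `iid`."""
    # Annotations on this item made by everyone except this user.
    others = total_anno - len(user_df)
    # Assumption: NaN weight when there are no other annotations.
    weight = np.log10(others) if others > 0 else float('nan')
    # Assumption: the check is on the user's tags, not on the item id.
    used_top = user_df['tag'].isin(max_tags.index).any()
    max_tag_count = float(max_tags.iloc[0]) - (1.0 if used_top else 0.0)
    for tag in user_df['tag']:
        if max_tag_count > 0:
            raw_score = (tag_counts.loc[tag] - 1) / max_tag_count
        else:
            raw_score = float('nan')
        out.write('\t'.join(map(str, [uid, iid, tag, raw_score, weight])) + '\n')


def write_all(df, out):
    """Iterate over items, then over users, with no groupby.apply involved."""
    for iid, item_df in df.groupby('iid'):
        tag_counts = item_df['tag'].value_counts()
        max_tags = tag_counts[tag_counts == tag_counts.max()]
        for uid, user_df in item_df.groupby('uid'):
            write_user_rows(uid, iid, user_df, len(item_df),
                            max_tags, tag_counts, out)


# Hypothetical sample data: three users tagging two items.
df = pd.DataFrame({
    'uid': [1, 1, 2, 3, 2, 3],
    'iid': ['a', 'a', 'a', 'a', 'b', 'b'],
    'tag': ['x', 'y', 'x', 'x', 'z', 'z'],
})

buffer = io.StringIO()   # stands in for the real output file
write_all(df, buffer)
lines = buffer.getvalue().splitlines()
```

Each group is visited exactly once by the plain for loops, so the number of lines written equals the number of rows in the dataframe.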
