When the count is equal

Posted 2024-06-02 05:00:14


The following dataset represents purchase behavior:

user_id, product_code, bought_date, time_spent, store_id, product_type, refurbished, unqiue_visit_id
001, e.12, 20120102, 104, 101, computer, yes, 1010
002, e.24, 20120201, 100, 101, infant-dress, no, 2001
003, s.32, 20130302, 230, 101, shoes, no, 2121
004, y.23, 20130404, 212, 103, computer, yes, 2422
005, s.43, 20130803, 104, 101, laptop, yes, 2342
001, a.12, 20120102, 104, 101, computer, yes, 1011
002, b.24, 20120201, 100, 101, infant-dress, no, 2001
003, c.32, 20130302, 230, 101, shoes, no, 2122
004, e.23, 20130404, 212, 103, computer, yes, 2424
005, f.43, 20130803, 104, 101, laptop, yes, 2340
001, g.12, 20120102, 104, 101, computer, yes, 1013
002, h.24, 20120201, 100, 101, infant-dress, no, 2031
003, l.32, 20130302, 230, 101, shoes, no, 2000
004, m.23, 20130404, 212, 103, computer, yes, 1422
005, d.43, 20130803, 104, 101, laptop, yes, 1142
001, d.12, 20120102, 104, 101, desk, yes, 1110
002, f.24, 20120201, 100, 101, glass, no, 1111
003, n.32, 20130302, 230, 101, liquid, no, 2021
004, t.23, 20130404, 212, 103, liquid, yes, 22
005, u.43, 20130803, 104, 101, dress, yes, 2942
001, d.12, 20120102, 104, 101, desk, yes, 1910
002, f.24, 20120201, 100, 101, glass, no, 2901
003, n.32, 20130302, 230, 101, liquid, no, 2921
004, t.23, 20130404, 212, 103, liquid, yes, 2922
005, u.43, 20130803, 104, 101, dress, yes, 2942
001, kk.12, 20120103, 105, 101, desk, yes, 410
003, n.32, 20130303, 230, 101, liquid, no, 2621
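For anyone who wants to experiment with these rows, here is a minimal sketch of loading them into pandas. It assumes the data is available as comma-separated text and reuses the question's variable name `merged_visits_df`; only the first three rows are copied here for brevity. `skipinitialspace=True` is there because every comma is followed by a space, and reading `user_id` as a string preserves the leading zeros.

```python
import io
import pandas as pd

# First three rows of the dataset above, copied verbatim (header typo included).
raw = """user_id, product_code, bought_date, time_spent, store_id, product_type, refurbished, unqiue_visit_id
001, e.12, 20120102, 104, 101, computer, yes, 1010
002, e.24, 20120201, 100, 101, infant-dress, no, 2001
003, s.32, 20130302, 230, 101, shoes, no, 2121
"""

merged_visits_df = pd.read_csv(
    io.StringIO(raw),
    skipinitialspace=True,    # drop the space that follows each comma
    dtype={"user_id": str},   # keep leading zeros such as "001"
)
```

Without `skipinitialspace=True`, column names and string values would keep a leading space (for example `" product_type"`), and the later `groupby` calls would fail to find the columns.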

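The end goal is to assign a product type to each user using the following steps.

First, I group by user_id and product_type and get the number of visits (count) the user made to each product_type.

When the counts are equal within a group (user_id, product_id), pick the product type the user visited most recently and assign it to the user. If the visit dates are also equal, break the tie by looking at refurbished (yes > no).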

visit_counts = merged_visits_df.groupby(['user_id','product_type'], as_index=False).agg({'unique_visits_id': 'nunique'})

The line above gives the visit counts; I am still trying to work out the rest of the process.


1 answer
Anonymous user
#1 · Posted 2024-06-02 05:00:14

I think the following is what you are asking for (the column name in the data you posted is misspelled; I have kept it as is, i.e. "unqiue_visit_id").

counts = (
    # sort by bought date
    merged_visits_df.sort_values('bought_date', ascending=False)
    # groupby desired cols
    .groupby(['user_id','product_type'],as_index=False)
    # apply desired aggregation functions
    .agg({'unqiue_visit_id': 'nunique', 'bought_date': 'first', 'refurbished': 'first'})
)

Then we can get the maximum visit count per user id:

max_by_user = counts.groupby('user_id')['unqiue_visit_id'].max()

Finally, we keep the rows whose visit count equals that user's maximum, sort by the desired columns, and take the first row per user:

result = (
    # filter to products with max visits by user
    counts[counts['user_id'].apply(max_by_user.get) == counts['unqiue_visit_id']]
    # sort bought_date descending (max on top), refurbished descending (yes above no)
    .sort_values(['bought_date', 'refurbished'], ascending=False)
    # groupby user id and select the first
    .groupby('user_id').nth(0)
)
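To check the logic end to end, here is a small self-contained sketch of the same three steps. The miniature dataset is hypothetical (it is not the question's data) and is built so that user "001" has a tie in visit counts. Two substitutions are made for illustration: `transform("max")` replaces the `apply(max_by_user.get)` lookup, and `drop_duplicates("user_id")` replaces `.groupby('user_id').nth(0)`, since the output shape of `nth` has differed between pandas versions.

```python
import pandas as pd

# Hypothetical miniature dataset: user "001" has 2 visits each for
# "computer" and "desk"; "desk" was bought more recently.
merged_visits_df = pd.DataFrame({
    "user_id":         ["001", "001", "001", "001", "002", "002", "002"],
    "product_type":    ["computer", "computer", "desk", "desk", "glass", "shoes", "shoes"],
    "bought_date":     [20120102, 20120102, 20120102, 20120103, 20120201, 20120201, 20120201],
    "refurbished":     ["yes", "yes", "yes", "yes", "no", "no", "no"],
    "unqiue_visit_id": [1010, 1011, 1110, 410, 1111, 2001, 2031],
})

# Step 1: visit count, latest date, and that row's refurbished flag per group.
counts = (
    merged_visits_df.sort_values("bought_date", ascending=False)
    .groupby(["user_id", "product_type"], as_index=False)
    .agg({"unqiue_visit_id": "nunique", "bought_date": "first", "refurbished": "first"})
)

# Step 2: broadcast each user's maximum count back onto the rows of counts.
user_max = counts.groupby("user_id")["unqiue_visit_id"].transform("max")

# Step 3: keep the max-count rows, sort so the preferred row is on top,
# then keep the first row per user.
result = (
    counts[counts["unqiue_visit_id"] == user_max]
    .sort_values(["bought_date", "refurbished"], ascending=False)
    .drop_duplicates("user_id")
    .set_index("user_id")
)

assigned = result["product_type"].to_dict()
```

Here user "001" ends up with "desk" (tie on count, more recent date) and user "002" with "shoes" (strictly higher count).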

It may be more intuitive to think of it this way:

Step 1: add the columns to sort on:

# initial question
visits_df = merged_visits_df.groupby(['user_id','product_type']).agg({'unqiue_visit_id': 'nunique'}).add_suffix('_count')
df_to_sort = merged_visits_df.merge(visits_df.reset_index())
# follow up question
df_to_sort['last_num'] = df_to_sort['store_id'] % 10

Then sort, group, and take the first row:

(
    df_to_sort
    # column names must be strings; count and date descending, last_num ascending
    .sort_values(['unqiue_visit_id_count', 'bought_date', 'last_num'], ascending=[False, False, True])
    .groupby(['user_id']).nth(0)
)
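The same single-sort idea can be sketched for the original tie-break rules (count, then most recent date, then refurbished yes over no). The data below is hypothetical and built so that user "001" is tied on both count and latest date. The helper columns `visit_count` and `is_refurbished` are names introduced here for illustration, and `drop_duplicates` stands in for `.nth(0)`.

```python
import pandas as pd

# Hypothetical data: for user "001", "computer" and "desk" tie on visit
# count (2 each) and on latest date (20120103); only "desk" is refurbished.
df = pd.DataFrame({
    "user_id":         ["001", "001", "001", "001", "002", "002", "002"],
    "product_type":    ["computer", "computer", "desk", "desk", "glass", "shoes", "shoes"],
    "bought_date":     [20120102, 20120103, 20120102, 20120103, 20120201, 20120201, 20120201],
    "refurbished":     ["no", "no", "yes", "yes", "no", "no", "no"],
    "unqiue_visit_id": [1010, 1011, 1110, 410, 1111, 2001, 2031],
})

# Attach the per-(user, product_type) distinct visit count to every row.
df["visit_count"] = (
    df.groupby(["user_id", "product_type"])["unqiue_visit_id"].transform("nunique")
)
# Explicit boolean so the tie-break does not rely on "yes" > "no" string order.
df["is_refurbished"] = df["refurbished"].eq("yes")

best = (
    df.sort_values(
        ["visit_count", "bought_date", "is_refurbished"],
        ascending=[False, False, False],
    )
    .drop_duplicates("user_id")          # first row per user after sorting
    .set_index("user_id")["product_type"]
)
```

Because all three sort keys are applied in one pass, the first row for each user is the one with the highest count, then the latest date, then the refurbished flag, so no separate filtering step is needed.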
