Grouping a DataFrame by week

Date        Articles
2010-01-04  ((though, reliant, advertis, revenu, internet,...
2010-01-05  ((googl, expect, nexus, one, rival, iphon, hel...
2010-01-06  ((while, googl, introduc, first, piec, hardwar...
2010-01-07  ((googl, form, energi, subsidiari, appli, gove...
2010-01-08  ((david, pogu, review, googl, new, offer, nexu...
2010-01-12  ((the, compani, agre, hand, list, book, scan, ...

3 Answers

User

Answer 1 · edited 2024-06-16 11:01:17

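Here is a two-step solution using merge_asof with allow_exact_matches=False, so that each article row is matched with the price at the first date strictly greater than (not equal to) the article row's date.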

.agg(sum) relies on the fact that adding two tuples concatenates them into a single tuple.
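As a small illustration of that tuple behavior, here is a plain-Python sketch with no pandas involved. The sample tuples are made up for the example and are not the question's data; functools.reduce with operator.add stands in for what the aggregation does to the tuples within one group.

```python
from functools import reduce
import operator

# Hypothetical per-day article tuples (illustrative values only).
daily_articles = [
    (("googl", "expect"), ("one", "rival")),
    (("while", "googl"),),
    (("googl", "form"), ("energi",)),
]

# Adding tuples concatenates them, so folding the list with + merges all
# days into one tuple of word tuples. The empty tuple is the start value.
combined = reduce(operator.add, daily_articles, ())
print(combined)
```

One consequence worth keeping in mind: a row that holds a flat tuple of words, rather than a tuple of tuples, contributes its individual words to the combined tuple.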

Assume your DataFrames are named df and df2:

# Test data adapted from your examples.
# Sorry that this is difficult to copy-paste into pandas

df
            Articles
2010-01-04  (though, reliant, advertis, revenu, internet)        
2010-01-05  ((googl, expect, nexus), (one, rival, iphon))        
2010-01-06  ((while, googl, introduc), (first,), (piec, hardwar))
2010-01-07  ((googl, form), (energi, subsidiari), (appli,))      
2010-01-08  ((david, pogu, review), (googl, new, offer))         
2010-01-12  ((the, compani), (agre, hand, list), (book, scan)) 

df2
            Price               
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944


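# Solution

import pandas as pd

# Step 1: match each article row to the first price date strictly after it,
# then concatenate the article tuples that share a price.
price2articles = (pd.merge_asof(df, 
                               df2, 
                               left_index=True, 
                               right_index=True, 
                               allow_exact_matches=False,
                               direction='forward')
                .groupby('Price')
                .agg(sum))

# Step 2: attach the combined articles back to the price dates.
result = pd.merge(df2, price2articles, left_on='Price', right_index=True)
# To see full contents of wide data, set
# pd.options.display.max_colwidth = 150 or higher
# (None for no limit; older pandas versions used -1)
result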

            Articles                                                                                                                                                                                                          
2010-01-08  (though, reliant, advertis, revenu, internet, (googl, expect, nexus), (one, rival, iphon), (while, googl, introduc), (first,), (piec, hardwar), (googl, form), (energi, subsidiari), (appli,))  
2010-01-15  ((david, pogu, review), (googl, new, offer), (the, compani), (agre, hand, list), (book, scan))
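Since the test data above is hard to paste into pandas, the following is a self-contained sketch of the same two ideas on a reduced data set. The single-word tuples and the names articles, prices, matched, and weekly are assumptions made for illustration. The flattening uses an explicit generator rather than .agg(sum), which should yield the same concatenation.

```python
import pandas as pd

# Hypothetical, reduced article data indexed by date.
articles = pd.DataFrame(
    {"Articles": [("a",), ("b",), ("c",), ("d",)]},
    index=pd.to_datetime(["2010-01-04", "2010-01-07", "2010-01-08", "2010-01-12"]),
)
# Two of the weekly prices from the example above.
prices = pd.DataFrame(
    {"Price": [602.02, 580.0]},
    index=pd.to_datetime(["2010-01-08", "2010-01-15"]),
)

# Match each article date to the first price date strictly after it.
matched = pd.merge_asof(
    articles,
    prices,
    left_index=True,
    right_index=True,
    allow_exact_matches=False,
    direction="forward",
)

# Concatenate the tuples of all rows that were matched to the same price.
weekly = (
    matched.groupby("Price", sort=False)["Articles"]
    .apply(lambda s: tuple(word for tup in s for word in tup))
)
print(matched)
print(weekly)
```

In this sketch the 2010-01-08 article is matched to the 2010-01-15 price, because exact date matches are excluded.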

User

Answer 2 · edited 2024-06-16 11:01:17

We can make use of pd.cut and pd.IntervalIndex.from_breaks:

df1 = pd.DataFrame({'Articles': 
                   {pd.Timestamp('2010-01-04 00:00:00'): [0, 1],
                    pd.Timestamp('2010-01-05 00:00:00'): [2, 3],
                    pd.Timestamp('2010-01-06 00:00:00'): [4, 5],
                    pd.Timestamp('2010-01-07 00:00:00'): [6, 7],
                    pd.Timestamp('2010-01-08 00:00:00'): [8, 9],
                    pd.Timestamp('2010-01-12 00:00:00'): [10, 11]}})

            Articles
2010-01-04  [0, 1]
2010-01-05  [2, 3]
2010-01-06  [4, 5]
2010-01-07  [6, 7]
2010-01-08  [8, 9]
2010-01-12  [10, 11]

mybins = pd.IntervalIndex.from_breaks(
             pd.date_range("2010-1-1", periods=5, freq="7D"),
             closed="left"
         )

df1["bin"] = pd.cut(df1.index, bins=mybins)
df1.groupby("bin")["Articles"].sum()

bin
[2010-01-01, 2010-01-08)    [0, 1, 2, 3, 4, 5, 6, 7]
[2010-01-08, 2010-01-15)              [8, 9, 10, 11]
[2010-01-15, 2010-01-22)                        None
[2010-01-22, 2010-01-29)                        None
Name: Articles, dtype: object
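The bins above are built with closed="left", so each interval includes its start date and excludes its end date. A minimal check of that boundary behavior, using a single hand-built pd.Interval (the variable names are illustrative only):

```python
import pandas as pd

# One left-closed weekly interval, like the first bin above.
week = pd.Interval(
    pd.Timestamp("2010-01-01"),
    pd.Timestamp("2010-01-08"),
    closed="left",
)

# The start date belongs to the interval, the end date does not.
start_inside = pd.Timestamp("2010-01-01") in week
end_inside = pd.Timestamp("2010-01-08") in week
print(start_inside, end_inside)
```

This is why the 2010-01-08 row lands in the second bin rather than the first.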

User

Answer 3 · edited 2024-06-16 11:01:17

I think you need groupby, joining the tuples into lists, with the groups given by the bins created with cut:

print (df1)
        Date          Articles
0 2010-01-04  ((t, r), (s, q))
1 2010-01-07  ((g, f), (y, l))
2 2010-01-08  ((d, p), (t, o))
3 2010-01-12  ((t, c), (r, p))

b = pd.concat([df2['Date'], 
               pd.Series(pd.to_datetime(['1970-01-01','2100-01-01']))]).sort_values()

df1['Dates'] = pd.cut(df1['Date'], bins=b, labels=b[1:], right=False)
df3 = (df1.groupby('Dates')['Articles']
         .apply(lambda x: [i for s in x for i in s])
         .iloc[:-1]
         .reset_index())
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
2 2010-01-22                                []
3 2010-01-29                                []

Finally, if you want to filter out the empty lists:

df3 = df3[df3['Articles'].astype(bool)]
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
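The filter works because an empty list is falsy in Python, so casting the object column to bool yields False exactly for the empty groups. A minimal sketch on a made-up Series (the values and the names articles, mask, and kept are assumptions for illustration):

```python
import pandas as pd

# Hypothetical column of lists, one of them empty.
articles = pd.Series([[("t", "r")], [], [("d", "p"), ("t", "o")]])

# bool([]) is False, so the mask is False only for the empty list.
mask = articles.astype(bool)
kept = articles[mask]
print(mask.tolist())
print(kept)
```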
