Grouping a DataFrame by week

Posted 2024-06-16 11:01:17


I have a DataFrame:

Date        Articles
2010-01-04  ((though, reliant, advertis, revenu, internet,...
2010-01-05  ((googl, expect, nexus, one, rival, iphon, hel...
2010-01-06  ((while, googl, introduc, first, piec, hardwar...
2010-01-07  ((googl, form, energi, subsidiari, appli, gove...
2010-01-08  ((david, pogu, review, googl, new, offer, nexu...
2010-01-12  ((the, compani, agre, hand, list, book, scan, ...

Date is the index, and each Articles value is a tuple of tuples.

I have another DataFrame:

Date        Price
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944

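Here Date is also the index, but the rows are spaced one week apart.

I want to create another column in the second DataFrame that holds all the articles for the week indicated by its index. For the first row of the second DataFrame, I want all the articles from the first DataFrame dated before 2010-01-08 (that is, the first 4 entries of the first DataFrame). Likewise, for 2010-01-15 I need all the articles from 2010-01-08 through 2010-01-14, and so on.

Any help would be greatly appreciated. Thanks.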


3 answers

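Here is a two-step solution using merge_asof with allow_exact_matches=False, so that each article row is matched with the first price whose date is strictly greater than (not equal to) the article row's date.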
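The matching rule can be checked without pandas. The following is only a sketch of the "first strictly later date" lookup, not of how merge_asof is implemented; the names `week_ends` and `next_week_end` are made up for illustration.

```python
from bisect import bisect_right
from datetime import date

# Weekly dates taken from the index of the price DataFrame in the question.
week_ends = [date(2010, 1, 8), date(2010, 1, 15),
             date(2010, 1, 22), date(2010, 1, 29)]

def next_week_end(day):
    """Return the first weekly date strictly after `day`, or None if there is none."""
    # bisect_right skips entries equal to `day`, which mirrors
    # allow_exact_matches=False combined with direction='forward'.
    i = bisect_right(week_ends, day)
    return week_ends[i] if i < len(week_ends) else None

print(next_week_end(date(2010, 1, 7)))   # article from 01-07
print(next_week_end(date(2010, 1, 8)))   # article from 01-08
```

An article dated 2010-01-07 lands on 2010-01-08, while one dated exactly 2010-01-08 is pushed to 2010-01-15, which is the grouping the question asks for.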

.agg(sum) relies on the fact that adding two tuples concatenates them into a single tuple.

Assume your DataFrames are named df and df2.
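As a quick plain-Python illustration of that tuple behaviour (the variable names and the shortened article tuples are made up for this sketch):

```python
# Two days' worth of articles, each a tuple of tuples as in the question.
monday = (("though", "reliant"),)
tuesday = (("googl", "expect"), ("one", "rival"))

# Adding tuples concatenates them.
combined = monday + tuesday

# The built-in sum does the same over many tuples, but it needs an explicit
# empty-tuple start value, because the default start of 0 cannot be added to a tuple.
total = sum([monday, tuesday], ())

print(combined)
print(total == combined)
```

Note that the inner tuples stay intact: only the outer level is concatenated.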

# Test data adapted from your examples.
# Sorry that this is difficult to copy-paste into pandas

df
            Articles
2010-01-04  (though, reliant, advertis, revenu, internet)        
2010-01-05  ((googl, expect, nexus), (one, rival, iphon))        
2010-01-06  ((while, googl, introduc), (first,), (piec, hardwar))
2010-01-07  ((googl, form), (energi, subsidiari), (appli,))      
2010-01-08  ((david, pogu, review), (googl, new, offer))         
2010-01-12  ((the, compani), (agre, hand, list), (book, scan)) 

df2
            Price               
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944


# Solution

price2articles = (pd.merge_asof(df, 
                               df2, 
                               left_index=True, 
                               right_index=True, 
                               allow_exact_matches=False,
                               direction='forward')
                .groupby('Price')
                .agg(sum))

result = pd.merge(df2, price2articles, left_on='Price', right_index=True)
# To see full contents of wide data, set
# pd.options.display.max_colwidth = 150 or higher (None for no limit)
result

            Articles                                                                                                                                                                                                          
2010-01-08  (though, reliant, advertis, revenu, internet, (googl, expect, nexus), (one, rival, iphon), (while, googl, introduc), (first,), (piec, hardwar), (googl, form), (energi, subsidiari), (appli,))  
2010-01-15  ((david, pogu, review), (googl, new, offer), (the, compani), (agre, hand, list), (book, scan))
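For readers who want something they can paste and run, here is a minimal sketch of the same merge_asof idea. The sample data is shortened and invented for illustration, and the helper column `Week` is an assumption of this sketch: it groups by the matched weekly date instead of by `Price`, which avoids merging two weeks that happen to share a price.

```python
import pandas as pd

# Hypothetical, shortened sample data modelled on the question.
df = pd.DataFrame(
    {"Articles": [
        (("though", "reliant"),),
        (("googl", "expect"), ("one", "rival")),
        (("while", "googl"),),
        (("googl", "form"),),
        (("david", "pogu"),),
        (("the", "compani"), ("agre", "hand")),
    ]},
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06",
                          "2010-01-07", "2010-01-08", "2010-01-12"]),
)
df2 = pd.DataFrame(
    {"Price": [602.020, 580.000, 550.010, 529.944]},
    index=pd.to_datetime(["2010-01-08", "2010-01-15",
                          "2010-01-22", "2010-01-29"]),
)

# Expose each weekly date as a column so it survives the merge.
weeks = pd.DataFrame({"Week": df2.index}, index=df2.index)

# Match every article row to the first weekly date strictly after it.
matched = pd.merge_asof(df, weeks, left_index=True, right_index=True,
                        allow_exact_matches=False, direction="forward")

# Concatenate the tuples of each week; () is the start value for sum.
weekly = matched.groupby("Week")["Articles"].agg(lambda s: sum(s, ()))

# Attach the combined articles to the price rows; weeks without articles get NaN.
result = df2.join(weekly.rename("Articles"))
print(result)
```

Both indexes must be sorted for merge_asof, which is already the case for this data.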

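We can use pd.cut together with an IntervalIndex built by pd.IntervalIndex.from_breaks, then group by the resulting bins: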

df1 = pd.DataFrame({'Articles': 
                   {pd.Timestamp('2010-01-04 00:00:00'): [0, 1],
                    pd.Timestamp('2010-01-05 00:00:00'): [2, 3],
                    pd.Timestamp('2010-01-06 00:00:00'): [4, 5],
                    pd.Timestamp('2010-01-07 00:00:00'): [6, 7],
                    pd.Timestamp('2010-01-08 00:00:00'): [8, 9],
                    pd.Timestamp('2010-01-12 00:00:00'): [10, 11]}})

            Articles
2010-01-04  [0, 1]
2010-01-05  [2, 3]
2010-01-06  [4, 5]
2010-01-07  [6, 7]
2010-01-08  [8, 9]
2010-01-12  [10, 11]

mybins = pd.IntervalIndex.from_breaks(
             pd.date_range("2010-1-1", periods=5, freq="7D"),
             closed="left"
         )

df1["bin"] = pd.cut(df1.index, bins=mybins)
df1.groupby("bin")["Articles"].sum()

bin
[2010-01-01, 2010-01-08)    [0, 1, 2, 3, 4, 5, 6, 7]
[2010-01-08, 2010-01-15)              [8, 9, 10, 11]
[2010-01-15, 2010-01-22)                        None
[2010-01-22, 2010-01-29)                        None
Name: Articles, dtype: object

I think you need groupby, joining the tuples into a list, with the groups taken from the values of pd.cut:

print (df1)
        Date          Articles
0 2010-01-04  ((t, r), (s, q))
1 2010-01-07  ((g, f), (y, l))
2 2010-01-08  ((d, p), (t, o))
3 2010-01-12  ((t, c), (r, p))

b = pd.concat([df2['Date'], 
               pd.Series(pd.to_datetime(['1970-01-01','2100-01-01']))]).sort_values()

df1['Dates'] = pd.cut(df1['Date'], bins=b, labels=b[1:], right=False)
df3 = (df1.groupby('Dates')['Articles']
         .apply(lambda x: [i for s in x for i in s])
         .iloc[:-1]
         .reset_index())
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
2 2010-01-22                                []
3 2010-01-29                                []

Finally, if you want to filter out the empty lists:

df3 = df3[df3['Articles'].astype(bool)]
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
