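How to group by multiple columns in a pct_change calculation on a pandas DataFrame
I am running a pct_change calculation on a pandas DataFrame. When the month column is in order, everything works. When it is not sorted, the results are wrong.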
Here is my current code:
from datetime import datetime, date
from pandas import DataFrame
data = [
('product_a','1/31/2014',53)
,('product_b','1/31/2014',44)
,('product_c','1/31/2014',36)
,('product_a','11/30/2013',52)
,('product_b','11/30/2013',43)
,('product_c','11/30/2013',35)
,('product_a','3/31/2014',50)
,('product_b','3/31/2014',41)
,('product_c','3/31/2014',34)
,('product_a','12/31/2013',50)
,('product_b','12/31/2013',41)
,('product_c','12/31/2013',34)
,('product_a','2/28/2014',52)
,('product_b','2/28/2014',43)
,('product_c','2/28/2014',35)
]
product_df = DataFrame(data, columns=['prod_desc','activity_month','prod_count'])
# Convert each date string to ISO format (YYYY-MM-DD), one row at a time
for index, row in product_df.iterrows():
    row['activity_month'] = datetime.strptime(row['activity_month'],'%m/%d/%Y')
    product_df.loc[index, 'activity_month'] = date.strftime(row['activity_month'],'%Y-%m-%d')
# pct_change is computed on the rows in their current (unsorted) order
product_df['pct_ch'] = product_df.groupby('prod_desc')['prod_count'].pct_change()
product_df = product_df.sort_values(['prod_desc','activity_month'])
The result I get is:
prod_desc activity_month prod_count pct_ch
3 product_a 2013-11-30 52 -0.018868
9 product_a 2013-12-31 50 0.000000
0 product_a 2014-01-31 53 NaN
12 product_a 2014-02-28 52 0.040000
6 product_a 2014-03-31 50 -0.038462
4 product_b 2013-11-30 43 -0.022727
10 product_b 2013-12-31 41 0.000000
1 product_b 2014-01-31 44 NaN
13 product_b 2014-02-28 43 0.048780
7 product_b 2014-03-31 41 -0.046512
5 product_c 2013-11-30 35 -0.027778
11 product_c 2013-12-31 34 0.000000
2 product_c 2014-01-31 36 NaN
14 product_c 2014-02-28 35 0.029412
8 product_c 2014-03-31 34 -0.028571
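The calculation is scrambled here: the pct_change for each product's first month should be NaN (that is, no value).
I think the problem is that the pct_change calculation does not include 'activity_month' in the grouping. When I try adding it, the output I get is: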
product_df['pct_ch'] = product_df.groupby(['prod_desc','activity_month'])['prod_count'].pct_change()
prod_desc activity_month prod_count pct_ch
3 product_a 2013-11-30 52 NaN
9 product_a 2013-12-31 50 NaN
0 product_a 2014-01-31 53 NaN
12 product_a 2014-02-28 52 NaN
6 product_a 2014-03-31 50 NaN
4 product_b 2013-11-30 43 NaN
10 product_b 2013-12-31 41 NaN
1 product_b 2014-01-31 44 NaN
13 product_b 2014-02-28 43 NaN
7 product_b 2014-03-31 41 NaN
5 product_c 2013-11-30 35 NaN
11 product_c 2013-12-31 34 NaN
2 product_c 2014-01-31 36 NaN
14 product_c 2014-02-28 35 NaN
8 product_c 2014-03-31 34 NaN
1 Answer
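I think the problem is that groupby computes the percentage change between adjacent rows that share the same prod_desc in the order the rows appear in the frame, not in date order. Sorting before the groupby should therefore fix it. You can also drop the for loop: pandas can convert the dates in a single line.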
import pandas as pd
data = [
('product_a','1/31/2014',53)
,('product_b','1/31/2014',44)
,('product_c','1/31/2014',36)
,('product_a','11/30/2013',52)
,('product_b','11/30/2013',43)
,('product_c','11/30/2013',35)
,('product_a','3/31/2014',50)
,('product_b','3/31/2014',41)
,('product_c','3/31/2014',34)
,('product_a','12/31/2013',50)
,('product_b','12/31/2013',41)
,('product_c','12/31/2013',34)
,('product_a','2/28/2014',52)
,('product_b','2/28/2014',43)
,('product_c','2/28/2014',35)
]
product_df = pd.DataFrame(data, columns=['prod_desc','activity_month','prod_count'])
product_df['activity_month'] = pd.to_datetime(product_df['activity_month'],
                                              format='%m/%d/%Y')
product_df = product_df.sort_values(['prod_desc','activity_month'])
product_df['pct_ch'] = product_df.groupby('prod_desc')['prod_count'].pct_change()
I think this should give you the result you want.
prod_desc activity_month prod_count pct_ch
3 product_a 2013-11-30 52 NaN
9 product_a 2013-12-31 50 -0.038462
0 product_a 2014-01-31 53 0.060000
12 product_a 2014-02-28 52 -0.018868
6 product_a 2014-03-31 50 -0.038462
4 product_b 2013-11-30 43 NaN
10 product_b 2013-12-31 41 -0.046512
1 product_b 2014-01-31 44 0.073171
13 product_b 2014-02-28 43 -0.022727
7 product_b 2014-03-31 41 -0.046512
5 product_c 2013-11-30 35 NaN
11 product_c 2013-12-31 34 -0.028571
2 product_c 2014-01-31 36 0.058824
14 product_c 2014-02-28 35 -0.027778
8 product_c 2014-03-31 34 -0.028571
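As a side note, here is a minimal sketch of two related points, using a shortened copy of the question's data (the names df, group_sizes and both_cols are only illustrative). It checks why grouping by both columns returns NaN everywhere: each (prod_desc, activity_month) pair forms a group of one row, so there is no previous row to compare against. It also shows a variant that sorts only for the calculation and relies on index alignment when assigning, so the frame keeps its original row order.

```python
import pandas as pd

# Shortened copy of the question's data, deliberately out of date order.
data = [
    ('product_a', '1/31/2014', 53),
    ('product_b', '1/31/2014', 44),
    ('product_a', '11/30/2013', 52),
    ('product_b', '11/30/2013', 43),
    ('product_a', '12/31/2013', 50),
    ('product_b', '12/31/2013', 41),
]
df = pd.DataFrame(data, columns=['prod_desc', 'activity_month', 'prod_count'])
df['activity_month'] = pd.to_datetime(df['activity_month'], format='%m/%d/%Y')

# Grouping by both columns puts every row in its own group,
# so pct_change has nothing to compare each row with.
group_sizes = df.groupby(['prod_desc', 'activity_month']).size()
both_cols = df.groupby(['prod_desc', 'activity_month'])['prod_count'].pct_change()

# Sort only for the calculation. The result keeps the original index
# labels, and column assignment aligns on them, so df is not reordered.
df['pct_ch'] = (
    df.sort_values(['prod_desc', 'activity_month'])
      .groupby('prod_desc')['prod_count']
      .pct_change()
)
print(df)
```

If you want the sorted display as well, you can still call sort_values on the result afterwards; the pct_ch values will not change.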