月度经常性支出的Python代码
我正在尝试找出每月重复的支出,这些支出需要有相同的ID、类别和金额。而且这些支出必须至少在连续的3个月中出现。
我写的代码没有得到正确的结果。
import pandas as pd
# Updated Sample DataFrame
data = {
'id': [1, 1, 1, 2, 2, 2, 1, 1, 2, 2,2,2],
'date': ['2023-01-17', '2023-01-15', '2022-11-16', '2023-01-16', '2022-12-14', '2022-11-10', '2022-12-20', '2022-12-10', '2023-01-25', '2022-11-05','2022-10-05','2022-09-07'],
'category': ['Groceries', 'Utilities', 'Groceries', 'Groceries', 'Utilities', 'Groceries', 'Groceries', 'Utilities', 'Groceries', 'Utilities','Utilities','Utilities'],
'amount': [50, 50, 50, 60, 80, 80, 50, 55, 75, 75,75,75]
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date']) # Convert 'date' column to datetime format
# Sort DataFrame
df = df.sort_values(by=['id','category','amount', 'date'], ascending=[True, True,True,False])
df['recurring'] = 0
for i in range(len(df)-2):
if (df.iloc[i]['id'] == df.iloc[i+1]['id'] == df.iloc[i+1]['id']) and \
(df.iloc[i]['category'] == df.iloc[i+1]['category'] == df.iloc[i+1]['category']) and \
df.iloc[i]['amount'] == df.iloc[i+1]['amount'] == df.iloc[i+1]['amount'] and \
(df.iloc[i]['date'].to_period('M') - df.iloc[i-1]['date'].to_period('M')) == 1 and (df.iloc[i-1]['date'].to_period('M') - df.iloc[i-2]['date'].to_period('M')) == 1:
df['recurring'] = 1
print(df)
但是我得到的结果是:所有重复支出的值都是0。我的逻辑哪里出错了呢?
id date category amount recurring
0 1 2023-01-17 Groceries 50 0
6 1 2022-12-20 Groceries 50 0
2 1 2022-11-16 Groceries 50 0
1 1 2023-01-15 Utilities 50 0
7 1 2022-12-10 Utilities 55 0
3 2 2023-01-16 Groceries 60 0
8 2 2023-01-25 Groceries 75 0
5 2 2022-11-10 Groceries 80 0
9 2 2022-11-05 Utilities 75 0
10 2 2022-10-05 Utilities 75 0
11 2 2022-09-07 Utilities 75 0
4 2 2022-12-14 Utilities 80 0
我期望的输出是:
id date category amount recurring
0 1 2023-01-17 Groceries 50 1
6 1 2022-12-20 Groceries 50 0
2 1 2022-11-16 Groceries 50 0
1 1 2023-01-15 Utilities 50 0
7 1 2022-12-10 Utilities 55 0
3 2 2023-01-16 Groceries 60 0
8 2 2023-01-25 Groceries 75 0
5 2 2022-11-10 Groceries 80 0
9 2 2022-11-05 Utilities 75 1
10 2 2022-10-05 Utilities 75 0
11 2 2022-09-07 Utilities 75 0
4 2 2022-12-14 Utilities 80 0
1 个回答
0
你可以使用 groupby.rolling
,不过因为按月份滚动的功能不支持,所以你需要把日期转换成按月的时间段:
df['period'] = pd.to_datetime(df['date']).dt.to_period('M')
group = ['id', 'category', 'amount']
N = 3 # number of consecutive months
df['recurring'] = (df
.assign(recurring=1)
.sort_values(by=group+['date'])
# ensure we count a type of group only once
.drop_duplicates(group+['period'])
.groupby(group).rolling(N, on='period')['recurring']
# check that we have N consecutive months in the window
.apply(lambda x: x.index.nunique() == N and
(x.index.max()-x.index.min()).n == N-1)
.fillna(0).astype(int)
.reset_index(name='recurring')
.merge(df[group+['period']], how='right')
['recurring'].to_numpy()
)
输出结果:
id date category amount recurring period
0 1 2023-01-17 Groceries 50 1 2023-01
6 1 2022-12-20 Groceries 50 0 2022-12
2 1 2022-11-16 Groceries 50 0 2022-11
1 1 2023-01-15 Utilities 50 0 2023-01
7 1 2022-12-10 Utilities 55 0 2022-12
3 2 2023-01-16 Groceries 60 0 2023-01
8 2 2023-01-25 Groceries 75 0 2023-01
5 2 2022-11-10 Groceries 80 0 2022-11
9 2 2022-11-05 Utilities 75 1 2022-11
10 2 2022-10-05 Utilities 75 0 2022-10
11 2 2022-09-07 Utilities 75 0 2022-09
4 2 2022-12-14 Utilities 80 0 2022-12
另外,你也可以使用按天的窗口(rolling
不支持按月份的频率):
df['date'] = pd.to_datetime(df['date'])
window = '63D'
N = 3
df['recurring'] = (df
.sort_values(by=['id', 'category', 'date'])
.groupby(['id', 'category']).rolling(window, on='date')
['amount'].apply(lambda x: len(x)).ge(N).astype(int)
.reset_index(name='recurring')
.merge(df[['id', 'category', 'date']].drop_duplicates(),
how='right')
['recurring']
)