如何在Pandas中计算满足特定条件的行之间的天数差异
我刚开始学习Pandas。
我有一个数据表,想要计算每个“ID”从“访问”到现在的天数差,也就是想得到“CV_days_after_visit”这个结果。
请问我该如何用Python中的Pandas来实现这个呢?
import pandas as pd
data1 = {'ID': ['A2A', 'A2A', 'A2A', 'BB3', 'BB3', 'BB3', '5EE', '5EE'], 'Action': ['Visit', 'CV', 'CV', 'Visit', 'Visit', 'CV', 'Visit', 'CV'], 'date': ['2023/4/1', '2023/4/5', '2023/4/7', '2023/5/5', '2023/5/29', '2023/5/30', '2023/6/1', '2023/6/10']}
df = pd.DataFrame(data1)
print (df)
ID Acton date
0 A2A Visit 2023/4/1
1 A2A CV 2023/4/5
2 A2A CV 2023/4/7
3 BB3 Visit 2023/5/5
4 BB3 Visit 2023/5/29
5 BB3 CV 2023/5/30
6 5EE Visit 2023/6/1
7 5EE CV 2023/6/10
提前谢谢你们。
2 个回答
1
你可以这样操作,这种方法比之前的回答稍微复杂一些,但如果你是初学者,我觉得这样可能更清晰,因为这些命令其实很简单:
df['date'] = pd.to_datetime(df['date'])
# Sort the DataFrame by 'ID' and 'date'
df = df.sort_values(by=['ID', 'date'])
# Filter rows where 'Action' is 'CV'
cv_rows = df[df['Action'] == 'CV']
# Find the last visit date for each ID
last_visit_dates = df[df['Action'] == 'Visit'].groupby('ID')['date'].last().reset_index()
last_visit_dates.columns = ['ID', 'last_visit_date']
# Merge last visit dates with CV rows
cv_rows = pd.merge(cv_rows, last_visit_dates, on='ID')
# Calculate the difference in days between CV date and last visit date
cv_rows['CV_days_after_visit'] = (cv_rows['date'] - cv_rows['last_visit_date']).dt.days
# Merge the calculated CV days after visit back to the original DataFrame
df = pd.merge(df, cv_rows[['ID', 'date', 'CV_days_after_visit']], on=['ID', 'date'], how='left')
5
你可以通过使用GroupBy.ffill
来填充日期,然后用Series.sub
来减去Visit
,接着用Series.dt.days
把时间差转换成天数。如果有必要,最后可以用Series.mask
来去掉0
天的记录。
#convert column to datetimes
df['date'] = pd.to_datetime(df['date'])
m = df['Acton'].eq('Visit')
df['CV_days_after_visit'] = (df['date'].sub(df['date'].where(m).groupby(df['ID']).ffill())
.dt.days
.mask(m))
print (df)
ID Acton date CV_days_after_visit
0 A2A Visit 2023-04-01 NaN
1 A2A CV 2023-04-05 4.0
2 A2A CV 2023-04-07 6.0
3 BB3 Visit 2023-05-05 NaN
4 BB3 Visit 2023-05-29 NaN
5 BB3 CV 2023-05-30 1.0
6 5EE Visit 2023-06-01 NaN
7 5EE CV 2023-06-10 9.0
这个过程是怎么运作的:
df['date'] = pd.to_datetime(df['date'])
m = df['Acton'].eq('Visit')
print (df.assign(m1 = df['date'].where(m),
ffill = df['date'].where(m).groupby(df['ID']).ffill(),
sub = df['date'].sub(df['date'].where(m).groupby(df['ID']).ffill()),
days = df['date'].sub(df['date'].where(m).groupby(df['ID']).ffill())
.dt.days))
ID Acton date m1 ffill sub days
0 A2A Visit 2023-04-01 2023-04-01 2023-04-01 0 days 0
1 A2A CV 2023-04-05 NaT 2023-04-01 4 days 4
2 A2A CV 2023-04-07 NaT 2023-04-01 6 days 6
3 BB3 Visit 2023-05-05 2023-05-05 2023-05-05 0 days 0
4 BB3 Visit 2023-05-29 2023-05-29 2023-05-29 0 days 0
5 BB3 CV 2023-05-30 NaT 2023-05-29 1 days 1
6 5EE Visit 2023-06-01 2023-06-01 2023-06-01 0 days 0
7 5EE CV 2023-06-10 NaT 2023-06-01 9 days 9