如何在Pandas中计算满足特定条件的行之间的天数差异

2 投票
2 回答
55 浏览
提问于 2025-04-14 17:45

我刚开始学习Pandas。
我有一个数据表,想要计算每个“ID”从“访问”到现在的天数差,也就是想得到“CV_days_after_visit”这个结果。
请问我该如何用Python中的Pandas来实现这个呢?

import pandas as pd
data1 = {'ID': ['A2A', 'A2A', 'A2A', 'BB3', 'BB3', 'BB3', '5EE', '5EE'], 'Action': ['Visit', 'CV', 'CV', 'Visit', 'Visit', 'CV', 'Visit', 'CV'], 'date': ['2023/4/1', '2023/4/5', '2023/4/7', '2023/5/5', '2023/5/29', '2023/5/30', '2023/6/1', '2023/6/10']}
df = pd.DataFrame(data1)

print (df)
    ID  Acton       date
0  A2A  Visit   2023/4/1
1  A2A     CV   2023/4/5
2  A2A     CV   2023/4/7
3  BB3  Visit   2023/5/5
4  BB3  Visit  2023/5/29
5  BB3     CV  2023/5/30
6  5EE  Visit   2023/6/1
7  5EE     CV  2023/6/10

enter image description here

提前谢谢你们。

2 个回答

1

你可以这样操作,这种方法比之前的回答稍微复杂一些,但如果你是初学者,我觉得这样可能更清晰,因为这些命令其实很简单:

df['date'] = pd.to_datetime(df['date'])

# Sort the DataFrame by 'ID' and 'date'
df = df.sort_values(by=['ID', 'date'])

# Filter rows where 'Action' is 'CV'
cv_rows = df[df['Action'] == 'CV']

# Find the last visit date for each ID
last_visit_dates = df[df['Action'] == 'Visit'].groupby('ID')['date'].last().reset_index()
last_visit_dates.columns = ['ID', 'last_visit_date']

# Merge last visit dates with CV rows
cv_rows = pd.merge(cv_rows, last_visit_dates, on='ID')

# Calculate the difference in days between CV date and last visit date
cv_rows['CV_days_after_visit'] = (cv_rows['date'] - cv_rows['last_visit_date']).dt.days

# Merge the calculated CV days after visit back to the original DataFrame
df = pd.merge(df, cv_rows[['ID', 'date', 'CV_days_after_visit']], on=['ID', 'date'], how='left')
5

你可以通过使用GroupBy.ffill来填充日期,然后用Series.sub来减去Visit,接着用Series.dt.days把时间差转换成天数。如果有必要,最后可以用Series.mask来去掉0天的记录。

#convert column to datetimes
df['date'] = pd.to_datetime(df['date'])

m = df['Acton'].eq('Visit')
df['CV_days_after_visit'] = (df['date'].sub(df['date'].where(m).groupby(df['ID']).ffill())
                                       .dt.days
                                       .mask(m))
print (df)
    ID  Acton       date  CV_days_after_visit
0  A2A  Visit 2023-04-01                  NaN
1  A2A     CV 2023-04-05                  4.0
2  A2A     CV 2023-04-07                  6.0
3  BB3  Visit 2023-05-05                  NaN
4  BB3  Visit 2023-05-29                  NaN
5  BB3     CV 2023-05-30                  1.0
6  5EE  Visit 2023-06-01                  NaN
7  5EE     CV 2023-06-10                  9.0

这个过程是怎么运作的:

df['date'] = pd.to_datetime(df['date'])

m = df['Acton'].eq('Visit')

print (df.assign(m1 = df['date'].where(m),
                 ffill = df['date'].where(m).groupby(df['ID']).ffill(),
                 sub = df['date'].sub(df['date'].where(m).groupby(df['ID']).ffill()),
                 days = df['date'].sub(df['date'].where(m).groupby(df['ID']).ffill())
                                  .dt.days))

    ID  Acton       date         m1      ffill    sub  days
0  A2A  Visit 2023-04-01 2023-04-01 2023-04-01 0 days     0
1  A2A     CV 2023-04-05        NaT 2023-04-01 4 days     4
2  A2A     CV 2023-04-07        NaT 2023-04-01 6 days     6
3  BB3  Visit 2023-05-05 2023-05-05 2023-05-05 0 days     0
4  BB3  Visit 2023-05-29 2023-05-29 2023-05-29 0 days     0
5  BB3     CV 2023-05-30        NaT 2023-05-29 1 days     1
6  5EE  Visit 2023-06-01 2023-06-01 2023-06-01 0 days     0
7  5EE     CV 2023-06-10        NaT 2023-06-01 9 days     9

撰写回答