Splitting intervals to add specific dates
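I have a fairly large dataset, with 700,000 rows and more than 100 columns, that contains multiple entity_id values and multiple time intervals. Many attr columns are associated with different values.
I would like to split these time intervals so that the dates in specific_dt are integrated for each entity_id. When an interval is split, the newly created intervals inherit the attr values of their parent.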
Here is a small reproducible example:
import pandas as pd

have = {'entity_id': [1, 1, 2, 2],
        'start_date': ['2014-12-01 00:00:00', '2015-03-01 00:00:00', '2018-02-12 00:00:00', '2019-02-01 00:00:00'],
        'end_date': ['2015-02-28 23:59:59', '2015-05-31 23:59:59', '2019-01-31 23:59:59', '2023-05-28 23:59:59'],
        'attr1': ['A', 'B', 'D', 'J']}
have = pd.DataFrame(data=have)
have
entity_id start_date end_date attr1
0 1 2014-12-01 00:00:00 2015-02-28 23:59:59 A
1 1 2015-03-01 00:00:00 2015-05-31 23:59:59 B
2 2 2018-02-12 00:00:00 2019-01-31 23:59:59 D
3 2 2019-02-01 00:00:00 2023-05-28 23:59:59 J
# Specific dates to integrate
specific_dt = ['2015-01-01 00:00:00', '2015-03-31 00:00:00']
The expected output is as follows:
want
entity_id start_date end_date attr1
0 1 2014-12-01 2014-12-31 23:59:59 A
0 1 2015-01-01 2015-02-28 23:59:59 A
1 1 2015-03-01 2015-03-30 23:59:59 B
1 1 2015-03-31 2015-05-31 23:59:59 B
2 2 2018-02-12 2019-01-31 23:59:59 D
3 2 2019-02-01 2023-05-28 23:59:59 J
I managed to get the desired output with the following code:
# Create a list to store the new rows
new_rows = []
# Iterate through each row in the initial DataFrame
for index, row in have.iterrows():
    id_val = row['entity_id']
    start_date = pd.to_datetime(row['start_date'])
    end_date = pd.to_datetime(row['end_date'], errors='coerce')
    # Iterate through specific dates and create new rows
    for date in specific_dt:
        specific_date = pd.to_datetime(date)
        # Check if the specific date is within the interval
        if start_date < specific_date < end_date:
            # Create a new row with all columns and append it to the list
            new_row = row.copy()
            new_row['start_date'] = start_date
            new_row['end_date'] = specific_date - pd.Timedelta(seconds=1)
            new_rows.append(new_row)
            # Update the start_date for the next iteration
            start_date = specific_date
    # Add the last part of the original interval as a new row
    new_row = row.copy()
    new_row['start_date'] = start_date
    new_row['end_date'] = end_date
    new_rows.append(new_row)
# Create a new DataFrame from the list of new rows
want = pd.DataFrame(data=new_rows)
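However, on my working dataset this process is very slow (more than 10 minutes). Is it possible to optimize it, perhaps by getting rid of the for loops?
For reference, I can do this in SAS within a few seconds. The example below handles one of the two specific dates to integrate.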
data want;
set have;
by entity_id start_date end_date;
if start_date < "31MAR2015"d < end_date then
do;
retain _start _end;
_start = start_date;
_end = end_date;
end_date = "30MAR2015"d;
output;
start_date = "31MAR2015"d;
end_date = _end;
output;
end;
else output;
drop _start _end;
run;
1 Answer
You can try this:
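have["start_date"] = pd.to_datetime(have["start_date"])
have["end_date"] = pd.to_datetime(have["end_date"])
specific_dt = [
    pd.to_datetime("2015-01-01 00:00:00"),
    pd.to_datetime("2015-03-31 00:00:00"),
]
l = [have]
for dt in specific_dt:
    # Rows whose interval strictly contains dt
    mask = (have["start_date"] < dt) & (have["end_date"] > dt)
    # Copy the matching rows so that editing them does not touch `have`
    new_df = have.loc[mask].copy()
    # The original row keeps the part before dt, the copy gets the part from dt on
    have.loc[mask, "end_date"] = dt - pd.Timedelta(seconds=1)
    new_df["start_date"] = dt
    l.append(new_df)
# Sort by start_date so the two halves of a split row come out in order
want = pd.concat(l).sort_values(["entity_id", "start_date"])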
entity_id start_date end_date attr1
0 1 2014-12-01 2014-12-31 23:59:59 A
0 1 2015-01-01 2015-02-28 23:59:59 A
1 1 2015-03-01 2015-03-30 23:59:59 B
1 1 2015-03-31 2015-05-31 23:59:59 B
2 2 2018-02-12 2019-01-31 23:59:59 D
3 2 2019-02-01 2023-05-28 23:59:59 J
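Note that the loop above tests each date only against the rows of `have`, so if two of the specific dates fell inside the same interval, the piece already moved into the list would not be cut again. That does not happen in the sample data, but it might in the full dataset. Below is a rough sketch of one way to handle that case without looping over rows or dates, using `np.searchsorted` and `np.repeat`. The function name `split_intervals` and its parameters `cut_dates`, `start_col`, and `end_col` are made up for this illustration, and it assumes that rows with a missing or unparseable date should be left unsplit.

```python
import numpy as np
import pandas as pd


def split_intervals(df, cut_dates, start_col="start_date", end_col="end_date"):
    """Split each [start, end] interval at every cut date strictly inside it.

    Hypothetical helper for illustration; all other columns are copied
    unchanged from the parent row.
    """
    cuts = np.sort(pd.to_datetime(pd.Series(cut_dates)).to_numpy(dtype="datetime64[ns]"))
    if len(cuts) == 0:
        return df.copy()
    start = pd.to_datetime(df[start_col], errors="coerce").to_numpy(dtype="datetime64[ns]")
    end = pd.to_datetime(df[end_col], errors="coerce").to_numpy(dtype="datetime64[ns]")

    # lo: index of the first cut strictly after start
    # hi: index of the first cut at or after end (cuts before hi are < end)
    lo = np.searchsorted(cuts, start, side="right")
    hi = np.searchsorted(cuts, end, side="left")
    n_in = np.maximum(hi - lo, 0)
    # Assumption: rows with a missing date are passed through unsplit
    n_in[np.isnat(start) | np.isnat(end)] = 0

    pieces = n_in + 1
    row = np.repeat(np.arange(len(df)), pieces)   # parent position of each piece
    first = np.cumsum(pieces) - pieces            # offset of each parent's first piece
    k = np.arange(pieces.sum()) - np.repeat(first, pieces)  # piece number within parent

    cut_idx = lo[row] + k                         # cut that closes piece k, when k < n_in
    last = len(cuts) - 1
    new_start = np.where(k == 0, start[row], cuts[np.clip(cut_idx - 1, 0, last)])
    new_end = np.where(
        k == n_in[row],
        end[row],
        cuts[np.clip(cut_idx, 0, last)] - np.timedelta64(1, "s"),
    )

    out = df.iloc[row].copy()
    out[start_col] = new_start
    out[end_col] = new_end
    return out


have = pd.DataFrame({
    "entity_id": [1, 1, 2, 2],
    "start_date": ["2014-12-01 00:00:00", "2015-03-01 00:00:00",
                   "2018-02-12 00:00:00", "2019-02-01 00:00:00"],
    "end_date": ["2015-02-28 23:59:59", "2015-05-31 23:59:59",
                 "2019-01-31 23:59:59", "2023-05-28 23:59:59"],
    "attr1": ["A", "B", "D", "J"],
})
specific_dt = ["2015-01-01 00:00:00", "2015-03-31 00:00:00"]
want = split_intervals(have, specific_dt)
```

The rows come out in the original order, with the pieces of each interval placed next to each other, so no sort is needed afterwards. Since every piece is a repeated copy of its parent row, all the attr columns are carried over no matter how many there are.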