Expanding a pandas DataFrame over the date range given in its columns

Posted 2024-04-26 22:48:45


I have a pandas DataFrame with dates and strings that looks like this:

Start        End           Note    Item
2016-10-22   2016-11-05    Z       A
2017-02-11   2017-02-25    W       B

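I need to expand/convert it to the form below, filling in the weeks (W-SAT) between the Start and End columns and forward-filling the data in Note and Item: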

Start        Note    Item
2016-10-22   Z       A
2016-10-29   Z       A
2016-11-05   Z       A
2017-02-11   W       B
2017-02-18   W       B
2017-02-25   W       B

What is the best way to do this in pandas? Some sort of multi-index apply?
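For anyone who wants to reproduce the example, the input frame can be built roughly like this. This is only a sketch: it assumes Start and End are meant to be datetime64 columns, which the tables above do not state explicitly.

```python
import pandas as pd

# Sample data copied from the question; parsing the dates is an assumption.
df = pd.DataFrame({
    "Start": pd.to_datetime(["2016-10-22", "2017-02-11"]),
    "End": pd.to_datetime(["2016-11-05", "2017-02-25"]),
    "Note": ["Z", "W"],
    "Item": ["A", "B"],
})

print(df.dtypes)
```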


3 Answers

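If the number of unique values of df['End'] - df['Start'] is not too large but the dataset has many rows, the following function will be much faster than looping over the dataset: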

import numpy as np
import pandas as pd


def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)

    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]

    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())

    # for each possible timediff, get the intermediate time-differences:
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td}).assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])

    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')

    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']

    # remove start-end cols, as well as temp cols used for calculations:
    data_expanded = data_expanded.drop(columns=[start_dt_colname, end_dt_colname, '_to_add', '_dt_diff'])

    # drop the temporary column so the caller's dataframe is left as it was:
    del dataframe['_dt_diff']

    return data_expanded
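The same goal, avoiding a Python-level loop over rows, can also be sketched without the merge step. Note that this is a different technique from the function above (row repetition with `Index.repeat` plus `cumcount`), and it assumes weekly steps, datetime64 columns, and End dates that fall a whole number of weeks after Start. The sample frame is the one from the question.

```python
import pandas as pd

# Sample data from the question (dates assumed to be datetime64).
df = pd.DataFrame({
    "Start": pd.to_datetime(["2016-10-22", "2017-02-11"]),
    "End": pd.to_datetime(["2016-11-05", "2017-02-25"]),
    "Note": ["Z", "W"],
    "Item": ["A", "B"],
})

week = pd.Timedelta(weeks=1)

# Number of weekly steps per row, counting the End date itself.
n_steps = ((df["End"] - df["Start"]) // week + 1).to_numpy()

# Repeat each row n_steps times, then number the copies 0, 1, 2, ...
# within each original row.
expanded = df.loc[df.index.repeat(n_steps)].drop(columns="End")
step = expanded.groupby(level=0).cumcount().to_numpy()

# Shift each copy of Start forward by its step count, in weeks.
expanded["Start"] = expanded["Start"] + pd.to_timedelta(step * 7, unit="D")
expanded = expanded.reset_index(drop=True)

print(expanded)
```

All of the work happens in whole-column operations, so the cost does not depend on how many distinct durations there are.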

You don't need to iterate at all.

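# Assumes Start and End are already datetime64 columns.
df_start_end = df.melt(id_vars=['Note', 'Item'], value_name='date')
# Resample on W-SAT (plain 'W' means W-SUN); ffill() replaces the older pad().
expanded = (df_start_end.groupby('Note')[['date', 'Item']]
            .apply(lambda g: g.set_index('date').resample('W-SAT').ffill())
            .reset_index())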
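One thing worth checking with the resample approach: in pandas the plain `'W'` alias is anchored on Sunday (it is the same as `'W-SUN'`), so the Saturday-based weeks from the question need `'W-SAT'`. A quick check using only the first date range from the question:

```python
import pandas as pd

# Plain 'W' is anchored on Sunday; 'W-SAT' is anchored on Saturday.
sunday_weeks = pd.date_range("2016-10-22", "2016-11-05", freq="W")
saturday_weeks = pd.date_range("2016-10-22", "2016-11-05", freq="W-SAT")

print(sunday_weeks.day_name().unique().tolist())     # ['Sunday']
print(saturday_weeks.strftime("%Y-%m-%d").tolist())  # ['2016-10-22', '2016-10-29', '2016-11-05']
```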

You can iterate over each row, build a new DataFrame for it, and then concatenate them all:

pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
               'Note': row.Note,
               'Item': row.Item}, columns=['Start', 'Note', 'Item']) 
           for i, row in df.iterrows()], ignore_index=True)

       Start Note Item
0 2016-10-22    Z    A
1 2016-10-29    Z    A
2 2016-11-05    Z    A
3 2017-02-11    W    B
4 2017-02-18    W    B
5 2017-02-25    W    B
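If building one small DataFrame per row feels heavy, a variant of the same idea is to store each row's date range in a single cell and then call `DataFrame.explode`. This is a sketch of a swapped-in technique rather than the code above; it still builds one `pd.date_range` per row, and it assumes the sample frame from the question with datetime64 columns.

```python
import pandas as pd

# Sample data from the question (dates assumed to be datetime64).
df = pd.DataFrame({
    "Start": pd.to_datetime(["2016-10-22", "2017-02-11"]),
    "End": pd.to_datetime(["2016-11-05", "2017-02-25"]),
    "Note": ["Z", "W"],
    "Item": ["A", "B"],
})

# One list of weekly Saturdays per row, stored in the Start column.
weekly = [list(pd.date_range(s, e, freq="W-SAT"))
          for s, e in zip(df["Start"], df["End"])]

expanded = (
    df.assign(Start=weekly)
      .drop(columns="End")
      .explode("Start")          # one output row per date in each list
      .reset_index(drop=True)
)

# explode() leaves an object column, so convert back to datetime64.
expanded["Start"] = pd.to_datetime(expanded["Start"])

print(expanded)
```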
