How to add rows for missing dates in a MultiIndex DataFrame based on criteria

Posted 2024-05-17 17:02:37


I have about 7.5 million rows of data in the following format:

ndc_description               effective_date        ...                             
12-HR DECONGEST 120MG CAPLET  2015-08-19            2015-08-26          G   NaN     NaN     1   0.36062     36800005452     Y   C/I     EA
                              2015-07-22            2015-08-12          G   NaN     NaN     1   0.37681     36800005452     Y   C/I     EA
                              2015-06-17            2015-07-15          G   NaN     NaN     1   0.36651     36800005452     Y   C/I     EA
Some Other drug               2016-11-21            2015-08-26          G   NaN     NaN     1   0.36062     36800005452     Y   C/I     EA
                              2016-07-23            2015-08-12          G   NaN     NaN     1   0.37681     36800005452     Y   C/I     EA
                              2016-05-17            2015-07-15          G   NaN     NaN     1   0.36651     36800005452     Y   C/I     EA

ndc_description and effective_date together form a MultiIndex.

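I have an additional dataset that I am merging with the one above. The two will be merged on the columns ndc_description and effective_date (the other columns shown are there purely to demonstrate that the dataset contains various other types of data).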

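Current problem: the dates in the two datasets do not match. In the dataset above they are (mostly) weekly, but that is not guaranteed. The other dataset has no guaranteed regularity either. So I think I need to add rows for every date between the dates listed in the effective_date column above, so that I can then merge on ndc_description and effective_date. Is this the best way to go about it? Given the volume of data involved, I would like to optimize the code before running it on everything.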

Possible solution: I have seen that .resample() may be of value here, but I have not been able to get it to work. Something like this: Cleaned_Price_Data.effective_date.resample('1D', fill_method = 'ffill', level = 1)
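For illustration, here is a minimal sketch of how a per-group daily resample could look. It assumes effective_date is a real datetime column (not strings), and the toy frame, the `price` column, and the variable names are made up for the example. As far as I know, recent pandas versions no longer accept a `fill_method` argument on `resample`, so the forward fill is chained as a separate call instead.

```python
import pandas as pd

# Toy stand-in for the real data; names and values are hypothetical.
df = pd.DataFrame(
    {
        "ndc_description": ["drug_a", "drug_a", "drug_b", "drug_b"],
        "effective_date": pd.to_datetime(
            ["2015-08-14", "2015-08-17", "2015-08-16", "2015-08-19"]
        ),
        "price": [0.36, 0.37, 1.10, 1.20],
    }
)

# Resample each drug separately to daily frequency and forward-fill.
# Each group is only expanded between its own first and last date.
daily = (
    df.set_index("effective_date")
      .groupby("ndc_description")["price"]
      .resample("D")
      .ffill()
      .reset_index()
)
print(daily)
```

Because the resample runs inside each group, no dates are filled outside a drug's own date range, and the added dates are not repeated across drugs.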

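I think it would also improve efficiency to specify the max and min dates somewhere in the above, so that it does not ffill anything more than necessary. It should also take the ndc_description value into account, so that the dates added to the effective_date column are not duplicated for every identical ndc_description value.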

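EDIT: Below is some code illustrating the current state of my dataframe and how it should look once the transformation is complete. I am trying to transform a dataframe like this: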

import pandas as pd

idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-17', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']

pre_transform = pd.DataFrame('-', idx, col)
pre_transform

into one like this (note: dates have been added):

idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-18', '2015-08-17', '2015-08-16', '2015-08-15', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']

post_change = pd.DataFrame('-', idx, col)
post_change
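One way the transformation between these two frames could be sketched is to reindex each drug onto its own daily date range. This is only an illustration of the intended result, not a tuned solution: it converts the date strings to datetimes first, loops over groups in Python (which may be slow on millions of rows), and produces dates in ascending rather than descending order. The names `flat`, `pieces`, and `post` are made up for the example.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["drug_a", "drug_b", "drug_c"], ["2015-08-19", "2015-08-17", "2015-08-14"]],
    names=["drug_name", "effective_date"],
)
pre_transform = pd.DataFrame("-", idx, ["other_data_1", "other_data_2", "other_data_3"])

# Work with real datetimes rather than strings.
flat = pre_transform.reset_index()
flat["effective_date"] = pd.to_datetime(flat["effective_date"])

pieces = []
for drug, grp in flat.groupby("drug_name"):
    grp = grp.set_index("effective_date").sort_index()
    # Daily range limited to this drug's own first and last date.
    full = pd.date_range(grp.index.min(), grp.index.max(), freq="D", name="effective_date")
    # Forward-fill every column (including drug_name) onto the new dates.
    pieces.append(grp.reindex(full, method="ffill"))

post = pd.concat(pieces).reset_index().set_index(["drug_name", "effective_date"])
print(post)
```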

EDIT 2: I came up with the code below (via Parfait's answer here), which seems to do the trick:

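def expand_dates(group):
    # One row per calendar day between this group's first and last effective_date
    return pd.DataFrame({'effective_date': pd.date_range(group['effective_date'].min(), group['effective_date'].max(), freq='D')})

price_cols = list(Cleaned_Price_Data.columns)

# Expand each ndc to daily rows, merge the original rows back in, then forward-fill the gaps
all_effective_dates = (
    Cleaned_Price_Data.groupby(['ndc']).apply(expand_dates).reset_index()
    .merge(Cleaned_Price_Data, how='left')[price_cols]
    .ffill()
)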

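However, at 55 million rows the result is very bloated, and I still need to merge it with another dataset. Any attempt at optimizing this (or a suggestion for a more efficient alternative) would be greatly appreciated.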
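One alternative that might avoid expanding the data to daily rows at all is `pd.merge_asof`, which matches each row of one frame to the most recent earlier-or-equal key in the other. The sketch below is only an assumption about what the second dataset looks like: the frame `other` and its `event_date` and `quantity` columns are hypothetical, and both date columns are assumed to be datetimes.

```python
import pandas as pd

# Price history, one row per (drug, effective_date); values are made up.
prices = pd.DataFrame(
    {
        "ndc_description": ["drug_a", "drug_a", "drug_b"],
        "effective_date": pd.to_datetime(["2015-08-12", "2015-08-19", "2015-08-14"]),
        "price": [0.37, 0.36, 1.10],
    }
)

# Hypothetical second dataset with irregular dates.
other = pd.DataFrame(
    {
        "ndc_description": ["drug_a", "drug_a", "drug_b"],
        "event_date": pd.to_datetime(["2015-08-15", "2015-08-20", "2015-08-16"]),
        "quantity": [10, 20, 5],
    }
)

# merge_asof needs both frames sorted by their date key.
# direction="backward" picks the latest price on or before each event_date,
# and by="ndc_description" keeps the matching within each drug.
merged = pd.merge_asof(
    other.sort_values("event_date"),
    prices.sort_values("effective_date"),
    left_on="event_date",
    right_on="effective_date",
    by="ndc_description",
    direction="backward",
)
print(merged)
```

This gives the same price per row as forward-filling to daily frequency and then merging on exact dates would, but the intermediate frame stays at the size of the original inputs.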


0 answers

No answers yet.
