Inserting 0 values for missing dates in a MultiIndex

Asked 2025-04-17 15:45

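Suppose I have a MultiIndex made up of a date and some categories (just one category here, for simplicity), and for each category I have a time series with the values of some process. I only have data for dates on which there was an observation, and I now want to add a 0 for every date without one. I found a way to do it, but it seems very inefficient: stacking and unstacking creates a lot of columns when there are many categories.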

import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
    for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
print(df)
# insert 0 values for missing dates
print(df.unstack().reindex(all_dates).fillna(0).stack())
print(all_dates)

                        value
date       category       
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3

                      value
            category       
2013-02-13 1             2
           2             3
2013-02-12 1             0
           2             0
2013-02-11 1             0
           2             7
2013-02-10 1             4
           2             7
[datetime.date(2013, 2, 13), datetime.date(2013, 2, 12),
    datetime.date(2013, 2, 11),     datetime.date(2013, 2, 10)]

Does anyone know a smarter way to do this?

Edit: I found another way to get the same result:

import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4, 5),
    (dt.date(2013, 2, 10), 2, 1, 7),
    (dt.date(2013, 2, 10), 2, 2, 7),
    (dt.date(2013, 2, 11), 2, 3, 7),
    (dt.date(2013, 2, 13), 1, 4, 2),
    (dt.date(2013, 2, 13), 2, 4, 3)],
    columns=['date', 'category', 'cat2', 'value'])
date_col = 'date'
other_index = ['category', 'cat2']
index = [date_col] + other_index
df.set_index(index, inplace=True)
grouped = df.groupby(level=other_index)
df_list = []
# reindex each (category, cat2) group against the full list of dates
for i, group in grouped:
    df_list.append(group.reset_index(level=other_index).reindex(all_dates).fillna(0))
print(pd.concat(df_list).set_index(other_index, append=True))

                    value
           category cat2       
2013-02-13 1        4         2
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 1        4         5
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        1         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        2         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 2        3         7
2013-02-10 0        0         0
2013-02-13 2        4         3
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 0        0         0


2 Answers

Have a look at this answer: "How to fill missing records in a Pandas DataFrame in a Pythonic way?"

You can do it like this:

import datetime
import pandas as pd

# make an empty dataframe with the index you want
def get_datetime(x):
    return datetime.date(2013, 2, 13) - datetime.timedelta(days=x)

all_dates = [get_datetime(x) for x in range(4)]
categories = [1, 2, 3, 4]
index = [[date, cat] for cat in categories for date in all_dates]

# this df will be just an index
df = pd.DataFrame(index)
df.columns = ['date', 'category']
df = df.set_index(['date', 'category'])

# now, if your original df is called df_orig, you can reindex it against the full index
# (reindex_axis was removed in pandas 1.0; reindex does the same job here)
df_orig = df_orig.reindex(df.index)

# and to add zeros (fillna returns a new frame, so assign the result)
df_orig = df_orig.fillna(0)

Answered 2025-04-17 by Python大师

You can build a new MultiIndex from the Cartesian product of the index levels you want, and then reindex your DataFrame against that new index.

(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
new_df = df.reindex(new_index)

# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)

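That's it! The new DataFrame contains every possible index combination, and the existing data sits at the correct index positions.

Read on for a more detailed explanation.

Explanation

Setting up the sample data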

import datetime as dt
import pandas as pd

days= 4
#List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
    for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns = ['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)

This is what the sample data looks like:

                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3

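Creating the new index

With the from_product method we can create a new MultiIndex. The new index is the Cartesian product of all the values you pass to the function.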

(date_index, category_index) = df.index.levels

new_index = pd.MultiIndex.from_product([all_dates, category_index])
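To make the Cartesian product concrete, here is a minimal standalone sketch. The literal `categories` list and the `names=` argument are my additions for illustration; the answer itself reads the categories from `df.index.levels` and leaves the levels unnamed.

```python
import datetime as dt
import pandas as pd

# Four consecutive dates ending on 2013-02-13, as in the sample data
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
categories = [1, 2]  # assumption: written out by hand for this sketch

# Every date paired with every category; the first iterable varies slowest
new_index = pd.MultiIndex.from_product(
    [all_dates, categories], names=["date", "category"]
)

print(len(new_index))  # 4 dates x 2 categories = 8 entries
print(new_index[0])    # first entry is the (2013-02-13, 1) pair
```

Passing `names=` is optional, but it keeps the level names on the reindexed frame, which a bare `from_product` call would drop.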

Reindexing

Reindex the existing DataFrame against the new index.

Every possible combination is now present, and the missing values show up as NaN.

new_df = df.reindex(new_index)

The expanded, reindexed DataFrame now looks like this:

              value
2013-02-13 1    2.0
           2    3.0
2013-02-12 1    NaN
           2    NaN
2013-02-11 1    NaN
           2    7.0
2013-02-10 1    4.0
           2    7.0
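If you only ever want zeros in the gaps, a possible shortcut is the `fill_value` argument of `reindex`, which fills the new rows directly instead of going through NaN. This is a sketch on the same sample data, not part of the original answer; the variable name `filled` is mine.

```python
import datetime as dt
import pandas as pd

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=["date", "category", "value"]).set_index(["date", "category"])

# Full date x category grid, keeping the original level names
new_index = pd.MultiIndex.from_product(
    [all_dates, df.index.levels[1]], names=df.index.names
)

# Rows that did not exist before get 0 instead of NaN
filled = df.reindex(new_index, fill_value=0)
print(filled)
print(filled["value"].dtype)
```

Because no NaN is ever introduced, the column should stay an integer column, so the later `fillna(0).astype(int)` step is not needed with this variant.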

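Nulls in integer columns

You can see that the data in the new DataFrame has changed from integers to floats. The default integer dtype (int64) cannot hold NaN, so pandas upcasts the column to float. If you like, you can convert all the nulls to 0 and cast the data back to integers.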

new_df = new_df.fillna(0).astype(int)
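If you would rather keep the gaps marked as missing while still having integers, the nullable `Int64` extension dtype is one option. This is only a sketch and assumes a pandas version that ships that dtype (0.24 or later); the small `values` Series stands in for the reindexed `value` column.

```python
import numpy as np
import pandas as pd

# Stand-in for a reindexed column: floats with a NaN where a date was missing
values = pd.Series([2.0, 3.0, np.nan, 7.0], name="value")

# Capital-I "Int64" is the nullable integer dtype; NaN becomes <NA>
nullable = values.astype("Int64")
print(nullable)

# Zeros can still be filled in later without leaving the integer dtype
zero_filled = nullable.fillna(0)
print(zero_filled)
```

Whether this beats a plain `fillna(0).astype(int)` depends on whether "no observation" and "observed a 0" need to stay distinguishable in your data.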

Result

              value
2013-02-13 1      2
           2      3
2013-02-12 1      0
           2      0
2013-02-11 1      0
           2      7
2013-02-10 1      4
           2      7
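The edit in the question uses two category levels (`category` and `cat2`). A full Cartesian product over both would also create pairs that never occur in the data, so here is a hedged sketch that expands only the pairs actually observed. The names `pairs` and `expanded`, and the choice to restrict to observed pairs, are my assumptions about what is wanted.

```python
import datetime as dt
import pandas as pd

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4, 5),
    (dt.date(2013, 2, 10), 2, 1, 7),
    (dt.date(2013, 2, 10), 2, 2, 7),
    (dt.date(2013, 2, 11), 2, 3, 7),
    (dt.date(2013, 2, 13), 1, 4, 2),
    (dt.date(2013, 2, 13), 2, 4, 3)],
    columns=["date", "category", "cat2", "value"])
df = df.set_index(["date", "category", "cat2"])

# Only the (category, cat2) pairs that actually occur in the data
pairs = df.index.droplevel("date").unique()

# Every date combined with every observed pair
new_index = pd.MultiIndex.from_tuples(
    [(date, cat, cat2) for date in all_dates for cat, cat2 in pairs],
    names=df.index.names,
)

expanded = df.reindex(new_index, fill_value=0)
print(expanded)
```

Unlike the groupby loop in the question, this leaves the `category` and `cat2` labels intact on the filled rows; only `value` is set to 0.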