Is there a simple way to expand/complete a pandas DataFrame so that it includes the missing observations across multiple columns?

>>> df = pd.DataFrame({
...     'category1': list('AABAAB'),
...     'category2': list('xyxxyx'),
...     'year': [2000, 2000, 2000, 2002, 2002, 2002],
...     'value': [0, 1, 0, 4, 3, 4]
... })
>>> df
  category1 category2  year  value
0         A         x  2000      0
1         A         y  2000      1
2         B         x  2000      0
3         A         x  2002      4
4         A         y  2002      3
5         B         x  2002      4

  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0

id_cols = ['category1', 'category2']
df_out = (df.pivot_table(index=id_cols, values='value', columns='year')
            .reindex(columns=range(2000, 2003))
            .stack(dropna=False)
            .sort_index(level=-1)
            .reset_index(name='value'))

  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0

2 Answers

User

Answer 1 · edited 2024-05-16 01:15:58

Let's use stack and unstack:

dfout=df.set_index(['year','category1','category2']).\
         value.unstack(level=0).\
         reindex(columns=range(2000,2003)).\
         stack(dropna=False).to_frame('value').\
         sort_index(level=2).reset_index()
  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0

User

Answer 2 · edited 2024-05-16 01:15:58

# create a separate dataframe holding every combination, then merge it with df to get the new form
a = ('A','x',range(2000,2003))
b = ('A','y',range(2000,2003))
c = ('B','x',range(2000,2003))

from itertools import product, chain
res = ((product(*ent)) for ent in (a,b,c))
columns = ['category1','category2','year']
fake = pd.DataFrame(chain.from_iterable(res),columns=columns)
fake.merge(df,on=columns,how='left').sort_values('year',ignore_index=True)

category1   category2   year    value
0   A   x   2000    0.0
1   A   y   2000    1.0
2   B   x   2000    0.0
3   A   x   2001    NaN
4   A   y   2001    NaN
5   B   x   2001    NaN
6   A   x   2002    4.0
7   A   y   2002    3.0
8   B   x   2002    4.0

Alternatively:

fake = df.drop_duplicates(['category1','category2']).filter(['category1','category2'])

fake.index = [2001]*len(fake)
# concatenate the two frames along the year index, then sort by year
pd.concat((df.set_index('year'),fake)).sort_index()
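Note that the concat approach above leaves year as the index rather than as a column. A minimal sketch of one way to get back to the column layout shown earlier, assuming the same example df; the names id_cols and out are only illustrative, and the stable sort is there to keep the original row order within each year:

```python
import pandas as pd

df = pd.DataFrame({
    'category1': list('AABAAB'),
    'category2': list('xyxxyx'),
    'year': [2000, 2000, 2000, 2002, 2002, 2002],
    'value': [0, 1, 0, 4, 3, 4],
})

id_cols = ['category1', 'category2']  # illustrative name, not from the answer

# one row per existing (category1, category2) pair, labelled with the missing year
fake = df.drop_duplicates(id_cols).filter(id_cols)
fake.index = [2001] * len(fake)

out = (pd.concat((df.set_index('year'), fake))
         .sort_index(kind='mergesort')   # stable sort keeps row order within a year
         .rename_axis('year')            # make sure the index is named before resetting
         .reset_index()[['category1', 'category2', 'year', 'value']])
print(out)
```

The value column becomes float because the added 2001 rows carry NaN.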

Update 2021/01/08:

You can use the complete function from pyjanitor to abstract the process; for now, you have to install the latest development version from GitHub:

# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor

df.complete(
    columns=[
        ("category1", "category2"),
        {"year": lambda df: [2000, 2001, 2002]},
    ]
)

  category1 category2   year    value
0   A           x       2000    0.0
1   A           x       2001    NaN
2   A           x       2002    4.0
3   A           y       2000    1.0
4   A           y       2001    NaN
5   A           y       2002    3.0
6   B           x       2000    0.0
7   B           x       2001    NaN
8   B           x       2002    4.0

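The complete function works by taking a list of the columns whose missing combinations should be filled in. The idea is inspired by tidyr's complete function. Because this question needs new values for the year column, a callable can be passed through a dictionary to supply those new values.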
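For readers without pyjanitor installed, the same idea can be approximated in plain pandas. This is only a rough sketch of the concept, not pyjanitor's actual implementation: it takes the existing (category1, category2) pairs, crosses them with the wanted years, and left-merges the original data back in. It assumes pandas 1.2 or later for how="cross"; the names combos, years, grid and out are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "category1": list("AABAAB"),
    "category2": list("xyxxyx"),
    "year": [2000, 2000, 2000, 2002, 2002, 2002],
    "value": [0, 1, 0, 4, 3, 4],
})

id_cols = ["category1", "category2"]

# existing (category1, category2) pairs, kept together as a unit
combos = df[id_cols].drop_duplicates()

# the full set of years wanted in the result
years = pd.DataFrame({"year": [2000, 2001, 2002]})

# every pair combined with every year (needs pandas >= 1.2)
grid = combos.merge(years, how="cross")

# bring the observed values back; unobserved combinations get NaN
out = grid.merge(df, on=id_cols + ["year"], how="left")
print(out)
```

Keeping the two category columns together in combos mirrors the ("category1", "category2") tuple in the complete call: only pairs that actually occur in the data are expanded, so no (B, y) rows are produced.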
