Python数据帧填充不存在

2024-04-20 10:33:49 发布

您现在位置:Python中文网/ 问答频道 /正文

我想知道是否有一种有效的方法将行添加到一个数据帧中,例如,如果没有足够的行来容纳另一列中的特定值,那么就包括一个平均值或一个预先定义的值。我想对问题的描述不是最好的,这就是为什么您会发现下面的示例:
假设我们有数据帧

df1
Client NumberOfProducts ID
A      1                2
A      5                1
B      1                2
B      6                1
C      9                1

我们希望每个客户机A、B、C、D有两行,不管这两行是否已经存在。所以对于客户机A和B,我们可以只复制行,对于C,我们要添加一行,表示Client=C,NumberOfProducts=average of existing rows=9,ID不感兴趣(因此我们可以将其设置为ID=minimest existing one-1=0,任何其他值,甚至NaN,都是可能的)。对于客户机D,没有一行,因此我们要添加两行,其中NumberOfProducts等于常量2.5。输出应该如下所示:

df1
Client NumberOfProducts ID
A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
C      9                0
D      2.5              NaN
D      2.5              NaN

到目前为止,我所做的是循环遍历数据帧,并在必要时添加行。由于这是相当低效的任何更好的解决方案将高度赞赏。你知道吗


Tags: of数据方法clientid示例客户机定义
1条回答
网友
1楼 · 发布于 2024-04-20 10:33:49

用途:

clients = ['A','B','C','D']
N = 2

#test only values from list and also filter only 2 rows for each client if necessary
df = df[df['Client'].isin(clients)].groupby('Client').head(N)

#create helper counter and reshape by unstack
df1 = df.set_index(['Client',df.groupby('Client').cumcount()]).unstack()
#set first if only 1 row per client - replace second NumberOfProducts by first 
df1[('NumberOfProducts',1)] = df1[('NumberOfProducts',1)].fillna(df1[('NumberOfProducts',0)])
# ... replace second ID by first subtracted by 1
df1[('ID',1)] = df1[('ID',1)].fillna(df1[('ID',0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#replace NumberOfProducts by constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print (df1)
       NumberOfProducts        ID     
                      0    1    0    1
Client                                
A                   1.0  5.0  2.0  1.0
B                   1.0  6.0  2.0  1.0
C                   9.0  9.0  1.0  0.0
D                   2.5  2.5  NaN  NaN

#last reshape to original
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print (df2)
  Client  NumberOfProducts   ID
0      A               1.0  2.0
1      A               5.0  1.0
2      B               1.0  2.0
3      B               6.0  1.0
4      C               9.0  1.0
5      C               9.0  0.0
6      D               2.5  NaN
7      D               2.5  NaN

相关问题 更多 >