Counting rows with NaN

Posted 2024-06-08 15:32:20

I have the following DataFrame:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN

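What I want to do is count the rows that are exactly identical, NaN values included.

The problem is that I am using groupby, which ignores NaN values: rows whose grouping keys contain NaN are left out of the count. As a result I do not get the correct output, that is, the exact number of times each row is repeated.

My code is as follows: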

^{pr2}$

If I print the variable x, I get this result, which shows all the duplicated rows:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
51  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
53  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN

Now I want to count the identical rows in x.

This should be the correct output:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff  num_reps
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN         4
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0         2
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0         3
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN         2

That is my problem: groupby ignores NaN values, which is why other similar posts on this topic did not help me.

Thanks.


2 answers

If your DataFrame is named df, you can count the duplicated rows with a single line of code:

sum(df.duplicated(keep = False))
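As a quick check of how that one-liner behaves with missing values, here is a minimal sketch on a made-up two-column frame (the data and the names `mask` and `n_dup_rows` are illustrative only, not from the question). It relies on `duplicated` treating NaN values in the same position as equal.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are identical, including the NaN.
df = pd.DataFrame({
    "a": [1.0, 2.0, 1.0, 3.0],
    "b": [np.nan, 5.0, np.nan, np.nan],
})

# keep=False marks every member of a duplicate group, not only the repeats.
mask = df.duplicated(keep=False)
n_dup_rows = int(mask.sum())

print(mask.tolist())  # [True, False, True, False]
print(n_dup_rows)     # 2
```

Note that this gives the total number of rows that belong to some duplicate group, not a count per distinct row.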

If you want to remove the duplicate rows, use the drop_duplicates method (see the documentation).

Example:

^{pr2}$

Import data.csv and drop the duplicate rows (by default, the first instance of each duplicated row is kept):

import pandas as pd
df = pd.read_csv("data.csv")
print(df.drop_duplicates())
#Output
   c1   c2   c3
0   a   3   NaN
1   b   9   4.0
2   c   12  5.0
5   d   19  20.0

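To count the duplicated rows, use the DataFrame's duplicated method with keep set to False (see the documentation). As mentioned above, you can simply use sum(df.duplicated(keep = False)) for this. Here is a more verbose way that shows what the duplicated method does: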

duplicate_rows = df.duplicated(keep = False)
print(duplicate_rows)

# Count the number of duplicates, i.e. the number of True values in
# the duplicate_rows boolean Series.

number_of_duplicates = sum(duplicate_rows)

print("Number of duplicate rows:")

print(number_of_duplicates)

#Output

#The duplicate_rows pandas series from df.duplicated(keep = False)
0     True
1     True
2    False
3     True
4     True
5    False
6     True
dtype: bool

#The number of rows from sum(df.duplicated(keep = False))
Number of duplicate rows:
5
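If a count per distinct row is wanted rather than one overall total, one possible approach is `DataFrame.value_counts` with `dropna=False`. This is only a sketch: it assumes pandas 1.3 or later (where that parameter is available), the toy data is made up, and the column name `num_reps` is borrowed from the question.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): ("a", NaN) appears twice, the other rows once.
df = pd.DataFrame({
    "c1": ["a", "a", "b", "a"],
    "c2": [np.nan, np.nan, 4.0, 1.0],
})

# dropna=False keeps rows that contain NaN in the tally.
counts = df.value_counts(dropna=False).reset_index(name="num_reps")

# Keep only the rows that occur more than once.
repeated = counts[counts["num_reps"] > 1]
print(repeated)
```

The original index labels are not preserved by this approach; the result is indexed by position.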

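I just solved it.

The problem, as I said, is that groupby does not accept NaN values.

So my fix is to replace all NaN values using fillna(0), which turns every NaN into 0, so that the rows can now be compared correctly.

Here is my new function, which works correctly: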

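def detect_duplicates(data):
    cols = data.columns.tolist()
    # Rows that occur more than once; duplicated() already treats NaN as equal.
    aux = data[data.duplicated(keep=False)]
    # Fill NaN with 0 so that groupby keeps those rows, then get each row's
    # group size (note: a genuine 0 then compares equal to a former NaN).
    sizes = aux.fillna(0).groupby(cols, sort=False)[cols[0]].transform("size")
    # Keep the first row of each group and attach its count by index label.
    x = aux.drop_duplicates().copy()
    x["num_reps"] = sizes.loc[x.index]

    return x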
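One caveat with fillna(0) is that a real 0 and a former NaN end up in the same group. The sketch below contrasts the three behaviours on made-up data; it assumes pandas 1.1 or later, where groupby accepts `dropna=False`, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): column k holds both genuine zeros and NaN.
df = pd.DataFrame({
    "k": [0.0, np.nan, np.nan, 0.0, 1.0],
    "v": ["x", "x", "x", "x", "y"],
})

# Default groupby silently drops the rows whose key contains NaN.
default_groups = df.groupby(["k", "v"]).size()

# dropna=False keeps NaN as a group key of its own.
nan_groups = df.groupby(["k", "v"], dropna=False).size()

# fillna(0) keeps the rows too, but merges NaN with the real zeros.
filled_groups = df.fillna(0).groupby(["k", "v"]).size()

print(default_groups.sum())  # 3 rows counted, the two NaN rows are missing
print(len(nan_groups))       # 3 groups: (0, x), (1, y), (NaN, x)
print(filled_groups.loc[(0.0, "x")])  # 4, zeros and NaN counted together
```

If the data can never contain a real 0 in a column that also has NaN, the fillna(0) route gives the same counts; otherwise `dropna=False` may be the safer choice.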
