Removing duplicates from a DataFrame while keeping the majority element

    Cat  Date
1   A    2019-12-30
2   A    2019-12-30
3   A    2020-12-30
4   A    2020-01-06
5   A    2020-01-06
6   B    2020-01-06
7   B    2020-01-13
8   B    2020-01-13
9   A    2020-01-13
.   .    .
.   .    .

3 Answers

User

Answer 1 · edited 2024-04-19 11:16:06

I would consider plain old groupby:

df.groupby(["Cat", "Date"]).size()\
  .reset_index(name="to_drop")\
  .drop("to_drop", axis=1)

Alternatively, you can use drop_duplicates on both columns:

df.drop_duplicates(['Date',"Cat"])
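
To see what these two snippets return, here is a minimal sketch that runs them on the nine rows visible in the question. The elided rows are left out, and the variable names `via_groupby` and `via_drop` are only illustrative. Both approaches keep one row per distinct (Cat, Date) pair, so a date that contains both A and B still appears twice in the result.

```python
import pandas as pd

# Only the nine rows shown in the question; the elided rows are not reproduced.
df = pd.DataFrame(
    {
        "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
        "Date": [
            "2019-12-30", "2019-12-30", "2020-12-30",
            "2020-01-06", "2020-01-06", "2020-01-06",
            "2020-01-13", "2020-01-13", "2020-01-13",
        ],
    },
    index=range(1, 10),
)

# Approach 1: group on both columns, then discard the helper count column.
via_groupby = (
    df.groupby(["Cat", "Date"])
    .size()
    .reset_index(name="to_drop")
    .drop("to_drop", axis=1)
)

# Approach 2: drop rows that repeat an earlier (Date, Cat) pair.
via_drop = df.drop_duplicates(["Date", "Cat"])

print(via_groupby)
print(via_drop)
```

On this sample each result has six rows, because 2020-01-06 and 2020-01-13 each contain both categories.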

User

Answer 2 · edited 2024-04-19 11:16:06

Try value_counts to count all the values after a groupby on the Date column:

df.groupby("Date").agg(lambda x: x.value_counts().index[0])
#            Cat
# Date
# 2019-12-30   A
# 2020-01-06   A
# 2020-01-13   B
# 2020-12-30   A

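Explanation:

1. Use groupby to split the DataFrame into groups by Date.
2. Use agg to apply an aggregation. It accepts a function that is applied to each group.
3. Define the aggregation function:
3.1. Use value_counts to get the number of occurrences of each value within each group: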

print(df.groupby("Date").agg(lambda x: x.value_counts()))
#                Cat
# Date
# 2019-12-30       2
# 2020-01-06  [3, 2]
# 2020-01-13  [2, 1]
# 2020-12-30       1

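Note: the result of value_counts is a Series sorted by count in descending order.

3.2. However, what we actually want is the values, not the counts. The trick is to use index on the Series: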

print(df.groupby("Date").agg(lambda x: x.value_counts().index))
#                Cat
# Date
# 2019-12-30       A
# 2020-01-06  [B, A]
# 2020-01-13  [B, A]
# 2020-12-30       A

3.3. Finally, select the first value:

print(df.groupby("Date").agg(lambda x: x.value_counts().index[0]))
#            Cat
# Date
# 2019-12-30   A
# 2020-01-06   B
# 2020-01-13   B
# 2020-12-30   A
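
The steps above can be put together in a self-contained sketch. It uses only the nine rows visible in the question, so the result reflects that sample rather than the full data set, and the name `majority` is just illustrative.

```python
import pandas as pd

# Only the nine rows shown in the question; the elided rows are not reproduced.
df = pd.DataFrame(
    {
        "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
        "Date": [
            "2019-12-30", "2019-12-30", "2020-12-30",
            "2020-01-06", "2020-01-06", "2020-01-06",
            "2020-01-13", "2020-01-13", "2020-01-13",
        ],
    },
    index=range(1, 10),
)

# value_counts() sorts by frequency (descending), so index[0] is the
# most frequent Cat within each Date group.
majority = df.groupby("Date")["Cat"].agg(lambda s: s.value_counts().index[0])
print(majority)
# Date
# 2019-12-30    A
# 2020-01-06    A
# 2020-01-13    B
# 2020-12-30    A
```

Selecting the "Cat" column before calling agg returns a Series; call `majority.reset_index()` if a two-column DataFrame is preferred.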

User

Answer 3 · edited 2024-04-19 11:16:06

Here is a simple solution:

def removeDuplicatesKeepBest(df):
    # Sort the DataFrame by Cat; sort_values returns a new DataFrame, so assign the result
    df = df.sort_values(by="Cat")
    # Look only at the Date column and keep the first occurrence if there is a duplicate
    df = df.drop_duplicates(subset="Date", keep="first")

    return df
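
Sorting by Cat and keeping the first row per Date selects the alphabetically first category, which need not be the most frequent one. A possible variation, sketched here under the assumption that "best" means "most frequent within the date", ranks rows by how often their (Date, Cat) pair occurs before dropping duplicates. The helper name `keep_majority_per_date` and the temporary `_count` column are hypothetical, and the sample is limited to the nine rows visible in the question.

```python
import pandas as pd


def keep_majority_per_date(df):
    """Hypothetical helper: keep one row per Date, carrying the most frequent Cat."""
    # Number of rows sharing each (Date, Cat) pair, aligned with the original rows.
    counts = df.groupby(["Date", "Cat"])["Cat"].transform("count")
    # Within each Date, put the most frequent Cat first.
    ranked = df.assign(_count=counts).sort_values(
        ["Date", "_count"], ascending=[True, False]
    )
    # Keep the top-ranked row per Date and remove the temporary column.
    return ranked.drop_duplicates(subset="Date", keep="first").drop(columns="_count")


# Only the nine rows shown in the question; the elided rows are not reproduced.
df = pd.DataFrame(
    {
        "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
        "Date": [
            "2019-12-30", "2019-12-30", "2020-12-30",
            "2020-01-06", "2020-01-06", "2020-01-06",
            "2020-01-13", "2020-01-13", "2020-01-13",
        ],
    },
    index=range(1, 10),
)

result = keep_majority_per_date(df)
print(result)
```

On this sample, 2020-01-13 keeps B (two rows) rather than A (one row). The input DataFrame is not modified, since every step returns a new object.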

Hope this helps.
