Getting consecutive occurrences based on a group

Published 2024-04-19 05:28:45


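I am trying to find a way to get groups of consecutive events, grouped by host and ordered by time. Ideally, I need the groups that meet a given threshold and have isCorrect == false.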

Example

Time    |   Host    |   isCorrect   |
-------------------------------------
10:01   |   HostA   |   true        |
10:02   |   HostB   |   true        |
10:03   |   HostA   |   false       |
10:15   |   HostA   |   false       |
10:16   |   HostA   |   false       |
10:18   |   HostB   |   false       |
10:20   |   HostA   |   true        |
10:21   |   HostA   |   true        |
10:22   |   HostB   |   false       |
10:23   |   HostB   |   false       |

Threshold: >= 3

should result in two groups:

Time    |   Host    |   isCorrect   | Group
--------------------------------------------
10:03   |   HostA   |   false       |1
10:15   |   HostA   |   false       |1
10:16   |   HostA   |   false       |1

10:18   |   HostB   |   false       |2
10:22   |   HostB   |   false       |2
10:23   |   HostB   |   false       |2

I have been reading https://towardsdatascience.com/pandas-dataframe-group-by-consecutive-certain-values-a6ed8e5d8cc, but I could not find a way to group by host first.


1 answer
Answer 1 · posted 2024-04-19 05:28:45

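First keep only the False rows by inverting the boolean mask with ~ and sort the values (if necessary), then filter the hosts by the threshold, and finally create the Group column with factorize: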

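import pandas as pd

# keep only the rows where isCorrect is False, ordered by host and time
df = df[~df['isCorrect']].sort_values(['Host', 'Time'])
# True for rows whose host has at least 3 False rows in total
mask = df['Host'].map(df['Host'].value_counts()) >= 3

df = df[mask].copy()
# number the remaining hosts 1, 2, ... in order of appearance
df['Group'] = pd.factorize(df['Host'])[0] + 1
print(df)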

    Time   Host  isCorrect  Group
2  10:03  HostA      False      1
3  10:15  HostA      False      1
4  10:16  HostA      False      1
5  10:18  HostB      False      2
8  10:22  HostB      False      2
9  10:23  HostB      False      2
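
To try the steps above end to end, here is a minimal sketch that rebuilds the sample table from the question and wraps the same filtering in a function. The function name `hosts_with_enough_false` and its `threshold` parameter are made up for illustration and are not part of the original answer. Note that this variant counts all False rows per host, whether or not they are consecutive.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Time": ["10:01", "10:02", "10:03", "10:15", "10:16",
             "10:18", "10:20", "10:21", "10:22", "10:23"],
    "Host": ["HostA", "HostB", "HostA", "HostA", "HostA",
             "HostB", "HostA", "HostA", "HostB", "HostB"],
    "isCorrect": [True, True, False, False, False,
                  False, True, True, False, False],
})


def hosts_with_enough_false(frame, threshold=3):
    """Hypothetical helper: False rows of hosts having >= threshold False rows."""
    # Keep the False rows only, ordered by host and then by time.
    out = frame[~frame["isCorrect"]].sort_values(["Host", "Time"])
    # Total number of False rows per host, broadcast back onto each row.
    counts = out["Host"].map(out["Host"].value_counts())
    out = out[counts >= threshold].copy()
    # Number the surviving hosts 1, 2, ... in order of appearance.
    out["Group"] = pd.factorize(out["Host"])[0] + 1
    return out


result = hosts_with_enough_false(df, threshold=3)
print(result)
```

The zero-padded "HH:MM" strings sort correctly as text here; real timestamps spanning several days would probably need converting with `pd.to_datetime` first.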

If you need to group by consecutive False values instead (starting again from the original DataFrame):

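# mark the rows where isCorrect is False
m = ~df['isCorrect']
# the running count of True rows is constant across a streak of False rows,
# so it serves as a streak id (rows outside the mask become NaN)
df['Group'] = df['isCorrect'].cumsum()[m]

df = df[m].sort_values(['Host', 'Time'])

# keep streaks with at least 3 rows for the same host
mask = df.groupby(['Group', 'Host'])['Group'].transform('size') >= 3

df = df[mask].copy()
# renumber the groups 1, 2, ... by host
df['Group'] = pd.factorize(df['Host'])[0] + 1
print(df)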
    Time   Host  isCorrect  Group
2  10:03  HostA      False      1
3  10:15  HostA      False      1
4  10:16  HostA      False      1
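
The snippet above counts True rows across all hosts, so HostA's True rows at 10:20 and 10:21 split HostB's streak and only HostA survives. If streaks should be measured within each host, as in the expected output of the question, one possible variant is to compute the running count per host. This is a sketch under that assumption, not the original answer's method; `consecutive_false_runs` and the temporary `run` column are hypothetical names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Time": ["10:01", "10:02", "10:03", "10:15", "10:16",
             "10:18", "10:20", "10:21", "10:22", "10:23"],
    "Host": ["HostA", "HostB", "HostA", "HostA", "HostA",
             "HostB", "HostA", "HostA", "HostB", "HostB"],
    "isCorrect": [True, True, False, False, False,
                  False, True, True, False, False],
})


def consecutive_false_runs(frame, threshold=3):
    """Hypothetical helper: streaks of >= threshold False rows within each host."""
    out = frame.sort_values(["Host", "Time"]).copy()
    # Per host, the running count of True rows stays constant across a
    # streak of False rows, so it identifies the streak inside that host.
    out["run"] = out["isCorrect"].astype(int).groupby(out["Host"]).cumsum()
    out = out[~out["isCorrect"]]
    # Length of each (host, streak) pair, broadcast back onto its rows.
    size = out.groupby(["Host", "run"])["run"].transform("size")
    out = out[size >= threshold].copy()
    # Number each surviving streak 1, 2, ... in order of appearance,
    # so two streaks on the same host get different numbers.
    out["Group"] = out.groupby(["Host", "run"], sort=False).ngroup() + 1
    return out.drop(columns="run")


result = consecutive_false_runs(df, threshold=3)
print(result)
```

On the sample data this keeps the three HostA rows as group 1 and the three HostB rows as group 2, because within HostB no True row falls between 10:18 and 10:23.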
