Identifying consecutive runs or groups in boolean data

Posted 2024-04-20 05:37:41


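I have a time-indexed boolean dataset, as in the example below. I want to flag every consecutive run of more than three 1s in the dataset and store the result in a new column named [Continuous_out_x]. Is there an efficient way to do this?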

I generated the test data like this:

import numpy as np
import pandas as pd
df = pd.DataFrame(zip(list(np.random.randint(2, size=20)),list(np.random.randint(2, size=20))), columns=['tag1','tag2'] ,index=pd.date_range('2020-01-01', periods=20, freq='s'))

The resulting data looks like this:

print(df)
                        tag1  tag2
2020-01-01 00:00:00     0     0
2020-01-01 00:00:01     1     0
2020-01-01 00:00:02     1     0
2020-01-01 00:00:03     1     1
2020-01-01 00:00:04     1     0
2020-01-01 00:00:05     1     0
2020-01-01 00:00:06     1     1
2020-01-01 00:00:07     0     1
2020-01-01 00:00:08     0     0
2020-01-01 00:00:09     1     1
2020-01-01 00:00:10     1     0
2020-01-01 00:00:11     0     1
2020-01-01 00:00:12     1     0
2020-01-01 00:00:13     0     1
2020-01-01 00:00:14     0     1
2020-01-01 00:00:15     0     1
2020-01-01 00:00:16     1     1
2020-01-01 00:00:17     0     0
2020-01-01 00:00:18     0     1
2020-01-01 00:00:19     1     0
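
Because the generator above is random, rerunning it will not reproduce this exact table. As a minimal sketch, the same frame can be rebuilt from the values printed above (the lists `tag1` and `tag2` are simply copied from the table), which makes it easier to check any answer against the expected output:

```python
import pandas as pd

# Values copied row by row from the printed table so the example is reproducible
tag1 = [0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
tag2 = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0]

df = pd.DataFrame(
    {"tag1": tag1, "tag2": tag2},
    index=pd.date_range("2020-01-01", periods=20, freq="s"),
)
print(df)
```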

The desired result for the sample data above looks like this:

print(df)
                         tag1  tag2  Continuous_out_1  Continuous_out_2
2020-01-01 00:00:00     0     0                 0                 0
2020-01-01 00:00:01     1     0                 1                 0
2020-01-01 00:00:02     1     0                 1                 0
2020-01-01 00:00:03     1     1                 1                 0
2020-01-01 00:00:04     1     0                 1                 0
2020-01-01 00:00:05     1     0                 1                 0
2020-01-01 00:00:06     1     1                 1                 0
2020-01-01 00:00:07     0     1                 0                 0
2020-01-01 00:00:08     0     0                 0                 0
2020-01-01 00:00:09     1     1                 0                 0
2020-01-01 00:00:10     1     0                 0                 0
2020-01-01 00:00:11     0     1                 0                 0
2020-01-01 00:00:12     1     0                 0                 0
2020-01-01 00:00:13     0     1                 0                 1
2020-01-01 00:00:14     0     1                 0                 1
2020-01-01 00:00:15     0     1                 0                 1
2020-01-01 00:00:16     1     1                 0                 1
2020-01-01 00:00:17     0     0                 0                 0
2020-01-01 00:00:18     0     1                 0                 0
2020-01-01 00:00:19     1     0                 0                 0

2 answers

You can label the contiguous true/false regions and then check whether each region is longer than your cutoff.

# list(...) snapshots the original columns, since the loop adds new ones
for colname, series in list(df.items()):
    new = f'Continuous_{colname}'
    df[new] = series.diff().ne(0).cumsum() # label contiguous regions
    df[new] = series.groupby(df[new]).transform('size') # get size of region
    df[new] = df[new].gt(3) * series # keep only 1s inside regions longer than the cutoff

Output

                     tag1  tag2  Continuous_tag1  Continuous_tag2
index
2020-01-01 00:00:00     0     0                0                0
2020-01-01 00:00:01     1     0                1                0
2020-01-01 00:00:02     1     0                1                0
2020-01-01 00:00:03     1     1                1                0
2020-01-01 00:00:04     1     0                1                0
2020-01-01 00:00:05     1     0                1                0
2020-01-01 00:00:06     1     1                1                0
2020-01-01 00:00:07     0     1                0                0
2020-01-01 00:00:08     0     0                0                0
2020-01-01 00:00:09     1     1                0                0
2020-01-01 00:00:10     1     0                0                0
2020-01-01 00:00:11     0     1                0                0
2020-01-01 00:00:12     1     0                0                0
2020-01-01 00:00:13     0     1                0                1
2020-01-01 00:00:14     0     1                0                1
2020-01-01 00:00:15     0     1                0                1
2020-01-01 00:00:16     1     1                0                1
2020-01-01 00:00:17     0     0                0                0
2020-01-01 00:00:18     0     1                0                0
2020-01-01 00:00:19     1     0                0                0
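
To see what each of the three steps produces, here is a small sketch on a hand-made series. The names `s`, `run_id`, `run_len`, `flag` and `steps` are made up for this illustration and are not part of the answer above; the run labels are built with `shift` rather than `diff`, which gives the same result for 0/1 data:

```python
import pandas as pd

s = pd.Series([0, 1, 1, 1, 1, 0, 1, 1])

# A new run starts wherever the value differs from the previous row
run_id = s.ne(s.shift()).cumsum()
# Length of the run that each row belongs to
run_len = s.groupby(run_id).transform("size")
# Flag rows whose value is 1 and whose run is longer than 3 rows
flag = (run_len.gt(3) & s.eq(1)).astype(int)

steps = pd.DataFrame({"value": s, "run_id": run_id, "run_len": run_len, "flag": flag})
print(steps)
```

Only the run of four 1s is flagged; the trailing run of two 1s and the single 0s are left at 0.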

You can do this as follows:

  • Create a series that distinguishes each streak (group)
  • Assign a bool to the groups that have more than three rows

Code

# ok to loop over a few columns, still very performant
for col in ["tag1", "tag2"]:
    col_no = col[-1]
    # label each streak: the id increases whenever the value changes
    df[f"group_{col}"] = np.cumsum(df[col].shift(1) != df[col])
    # streak longer than 3 rows AND value is 1 (a long run of 0s must not be flagged)
    df[f"{col}_counts"] = (df.groupby(f"group_{col}")[col].transform("count") > 3) & df[col].eq(1)
    df[f"Continuous_out_{col_no}"] = df[f"{col}_counts"].astype(int)
    df = df.drop(columns=[f"group_{col}", f"{col}_counts"])

Output

                     tag1  tag2  Continuous_out_1  Continuous_out_2
2020-01-01 00:00:00     0     0                 0                 0
           00:00:01     1     0                 1                 0
           00:00:02     1     0                 1                 0
           00:00:03     1     1                 1                 0
           00:00:04     1     0                 1                 0
           00:00:05     1     0                 1                 0
           00:00:06     1     1                 1                 0
           00:00:07     0     1                 0                 0
           00:00:08     0     0                 0                 0
           00:00:09     1     1                 0                 0
           00:00:10     1     0                 0                 0
           00:00:11     0     1                 0                 0
           00:00:12     1     0                 0                 0
           00:00:13     0     1                 0                 1
           00:00:14     0     1                 0                 1
           00:00:15     0     1                 0                 1
           00:00:16     1     1                 0                 1
           00:00:17     0     0                 0                 0
           00:00:18     0     1                 0                 0
           00:00:19     1     0                 0                 0
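
If the cutoff needs to vary, the same idea can be wrapped in a small helper. This is only a sketch: the function name `mark_long_runs`, its parameters `min_len` and `value`, and the `demo` frame are hypothetical and do not come from either answer. The demo data deliberately contains a run of four 0s to show that only runs of 1s are flagged:

```python
import pandas as pd

def mark_long_runs(series: pd.Series, min_len: int = 4, value: int = 1) -> pd.Series:
    """Return 1 where `series` equals `value` inside a run of at least `min_len` rows."""
    # A new run starts wherever the value differs from the previous row
    run_id = series.ne(series.shift()).cumsum()
    # Length of the run that each row belongs to
    run_len = series.groupby(run_id).transform("size")
    return (series.eq(value) & run_len.ge(min_len)).astype(int)

# Hypothetical demo data: tag1 starts with four 0s, tag2 starts with only three 1s
demo = pd.DataFrame({
    "tag1": [0, 0, 0, 0, 1, 1, 1, 1, 0],
    "tag2": [1, 1, 1, 0, 0, 1, 1, 1, 1],
})
for col in ["tag1", "tag2"]:
    demo[f"Continuous_out_{col[-1]}"] = mark_long_runs(demo[col])
print(demo)
```

Here `min_len=4` corresponds to "more than three" in the question; passing a different `min_len` changes the cutoff without touching the grouping logic.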
