Iterate over the numbers in a column and get the row number where each repetition starts

2 votes

1 answer

56 views

Asked 2025-04-14 17:40

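In my dataset, I need to find every place where 0 appears more than 280 times in a row, and return the row number of the first row of each such run. I am using Python 3.11.

Example data:

differences

Or create a sample dataset: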

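   import numpy as np
   import pandas as pd

   ACD=[0,5]

   # 100 zeros followed by 100 fives in a single column, then shuffled row-wise
   df2 = pd.DataFrame(np.repeat(ACD, 100, axis=0))
   df3=df2.sample(frac=1,axis=1).sample(frac=1).reset_index(drop=True)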

My code so far:

c=[]
for values,row in df.loc[:, ['differences']].iterrows():
        i=0
        while row['differences']  == 0:
            count = sum(1 for i in row)
            i +=1
            if count > 280:
                continue
            c.append(np.where(row['differences']))
        else:
            values+=1

Expected output:

row_number_rep= [5,90,120] # the row numbers where each repetition starts

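When I run this code, I get an error:

<stdin>:8: DeprecationWarning: Calling nonzero on 0d arrays is deprecated, as it behaves surprisingly. Use `atleast_1d(arr).nonzero()` if the old behavior was intended.

I need help improving this code. I think the problem is that the first 280 elements are not all 0, so I need to keep iterating over the whole column to find every row number where a run of more than 280 zeros starts.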


1 Answer

Using pandas

Assume the following example, with the threshold set to 4 (instead of 280):

df = pd.DataFrame({'differences': [0,0,0,0,0,1,2,0,3,0,0,0,0,0,0,4,0,5]})

    differences
0             0  # 0: first stretch of >4
1             0
2             0
3             0
4             0
5             1
6             2
7             0
8             3
9             0  # 9: second stretch of >4
10            0
11            0
12            0
13            0
14            0
15            4
16            0
17            5

You can use groupby.size to filter the result of groupby.first:

thresh = 4

m = df['differences'].eq(0)
group = (~m).cumsum().to_numpy()

g = df.reset_index()[m].groupby(group[m])
g.size()

out = g['index'].first()[g.size()>thresh].to_numpy()

Output: array([0, 9])
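To make the grouping step easier to follow, here is a small sketch that prints the intermediate table for the same example. The names `run_id`, `summary`, `start`, `length` and `starts` are introduced here for illustration only and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'differences': [0,0,0,0,0,1,2,0,3,0,0,0,0,0,0,4,0,5]})
thresh = 4

m = df['differences'].eq(0)        # True where the value is 0
group = (~m).cumsum().to_numpy()   # grows by 1 at every non-zero, so each run of zeros shares one id

# Keep only the zero rows and attach the id of the run they belong to
runs = df.reset_index()[m].assign(run_id=group[m])

# One line per run of zeros: first row number and number of rows
summary = runs.groupby('run_id')['index'].agg(start='first', length='size')
print(summary)

# Keep the runs that are strictly longer than the threshold
starts = summary.loc[summary['length'] > thresh, 'start'].tolist()
print(starts)  # [0, 9]
```

The single zeros at rows 7 and 16 each form a run of length 1, so they are dropped by the `length > thresh` filter.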

Using a loop

lst = [0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0]
thresh = 4

start = -1
zeros = False
count = 0
out = []
for i, v in enumerate(lst+[-1]):
    if v==0:
        if not zeros:
            count = 0
            start = i
            zeros = True
        count += 1
        continue
    if count > thresh:
        if zeros:
            out.append(start)
    zeros = False

out
# [0, 9, 18]

Using `itertools.groupby`:

from itertools import groupby

lst = [0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0]
thresh = 4

out = [x[0][0] for k,g in groupby(enumerate(lst), key=lambda x: x[1]==0)
       if k and len(x:=list(g))>thresh]
# [0, 9, 18]
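For the real threshold of 280, the same idea could be wrapped in a small helper. This is only a sketch: the function name `zero_run_starts` and the synthetic `data` list are assumptions made for illustration, not the asker's dataset.

```python
from itertools import groupby

def zero_run_starts(values, thresh):
    """Return the start position of every run of zeros longer than `thresh`."""
    starts = []
    pos = 0  # position of the first element of the current run
    for is_zero, run in groupby(values, key=lambda v: v == 0):
        length = sum(1 for _ in run)
        if is_zero and length > thresh:
            starts.append(pos)
        pos += length
    return starts

# Synthetic column (hypothetical): runs of 281, 280 and 300 zeros
data = [3] * 5 + [0] * 281 + [7] + [0] * 280 + [2] * 4 + [0] * 300
result = zero_run_starts(data, 280)
print(result)  # [5, 571]
```

The run of exactly 280 zeros is skipped because the comparison is strict (`>`), matching "more than 280" in the question. With a DataFrame, `df['differences'].tolist()` could be passed as `values`; the returned numbers are positions, which match the row labels only when the index is the default 0..n-1.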

Answered 2025-04-14 by Python大师
