How to select the first occurrence of each repeated item in sequence in a list, using traditional Python or pandas/numpy/scipy

Posted 2024-05-08 19:46:53


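Suppose there is a list "series" that has some repeated elements at several index positions. Is there a way to find the first occurrence of each run of consecutive repeats of a number?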

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]

The return value should be something like [5,11,17,21], which are the indices of the first occurrence in the repeated runs [16,16], [16,16,16], [17,17] and [18,18].


3 Answers

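First create a unique group id for each run with shift + cumsum, then build a mask for the first element of each duplicated group and filter with boolean indexing: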

import pandas as pd

s = pd.Series([2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18])

s1 = s.shift(1).ne(s).cumsum()
m = ~s1.duplicated() & s1.duplicated(keep=False)
s2 = m.index[m].tolist()
print (s2)
[5, 11, 17, 21]

print (s1)
0      1
1      2
2      3
3      4
4      5
5      6
6      6
7      7
8      8
9      9
10    10
11    11
12    11
13    11
14    12
15    13
16    14
17    15
18    15
19    16
20    17
21    18
22    18
dtype: int32

print (m)
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14    False
15    False
16    False
17     True
18    False
19    False
20    False
21     True
22    False
dtype: bool
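
The question also asks about traditional Python. As a minimal sketch, the same idea of grouping consecutive equal values can be written with itertools.groupby from the standard library. The function name first_of_runs_py is made up for this illustration and does not come from any of the answers.

```python
from itertools import groupby

def first_of_runs_py(values):
    """Return the start index of every run of two or more equal consecutive values."""
    starts = []
    pos = 0  # index of the first element of the current run
    for _, run in groupby(values):
        length = sum(1 for _ in run)  # number of elements in this run
        if length > 1:
            starts.append(pos)
        pos += length
    return starts

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
print(first_of_runs_py(series))  # [5, 11, 17, 21]
```

This makes a single pass over the list and needs no third-party packages, though it will likely be slower than the vectorized versions on large inputs.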

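You can use shift. Note that this marks every element that equals its successor, so index 12 (inside the run [16,16,16]) is returned as well, which differs from the expected [5,11,17,21]: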

In [3815]: s = pd.Series(series)

In [3816]: cond = (s == s.shift(-1))

In [3817]: cond.index[cond]
Out[3817]: Int64Index([5, 11, 12, 17, 21], dtype='int64')

Or, with diff:

In [3828]: cond = s.diff(-1).eq(0)

In [3829]: cond.index[cond]
Out[3829]: Int64Index([5, 11, 12, 17, 21], dtype='int64')

For list output, use tolist:

In [3833]: cond.index[cond].tolist()
Out[3833]: [5, 11, 12, 17, 21]

Details

In [3823]: s.head(10)
Out[3823]:
0     2
1     3
2     7
3    10
4    11
5    16
6    16
7     9
8    11
9    12
dtype: int64

In [3824]: cond.head(10)
Out[3824]:
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
dtype: bool
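
The shift condition above is True at every position that equals its successor, including positions in the middle of a longer run. One possible way to keep only the start of each run, sketched here as an assumption rather than as part of the original answer, is to also require that the previous position was not already True. The variable names first and result are illustrative.

```python
import pandas as pd

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
s = pd.Series(series)

# True where an element equals the next one (includes index 12)
cond = s == s.shift(-1)

# Keep a position only if the previous position was not already True;
# fill_value=False keeps the shifted Series boolean
first = cond & ~cond.shift(1, fill_value=False)

result = first.index[first].tolist()
print(result)  # [5, 11, 17, 21]
```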

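Here is an approach that uses array slicing for performance, similar to the piRSquared2 function in the timing section, but without any appending/concatenation: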

a = np.array(series)
out = np.flatnonzero((a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))+1

Sample run:

In [28]: a = np.array(series)

In [29]: np.flatnonzero((a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))+1
Out[29]: array([ 5, 11, 17, 21])
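
One caveat: the slicing expression above only inspects positions 1 through n-2 as candidate starts, so a run that begins at index 0 (for example [1, 1, 2]) is not reported. The following is a hedged sketch of a variant that covers that case. The name first_of_runs_np is hypothetical, and this version was not part of the timings in this thread.

```python
import numpy as np

def first_of_runs_np(values):
    """Start indices of runs of two or more equal consecutive values, including index 0."""
    a = np.asarray(values)
    if a.size < 2:
        return []
    eq_next = a[1:] == a[:-1]      # eq_next[i] is True when a[i] == a[i+1]
    starts = eq_next.copy()
    starts[1:] &= ~eq_next[:-1]    # drop positions whose predecessor already matched
    return np.flatnonzero(starts).tolist()

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
print(first_of_runs_np(series))     # [5, 11, 17, 21]
print(first_of_runs_np([1, 1, 2]))  # [0]
```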

Runtime test (for the working solutions)

Approaches:

import numpy as np
import pandas as pd

def piRSquared1(series):
    d = np.flatnonzero(np.diff(series) == 0)
    w = np.append(True, np.diff(d) > 1)
    return d[w].tolist()

def piRSquared2(series):
    s = np.array(series)
    return np.flatnonzero(
        np.append(s[:-1] == s[1:], True) &
        np.append(True, s[1:] != s[:-1])
    ).tolist()

def Zach(series):
    s = pd.Series(series)
    i = [g.index[0] for _, g in s.groupby((s != s.shift()).cumsum()) if len(g) > 1]
    return i

def jezrael(series):
    s = pd.Series(series)
    s1 = s.shift(1).ne(s).cumsum()
    m = ~s1.duplicated() & s1.duplicated(keep=False)
    s2 = m.index[m].tolist()
    return s2    

def divakar(series):
    a = np.array(series)
    x = a[1:-1]
    return (np.flatnonzero((a[2:] == x) & (x != a[:-2]))+1).tolist()

For the setup, we simply tile the sample input many times.

Timings:

Case 1: Large set

In [34]: series0 = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]

In [35]: series = np.tile(series0,10000).tolist()

In [36]: %timeit piRSquared1(series)
    ...: %timeit piRSquared2(series)
    ...: %timeit Zach(series)
    ...: %timeit jezrael(series)
    ...: %timeit divakar(series)
    ...: 
100 loops, best of 3: 8.06 ms per loop
100 loops, best of 3: 7.79 ms per loop
1 loop, best of 3: 3.88 s per loop
10 loops, best of 3: 24.3 ms per loop
100 loops, best of 3: 7.97 ms per loop

Case 2: Larger set (on the top two solutions)

In [40]: series = np.tile(series0,1000000).tolist()

In [41]: %timeit piRSquared2(series)
1 loop, best of 3: 823 ms per loop

In [42]: %timeit divakar(series)
1 loop, best of 3: 823 ms per loop

Now, these two solutions differ only in that the latter avoids appending. Let's take a closer look at them on a smaller dataset:

In [43]: series = np.tile(series0,100).tolist()

In [44]: %timeit piRSquared2(series)
10000 loops, best of 3: 89.4 µs per loop

In [45]: %timeit divakar(series)
10000 loops, best of 3: 82.8 µs per loop

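This shows that avoiding the append/concatenation in the latter solution helps somewhat on smaller datasets (82.8 µs versus 89.4 µs), but on larger datasets the two become comparable.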
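
%timeit is an IPython magic. Outside IPython, a comparable measurement can be sketched with the standard timeit module, as below. The repeat and number values are arbitrary choices for illustration, and the resulting figures depend on the machine and library versions, so they will not match the numbers quoted here.

```python
import timeit
import numpy as np

def divakar(series):
    a = np.array(series)
    x = a[1:-1]
    return (np.flatnonzero((a[2:] == x) & (x != a[:-2])) + 1).tolist()

series0 = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
series = np.tile(series0, 100).tolist()

# Take the best of 3 repeats of 1000 calls each, similar in spirit to "best of 3"
number = 1000
best = min(timeit.repeat(lambda: divakar(series), repeat=3, number=number)) / number
print(f"{best * 1e6:.1f} µs per loop")
```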

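On the larger dataset, a marginal improvement is possible by using a single concatenation instead of the final +1. The last step can then be rewritten as: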

np.flatnonzero(np.concatenate(([False],(a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))))
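
As a quick sanity check, the sketch below compares the two forms on the sample input. Prepending a single False shifts the mask by one position, which has the same effect as adding 1 to the indices. The variable names mask, out_shifted and out_concat are illustrative.

```python
import numpy as np

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
a = np.array(series)

# mask[i] is True when a[i+1] starts a run: a[i+1] == a[i+2] and a[i+1] != a[i]
mask = (a[2:] == a[1:-1]) & (a[1:-1] != a[:-2])

out_shifted = np.flatnonzero(mask) + 1                       # original form
out_concat = np.flatnonzero(np.concatenate(([False], mask))) # single-concatenation form

print(out_shifted.tolist())  # [5, 11, 17, 21]
print(out_concat.tolist())   # [5, 11, 17, 21]
```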
