Removing specific consecutive duplicates from a list

Asked 2025-04-16 14:28

I have a list of strings that looks like this:

['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']

I want to replace every run of consecutive '**' entries with a single '**', while leaving 'bar', 'bar' untouched. My code currently looks like this:

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
np = [p[0]]
for pi in range(1,len(p)):
  if p[pi] == '**' and np[-1] == '**':
    continue
  np.append(p[pi])

Is there a more Pythonic way to do this?

Tags: list manipulation, string processing, coding style, consecutive duplicates

6 Answers

Here is a solution that does not use itertools.groupby():

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
     'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]

def treat(A):
    prec = A[0]; yield prec
    for x in A[1:]:
        if (prec,x)!=('**','**'):  yield x
        prec = x

print p
print
print list(treat(p))

Result

['**', 'foo', '*', 'bar', 'bar', '**', '**', '**',  
 'baz', '**', '**',
 'foo', '*', '*', 'bar', 'bar','bar', '**', '**',
 'foo', 'bar']


['**', 'foo', '*', 'bar', 'bar', '**',
 'baz', '**',
 'foo', '*', '*', 'bar', 'bar', 'bar', '**',
 'foo', 'bar']

Here is another solution, inspired by dugres.

from itertools import groupby

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
     'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]

res = []
for k, g in groupby(p):
    res.extend(  ['**'] if k=='**' else list(g) )    
print res

This is similar to Tom Zych's solution, but simpler.

Edit
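For readers on Python 3, where print is a function, the groupby() approach above might be sketched roughly as follows. The function name collapse_runs and its target parameter are my own illustrative choices, not part of the answer.

```python
from itertools import groupby

def collapse_runs(items, target='**'):
    """Collapse each run of consecutive `target` items into a single one."""
    result = []
    for key, group in groupby(items):
        # groupby() yields one (key, group) pair per run of equal items;
        # a run of `target` is replaced by one copy, other runs are kept whole
        result.extend([key] if key == target else group)
    return result

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
print(collapse_runs(p))
```

Each group is consumed inside the loop before groupby() advances, which matters because a group iterator becomes invalid once the next run starts.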

p = ['**','**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
     'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar', '**', '**', '**']


q= ['**',12,'**',45, 'foo',78, '*',751, 'bar',4789, 'bar',3, '**', 5,'**',7, '**',
    73,'baz',4, '**',8, '**',20,'foo', 8,'*',36,'*', 36,'bar', 11,'bar',0,'bar',9,
    '**', 78,'**',21,'foo',27,'bar',355, '**',33, '**',37, '**','end']

def treat(B,dedupl):
    B = iter(B)
    prec = B.next(); yield prec
    for x in B:
        if not(prec==x==dedupl):  yield x
        prec = x

print 'gen = ( x for x in q[::2])'
gen = ( x for x in q[::2])
print 'list(gen)==p is ',list(gen)==p
gen = ( x for x in q[::2])
print 'list(treat(gen)==',list(treat(gen,'**'))

ch = '??h4i4???4t4y?45l????hmo4j5???'
print '\nch==',ch
print "''.join(treat(ch,'?'))==",''.join(treat(ch,'?'))

print "\nlist(treat([],'%%'))==",list(treat([],'%%'))

Result

gen = ( x for x in q[::2])
list(gen)==p is  True
list(treat(gen)== ['**', 'foo', '*', 'bar', 'bar', '**', 'baz', '**', 'foo', '*', '*', 'bar', 'bar', 'bar', '**', 'foo', 'bar', '**']

ch== ??h4i4???4t4y?45l????hmo4j5???
''.join(treat(ch,'?'))== ?h4i4?4t4y?45l?hmo4j5?

list(treat([],'%%'))== []

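Note: with a generator function, the output can be adapted to the type of the input simply by post-processing at the call site, without touching the internals of the generator function;

Tom Zych's solution, by contrast, is not as easy to adapt to different input types.

Edit 2

I was looking for a one-liner using a list comprehension or a generator expression.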
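To illustrate the note above, here is a hedged Python 3 sketch of the same generator. It assumes Python 3.7 or later, where a StopIteration escaping from next() inside a generator is turned into a RuntimeError (PEP 479), so the empty-input case has to be handled explicitly rather than relying on the Python 2 behaviour of B.next().

```python
def treat(iterable, dedupl):
    """Yield items, skipping an item when it and its predecessor both equal `dedupl`."""
    it = iter(iterable)
    try:
        prec = next(it)
    except StopIteration:
        return  # empty input: nothing to yield
    yield prec
    for x in it:
        if not (prec == x == dedupl):
            yield x
        prec = x

# The caller decides the output type; the generator itself is unchanged.
as_list = list(treat(['**', '**', 'foo', 'bar', 'bar'], '**'))
as_str = ''.join(treat('??h4i4???4t', '?'))
as_tuple = tuple(treat((n for n in [0, 0, 1, 0, 0]), 0))
print(as_list, as_str, as_tuple)
```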

I found two ways; I believe it is not possible without groupby().

from itertools import groupby
from operator import concat

p = ['**', '**','foo', '*', 'bar', 'bar', '**', '**', '**',
     'bar','**','foo','sun','sun','sun']
print 'p==',p,'\n'

dedupl = ("**",'sun')
print 'dedupl==',repr(dedupl)

print [ x for k, g in groupby(p) for x in ((k,) if k in dedupl else g) ]

# or

print reduce(concat,( [k] if k in dedupl else list(g) for k, g in groupby(p)),[])
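In Python 3, reduce() lives in functools rather than being a builtin. As an alternative one-liner, and this is my own variation rather than something from the answer, itertools.chain.from_iterable can flatten the groups lazily:

```python
from itertools import chain, groupby

p = ['**', '**', 'foo', '*', 'bar', 'bar', '**', '**', '**',
     'bar', '**', 'foo', 'sun', 'sun', 'sun']
dedupl = ('**', 'sun')

# For each run: a one-element tuple if the key must be de-duplicated,
# otherwise the whole group; chain.from_iterable() flattens the pieces.
result = list(chain.from_iterable(
    (k,) if k in dedupl else g for k, g in groupby(p)))
print(result)
```

chain.from_iterable() only asks for the next piece after exhausting the current one, so each group is fully read before groupby() moves on.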

Based on the same principle, dugres's function can easily be turned into a generator function:

from itertools import groupby

def compress(iterable, to_compress):
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x

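However, this generator function has two drawbacks:

It uses groupby(), which is not easy to understand for someone unfamiliar with Python.
Its execution time is longer than that of my own generator function treat() and of John Machin's generator function, neither of which uses groupby().

I modified all three slightly so that they accept a sequence of items to de-duplicate, and measured their execution times: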

from time import clock
from itertools import groupby

def squeeze(iterable, victims, _dummy=object()):
    if hasattr(iterable, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    previous = _dummy
    for item in iterable:
        if item in victims and item==previous:
            continue
        previous = item
        yield item

def treat(B,victims):
    if hasattr(B, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    B = iter(B)
    prec = B.next(); yield prec
    for x in B:
        if x  not in victims or x!=prec:  yield x
        prec = x

def compress(iterable, to_compress):
    if hasattr(iterable, '__iter__') and not hasattr(to_compress, '__iter__'):
        to_compress = (to_compress,)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x

p = ['**', '**','su','foo', '*', 'bar', 'bar', '**', '**', '**',
     'su','su','**','bin', '*','*','bar','bar','su','su','su']

n = 10000

te = clock()
for i in xrange(n):
    a = list(compress(p,('**','sun')))
print clock()-te,'  generator function with groupby()'

te = clock()
for i in xrange(n):
    b = list(treat(p,('**','sun')))
print clock()-te,'  generator function eyquem'


te = clock()
for i in xrange(n):
    c = list(squeeze(p,('**','sun')))
print clock()-te,'  generator function John Machin'

print p
print 'a==b==c is ',a==b==c
print a

The instruction

if hasattr(iterable, '__iter__') and not hasattr(to_compress, '__iter__'):
    to_compress = (to_compress,)

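is needed to avoid an error when the iterable argument is a sequence and the other argument is just a string: the latter then has to be wrapped in a container, provided the iterable argument is not itself a string.

This relies on the fact that containers such as tuples, lists and sets have an __iter__ method, whereas strings (in Python 2) do not. The following code shows the problem: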

def compress(iterable, to_compress):
    if hasattr(iterable, '__iter__') and not hasattr( to_compress, '__iter__'):
        to_compress = (to_compress,)
    print 't_compress==',repr(to_compress)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x


def compress_bof(iterable, to_compress):
    if not hasattr(to_compress, '__iter__'): # to_compress is a string
        to_compress = (to_compress,)
    print 't_compress==',repr(to_compress)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x


def compress_bug(iterable, to_compress_bug):
    print 't_compress==',repr(to_compress_bug)
    for k, g in groupby(iterable):
        #print 'k==',k,k in to_compress_bug
        if k in to_compress_bug:
            yield k
        else:
            for x in g: yield x


q = ';;;htr56;but78;;;;$$$$;ios4!'
print 'q==',q
dedupl = ";$"
print 'dedupl==',repr(dedupl)
print

print "''.join(compress    (q,"+repr(dedupl)+")) :\n",''.join(compress    (q,dedupl))+\
      ' <-CORRECT ONE'
print
print "''.join(compress_bof(q,"+repr(dedupl)+")) :\n",''.join(compress_bof(q,dedupl))+\
      '  <====== error ===='
print
print "''.join(compress_bug(q,"+repr(dedupl)+")) :\n",''.join(compress_bug(q,dedupl))

print '\n\n\n'


q = [';$', ';$',';$','foo', ';', 'bar','bar',';',';',';','$','$','foo',';$12',';$12']
print 'q==',q
dedupl = ";$12"
print 'dedupl==',repr(dedupl)
print
print 'list(compress    (q,'+repr(dedupl)+')) :\n',list(compress    (q,dedupl)),\
      ' <-CORRECT ONE'
print
print 'list(compress_bof(q,'+repr(dedupl)+')) :\n',list(compress_bof(q,dedupl))
print
print 'list(compress_bug(q,'+repr(dedupl)+')) :\n',list(compress_bug(q,dedupl)),\
      '  <====== error ===='
print

Result

q== ;;;htr56;but78;;;;$$$$;ios4!
dedupl== ';$'

''.join(compress    (q,';$')) :
t_compress== ';$'
;htr56;but78;$;ios4! <-CORRECT ONE

''.join(compress_bof(q,';$')) :
t_compress== (';$',)
;;;htr56;but78;;;;$$$$;ios4!  <====== error ====

''.join(compress_bug(q,';$')) :
t_compress== ';$'
;htr56;but78;$;ios4!




q== [';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12', ';$12']
dedupl== ';$12'

list(compress    (q,';$12')) :
t_compress== (';$12',)
[';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12']  <-CORRECT ONE

list(compress_bof(q,';$12')) :
t_compress== (';$12',)
[';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12']

list(compress_bug(q,';$12')) :
t_compress== ';$12'
[';$', 'foo', ';', 'bar', 'bar', ';', '$', 'foo', ';$12']   <====== error ====
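One caveat about the hasattr(..., '__iter__') test used above: it depends on Python 2, where str has no __iter__ method. In Python 3, strings do have __iter__, so the test no longer tells strings apart from containers. A possible Python 3 sketch, assuming an explicit isinstance check is acceptable, could look like this:

```python
from itertools import groupby

def compress(iterable, to_compress):
    # Wrap a lone string in a tuple, unless the input is itself a string,
    # in which case each character of to_compress is a separate victim.
    if not isinstance(iterable, str) and isinstance(to_compress, str):
        to_compress = (to_compress,)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            yield from g

print(''.join(compress(';;;htr56;but78;;;;$$$$;ios4!', ';$')))
print(list(compress([';$', ';$', 'foo', ';$12', ';$12'], ';$12')))
```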

I obtained the following execution times:

0.390163274941   generator function with groupby()
0.324547114228   generator function eyquem
0.310176572721   generator function John Machin
['**', '**', 'su', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']
a==b==c is  True
['**', 'su', 'foo', '*', 'bar', 'bar', '**', 'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']
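The timing code above uses time.clock(), which was removed in Python 3.8. A rough Python 3 equivalent using the standard timeit module might look like the sketch below; the absolute numbers will differ by machine and interpreter, so no timings are claimed here. The two functions are simplified Python 3 ports (they expect victims to already be a container).

```python
import timeit
from itertools import groupby

def compress(iterable, victims):
    for k, g in groupby(iterable):
        if k in victims:
            yield k
        else:
            yield from g

def squeeze(iterable, victims, _dummy=object()):
    previous = _dummy
    for item in iterable:
        if item in victims and item == previous:
            continue
        previous = item
        yield item

p = ['**', '**', 'su', 'foo', '*', 'bar', 'bar', '**', '**', '**',
     'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']
victims = ('**', 'sun')

for func in (compress, squeeze):
    # timeit.timeit() runs the callable `number` times and returns total seconds
    elapsed = timeit.timeit(lambda: list(func(p, victims)), number=10000)
    print(f'{elapsed:.3f}s  {func.__name__}')
```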

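I prefer John Machin's solution, because it does not need the instruction B = iter(B) that mine does.

However, the instructions previous = _dummy and _dummy = object() look a bit odd to me. So I finally came to think that a better solution is the following code, which works even when a string is passed as the iterable argument, and in which the first object assigned to previous is not a fake one: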

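def squeeze(iterable, victims):
    if hasattr(iterable, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    # Use one shared iterator so the second loop resumes after the first item
    # instead of restarting (restarting dropped a leading victim)
    iterable = iter(iterable)
    for item in iterable:
        previous = item
        yield item  # the first item can never be a repeat
        break
    for item in iterable:
        if item in victims and item==previous:
            continue
        previous = item
        yield item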

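Edit 3

I had understood that object() was being used as a sentinel.

But I was puzzled by the fact that object is called. Yesterday I thought object was something so special that it could never appear in any iterable passed to squeeze(). So I wondered why you call it, John Machin, and that made me doubt its nature; that is why I asked for confirmation that object is the root of all classes (the "super-metaclass", as I put it).

But today I think I understand why you call object in your code.

In fact, object could perfectly well appear in an iterable, why not? The root class object is itself an object, so nothing prevents it from being placed in an iterable before de-duplication, who knows? Hence using object itself as the sentinel would be incorrect.

So you did not use object, but an instance, object(), as the sentinel.

But I wondered why you chose this mysterious thing, the return value of calling object.

I thought about it and noticed what may be the reason for the call:

Calling object creates an instance, since object is the most basic class in Python. Each instance created this way is a distinct object that never compares equal to any previously created object instance: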

a = object()
b = object()
c = object()
d = object()

print id(a),'\n',id(b),'\n',id(c),'\n',id(d)

print a==b,a==c,a==d
print b==c,b==d,c==d

Result

10818752 
10818760 
10818768 
10818776
False False False
False False False

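So it is certain that _dummy=object() is a unique object, with a unique id and a unique value. By the way, I wonder what the value of an object instance actually is. In any case, the following code shows the problem with _dummy=object, and that there is no problem with _dummy=object():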

def imperfect_squeeze(iterable, victim, _dummy=object):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

def squeeze(iterable, victim, _dummy=object()):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

wat = object
li = [wat,'**','**','foo',wat,wat]
print 'imperfect_squeeze\n''li before ==',li
print map(id,li)
li = list(imperfect_squeeze(li,[wat,'**']))
print 'li after  ==',li
print


wat = object()
li = [wat,'**','**','foo',wat,wat]
print 'squeeze\n''li before ==',li
print map(id,li)
li = list(squeeze(li,[wat,'**']))
print 'li after  ==',li
print


li = [object(),'**','**','foo',object(),object()]
print 'squeeze\n''li before ==',li
print map(id,li)
li = list(squeeze(li,[li[0],'**']))
print 'li after  ==',li

Result

imperfect_squeeze
li before == [<type 'object'>, '**', '**', 'foo', <type 'object'>, <type 'object'>]
[505317320, 18578968, 18578968, 13208848, 505317320, 505317320]
id(previous)   == 505317320
id(iterable[0])== 505317320
li after  == ['**', 'foo', <type 'object'>]

squeeze
li before == [<object object at 0x00A514C8>, '**', '**', 'foo', <object object at 0x00A514C8>, <object object at 0x00A514C8>]
[10818760, 18578968, 18578968, 13208848, 10818760, 10818760]
id(previous)   == 10818752
id(iterable[0])== 10818760
li after  == [<object object at 0x00A514C8>, '**', 'foo', <object object at 0x00A514C8>]

squeeze
li before == [<object object at 0x00A514D0>, '**', '**', 'foo', <object object at 0x00A514D8>, <object object at 0x00A514E0>]
[10818768, 18578968, 18578968, 13208848, 10818776, 10818784]
id(previous)   == 10818752
id(iterable[0])== 10818768
li after  == [<object object at 0x00A514D0>, '**', 'foo', <object object at 0x00A514D8>, <object object at 0x00A514E0>]

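The problem is that after processing by imperfect_squeeze(), the <type 'object'> that was the first element of the list is missing.

Admittedly, this "problem" arises only when the first element of the list is object itself: that is a lot of thought for such a small probability... but a rigorous programmer considers every case.

If we use list instead of object, the results are different: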

def imperfect_sqlize(iterable, victim, _dummy=list):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

def sqlize(iterable, victim, _dummy=list()):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

wat = list
li = [wat,'**','**','foo',wat,wat]
print 'imperfect_sqlize\n''li before ==',li
print map(id,li)
li = list(imperfect_sqlize(li,[wat,'**']))
print 'li after  ==',li
print

wat = list()
li = [wat,'**','**','foo',wat,wat]
print 'sqlize\n''li before ==',li
print map(id,li)
li = list(sqlize(li,[wat,'**']))
print 'li after  ==',li
print

li = [list(),'**','**','foo',list(),list()]
print 'sqlize\n''li before ==',li
print map(id,li)
li = list(sqlize(li,[li[0],'**']))
print 'li after  ==',li

Result

imperfect_sqlize
li before == [<type 'list'>, '**', '**', 'foo', <type 'list'>, <type 'list'>]
[505343304, 18578968, 18578968, 13208848, 505343304, 505343304]
id(previous)   == 505343304
id(iterable[0])== 505343304
li after  == ['**', 'foo', <type 'list'>]

sqlize
li before == [[], '**', '**', 'foo', [], []]
[18734936, 18578968, 18578968, 13208848, 18734936, 18734936]
id(previous)   == 18734656
id(iterable[0])== 18734936
li after  == ['**', 'foo', []]

sqlize
li before == [[], '**', '**', 'foo', [], []]
[18734696, 18578968, 18578968, 13208848, 18735016, 18734816]
id(previous)   == 18734656
id(iterable[0])== 18734696
li after  == ['**', 'foo', []]

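Are there other objects in Python that have this property?

John Machin, why did you choose an instance of object as the sentinel in your generator function? Did you already know about this property?

Answered 2025-04-16 by Python大师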
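On that question: to my understanding, any instance of a class that does not define __eq__ inherits identity-based equality from object, so it behaves like object() here, whereas list() compares by value and [] == [] is true. The Python 3 sketch below illustrates this; the class name Marker is purely illustrative.

```python
class Marker:
    """Defines no __eq__, so an instance is equal only to itself."""

print(Marker() == Marker())  # identity-based comparison, like object()
print(list() == list())      # value-based comparison

def squeeze(iterable, victims, _dummy=Marker()):
    previous = _dummy
    for item in iterable:
        if item in victims and item == previous:
            continue
        previous = item
        yield item

li = [[], '**', '**', 'foo', [], []]
print(list(squeeze(li, [[], '**'])))
```

With a Marker() sentinel the leading [] survives, unlike in the sqlize() run above, because no list can compare equal to the sentinel.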


In my view, this way of writing it is quite Pythonic:

result = [v for i, v in enumerate(L) if L[i:i+2] != ["**", "**"]]

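The only "trick" here is that when i == len(L)-1, the expression L[i:i+2] returns a list containing a single element.

Of course, the same expression can also be used as a generator.

Answered 2025-04-16 by Python大师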
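A minimal sketch of that generator form, assuming L is a list (it must support slicing, and the slice is compared against a list). The function name drop_repeated and the target parameter are hypothetical.

```python
def drop_repeated(L, target='**'):
    """Lazily yield the items of list L, keeping only the last of each run of `target`."""
    pair = [target, target]
    # An item is dropped when it and its successor form the pair;
    # at the last index the slice has one element, so it is always kept.
    return (v for i, v in enumerate(L) if L[i:i+2] != pair)

L = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
print(list(drop_repeated(L)))
```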


Not sure whether this counts as "pythonic", but it should work, and it is more concise:

star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
star_list = [i for i, next_i in zip(star_list, star_list[1:] + [None]) 
             if (i, next_i) != ('**', '**')]

The code above copies the list twice; if you want to avoid that, consider Tom Zych's approach. Alternatively, you can do this:

from itertools import islice, izip, chain

star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
sl_shift = chain(islice(star_list, 1, None), [None])
star_list = [i for i, next_i in izip(star_list, sl_shift) 
             if (i, next_i) != ('**', '**')]

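This can be generalized further and made friendlier to iterators and, more importantly, more readable, by following a variation on the pairwise recipe from the itertools documentation: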

from itertools import islice, izip, chain, tee
def compress(seq, x):
    seq, shift = tee(seq)
    shift = chain(islice(shift, 1, None), (object(),))
    return (i for i, j in izip(seq, shift) if (i, j) != (x, x))

Tested:

>>> list(compress(star_list, '**'))
['**', 'foo', '*', 'bar', 'bar', '**', 'baz']
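itertools.izip does not exist in Python 3, where the builtin zip() is already lazy. A Python 3 port of the function above could therefore be sketched as:

```python
from itertools import chain, islice, tee

def compress(seq, x):
    """Lazily drop every `x` that is immediately followed by another `x`."""
    seq, shift = tee(seq)
    # `shift` is the input advanced by one item, padded with a fresh object()
    # that cannot equal x, so the final item is always kept
    shift = chain(islice(shift, 1, None), (object(),))
    return (i for i, j in zip(seq, shift) if (i, j) != (x, x))

star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
print(list(compress(star_list, '**')))
```

Because it is built on tee(), it also accepts one-shot iterators such as generators.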

Answered 2025-04-16 by Python大师
