python: Remove consecutive repeated words from a sentence

s = "Et puis j'obtiens : [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] Donc, ça veut dire que la suite de nombres réels"

from itertools import groupby

# groupby yields one key per run of equal adjacent items
no_dupes = [k for k, v in groupby(s.split())]

# Put the list back together into a sentence
groupby_output = ' '.join(no_dupes)
print('No duplicates:', groupby_output)

2 Answers

User

Answer 1 · edited 2024-05-28 23:19:16

You need a slightly more involved regular expression to recognize the repeated phrases in brackets:

import re

pat = re.compile(r'(\[[^\]]*\])(?:\s*\1)+')

print(pat.sub(r'\1', s))
# Et puis j'obtiens : [voir écran] Donc, ça veut dire que la suite de nombres réels

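(\[[^\]]*\]) captures any number of non-] characters between a pair of brackets, and (?:\s*\1)+ matches one or more adjacent repeats of that group. The substitution then replaces the whole run with a single instance.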
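To make the two parts of the pattern easier to read, here is a minimal sketch that spells it out with re.VERBOSE and wraps it in a helper. The names BRACKET_REPEAT and collapse_bracket_repeats are made up for illustration; the pattern itself is the one from this answer.

```python
import re

# Hypothetical names; the pattern is the one from the answer, written in verbose mode.
BRACKET_REPEAT = re.compile(r"""
    (\[[^\]]*\])    # group 1: an opening bracket, any non-] characters, a closing bracket
    (?:\s*\1)+      # one or more further copies of group 1, each after optional whitespace
""", re.VERBOSE)

def collapse_bracket_repeats(text):
    """Replace each run of identical adjacent [..] groups with a single copy."""
    return BRACKET_REPEAT.sub(r"\1", text)

print(collapse_bracket_repeats("a [x] [x] [x] b [y] [x] c"))
# a [x] b [y] [x] c
```

Only adjacent repeats are collapsed: the later [x] survives because a different group, [y], sits in between.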

User

Answer 2 · edited 2024-05-28 23:19:16

Using split() would also break up '[voir ecran]', but you can split manually:

An O(n) solution that walks the string once:

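# adjacent string literals are concatenated; the backslash continues the line
# (each piece ends with a space so the joined text keeps its word breaks)
s = "Et puis j'obtiens : [voir écran] [voir écran] [voir écran] " \
    "[voir écran] [voir écran] [voir écran] [voir écran] " \
    "[voir écran] [voir écran] [voir écran] Donc, ça veut " \
    "dire que la suite de nombres réels"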

seen = set()
result = []
tmp = []
for c in s:
    if tmp and c == "]":
        tmp.append(c)
        tmp = ''.join(tmp)
        if tmp not in seen:
            result.append(tmp)
            seen.add(tmp)
        tmp = []
    elif tmp:
        tmp.append(c)
    elif not tmp and c == "[":
        tmp.append(c)
    else:
        result.append(c)

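# an unclosed "[..." is still a list of characters here: join it before the lookup
if tmp:
    rest = ''.join(tmp)
    if rest not in seen:
        result.append(rest)
        seen.add(rest)
    tmp = []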

s_after = ''.join(result)
print(s_after)

Output:

Et puis j'obtiens : [voir écran]          Donc, ça veut dire que la suite de nombres réels

Runs of multiple spaces are not removed from the result; that has to be done afterwards.
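One possible way to do that cleanup is a single re.sub call. This is only a sketch: the sample string below is a shortened stand-in for the loop's result, and cleaned is a name chosen here.

```python
import re

# Assumed input: a shortened stand-in for the string the loop produces,
# with a run of leftover spaces where the repeats used to be.
s_after = "Et puis j'obtiens : [voir écran]          Donc, ça veut dire"

# Replace every run of two or more spaces with a single space.
cleaned = re.sub(r" {2,}", " ", s_after)
print(cleaned)
# Et puis j'obtiens : [voir écran] Donc, ça veut dire
```

If tabs and newlines should be normalized as well, ' '.join(s_after.split()) would be an alternative.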

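Walk the string, appending each character to result until you hit a [. From there, collect every character into tmp until you reach a ]. Join tmp into a string and check the seen set: if it is already there, do nothing and reset tmp; otherwise add it to result and seen, then reset tmp. A later occurrence of the same [...] is therefore not added again.

Continue to the end; if tmp still holds characters, add them (this could be something like '[some rest text, no bracket').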
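Note that the loop drops every later occurrence of a bracketed phrase, not only adjacent ones. If only consecutive repeats should go, one option is to combine the groupby idea from the question with a bracket-aware tokenizer. The sketch below assumes bracketed groups are separated from other text by whitespace; dedupe_consecutive is a hypothetical name.

```python
import re
from itertools import groupby

def dedupe_consecutive(text):
    # A token is either a whole [..] group or a run of non-whitespace characters.
    tokens = re.findall(r"\[[^\]]*\]|\S+", text)
    # groupby yields one key per run of equal adjacent tokens.
    return " ".join(key for key, _ in groupby(tokens))

print(dedupe_consecutive("puis : [voir écran] [voir écran] Donc Donc [voir écran] fin"))
# puis : [voir écran] Donc [voir écran] fin
```

Because the tokens are joined with single spaces, this variant also normalizes whitespace, so no separate cleanup pass is needed.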
