Is there a way to remove repeated, consecutive words/phrases from a string?

3 Answers

User

Answer #1 · edited 2024-05-13 05:36:34

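I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes a list and groups the repeated, consecutive items in that list into tuples of (item_value, iterator_of_those_values). Using it here looks like this: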

>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
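
To make it clearer what groupby hands back, here is a small sketch that lists each run of equal, adjacent words together with its length. The helper name runs is made up for illustration; only itertools.groupby comes from the standard library.

```python
from itertools import groupby

def runs(words):
    """Return (word, run_length) for each run of equal, adjacent words."""
    # groupby yields (key, iterator_over_the_run); consuming the iterator
    # with list() lets us count how many times the word was repeated.
    return [(word, len(list(group))) for word, group in groupby(words)]

print(runs('foo bar bar black sheep any any any'.split()))
# [('foo', 1), ('bar', 2), ('black', 1), ('sheep', 1), ('any', 3)]
```

Note that only adjacent repeats are merged: a word that shows up again later in the sentence starts a new group.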

So let's extend that a bit with a function that returns a list with the repeated values removed:

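from itertools import chain, groupby

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))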

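That works well for one-word phrases, but it doesn't help with longer ones. What to do? Well, first, we want to check for longer phrases by striding over the original list of words: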

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
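
As a quick illustration of what stride yields, here is a sketch on a made-up seven-word list. The generator is repeated unchanged so the snippet runs on its own; the sample words are an assumption for the demo.

```python
def stride(lst, offset, length):
    # Same generator as above, copied so this snippet is self-contained.
    if offset:
        yield lst[:offset]          # leading partial chunk before the offset

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

words = 'a b c d e f g'.split()

print(list(stride(words, 0, 3)))
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]

print(list(stride(words, 1, 3)))
# [['a'], ['b', 'c', 'd'], ['e', 'f', 'g']]
```

Shifting the offset changes where the chunk boundaries fall, which is why a repeated phrase that straddles a boundary at one offset can line up as two identical chunks at another.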

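Now we're cooking! OK. So our strategy is to first remove all the single-word duplicates. Next, we remove the two-word duplicates, starting from offset 0 and then from offset 1. After that, the three-word duplicates starting from offsets 0, 1 and 2, and so on, until we have tried five-word duplicates: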

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

Putting it all together:

from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'

b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)

User

Answer #2 · edited 2024-05-13 05:36:34

You can use the re module.

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'

If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'

Edit: an addition for your last example. To handle it, you have to keep calling re.sub for as long as there are duplicated phrases. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
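
The loop above can also be wrapped in a function. The following is only a sketch: the name collapse_repeats is made up, and it swaps the separate re.search check for re.subn, which returns the new string together with the number of substitutions made, so each pass scans the text once.

```python
import re

# Same pattern as above: a phrase followed by one or more whitespace-separated
# copies of itself, anchored on word boundaries.
_REPEAT = re.compile(r'\b(.+)(\s+\1\b)+')

def collapse_repeats(text):
    """Keep collapsing adjacent repeated phrases until none are left."""
    while True:
        text, count = _REPEAT.subn(r'\1', text)
        if count == 0:   # nothing was replaced on this pass, so we are done
            return text

print(collapse_repeats('foo bar foo bar foo bar'))
# foo bar
```

Every successful substitution makes the string shorter, so the loop always terminates.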

User

Answer #3 · edited 2024-05-13 05:36:34

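Personally, I don't think we need any other modules for this (although I admit some of them are great). I handled it with plain loops, after first converting the string to a list. I tried it on all the examples listed above and it works fine.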

sentence = input("Please enter your sentence:\n")

word_list = sentence.split()

def check_if_same(i, j):  # checks whether word_list[i:j] is immediately followed by an identical phrase
    end = (2 * j) - i  # end point of the second of the two slices to compare (essentially j + phrase_len)
    is_same = False
    if word_list[i:j] == word_list[j:end]:
        is_same = True
    # The line below is just for debugging. It prints the slices being compared and whether they are considered equal.
    # print("Comparing: " + ' '.join(word_list[i:j]) + " | " + ' '.join(word_list[j:end]) + " " + str(is_same))
    return is_same

phrase_len = 1

while phrase_len <= len(word_list) // 2:  # checks the sentence for different phrase lengths

    curr_word_index = 0

    while curr_word_index < len(word_list):  # checks every start position for the current phrase length

        result = check_if_same(curr_word_index, curr_word_index + phrase_len)  # checks similarity

        if result:
            del word_list[curr_word_index:curr_word_index + phrase_len]  # deletes the repeated phrase
        else:
            curr_word_index += 1  # only advance when nothing was deleted, so triple repeats are caught

    phrase_len += 1

print("Answer: " + ' '.join(word_list))
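
For reuse or testing, the same loop idea can be put into a function that takes a list of words instead of reading input and relying on a global. This is just a sketch of that variant; the name remove_repeated_phrases and the sample sentence are made up for illustration.

```python
def remove_repeated_phrases(words):
    """Return a copy of words with adjacent repeated phrases collapsed."""
    words = list(words)  # work on a copy so the caller's list is untouched
    phrase_len = 1
    while phrase_len <= len(words) // 2:
        i = 0
        while i < len(words):
            j = i + phrase_len
            # Compare the phrase starting at i with the one right after it.
            if words[i:j] == words[j:j + phrase_len]:
                del words[i:j]   # drop one copy and re-check the same index
            else:
                i += 1
        phrase_len += 1
    return words

print(' '.join(remove_repeated_phrases('foo bar foo bar foo bar'.split())))
# foo bar
```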
