Print all possible phrases (runs of consecutive words) in a given string

7 votes
4 answers
4128 views
Asked 2025-04-18 14:49

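I'm trying to print phrases from a piece of text. I want to print every phrase in the text, starting at two words long and going up to the maximum number of words the text allows. The program I wrote below prints phrases of up to five words, but I haven't found a better way to print all possible phrases.

My definition of a phrase: consecutive words in a string, regardless of their meaning.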

def phrase_builder(i):
    phrase_length = 4
    phrase_list = []
    for x in range(0, len(i)-phrase_length):
        phrase_list.append(str(i[x]) + " " + str(i[x+1]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]) + " " + str(i[x+4]))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))

The output of this program is:

['the big', 'the big fat', 'the big fat cat', 'the big fat cat sits',
'big fat', 'big fat cat', 'big fat cat sits', 'big fat cat sits on',
'fat cat', 'fat cat sits', 'fat cat sits on', 'fat cat sits on the',
'cat sits', 'cat sits on', 'cat sits on the', 'cat sits on the mat',
'sits on', 'sits on the', 'sits on the mat', 'sits on the mat eating',
'on the', 'on the mat', 'on the mat eating', 'on the mat eating a',
'the mat', 'the mat eating', 'the mat eating a', 'the mat eating a rat']

I'd like to print phrases such as "the big fat cat sits on the mat eating" and "fat cat sits on the mat eating a rat", and so on.

Can anyone give me some suggestions?

4 Answers

1

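You need a systematic way to enumerate all the possible phrases.

One way is to start from each word in turn and generate every phrase that begins with that word.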

def phrase_builder(my_words):
    phrases = []
    for i, word in enumerate(my_words):
        phrases.append(word)
        for nextword in my_words[i+1:]:
            phrases.append(phrases[-1] + " " + nextword)
        # Remove the one-word phrase.
        phrases.remove(word)
    return phrases



text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))
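The "start at each word and keep extending" idea can also be sketched with itertools.accumulate, which yields the growing prefixes directly, so nothing has to be removed afterwards. This is only an illustrative variant under that assumption, not the answer's own code; it reuses the name phrase_builder and assumes Python 3.

```python
from itertools import accumulate

def phrase_builder(words):
    """Yield every run of two or more consecutive words, grouped by start word."""
    for i in range(len(words)):
        # accumulate() yields the growing prefixes: w0, "w0 w1", "w0 w1 w2", ...
        prefixes = accumulate(words[i:], lambda phrase, word: phrase + " " + word)
        next(prefixes)  # skip the one-word prefix
        yield from prefixes

text = "the big fat cat sits on the mat eating a rat"
phrases = list(phrase_builder(text.split()))
print(phrases[:3])  # ['the big', 'the big fat', 'the big fat cat']
```

Because it is a generator, phrases are produced lazily; wrap the call in list() when you need them all at once.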

1

I think the simplest approach is to iterate over all possible start and end positions in the words list and generate the phrase for the corresponding sublist:

def phrase_builder(words):
    for start in range(0, len(words)-1):
        for end in range(start+2, len(words)+1):
            yield ' '.join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
for phrase in phrase_builder(text.split()):
    print(phrase)

Output:

the big
the big fat
...
the big fat cat sits on the mat eating a rat
...
sits on the mat eating a
...
eating a rat
a rat
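As a quick sanity check on the start/end approach: a list of n words has n*(n-1)/2 runs of two or more consecutive words, so the 11-word sample should give 55 phrases. The sketch below repeats the generator so it runs on its own (Python 3 assumed); the names n and expected are only for illustration.

```python
def phrase_builder(words):
    # Every (start, end) pair with end - start >= 2 selects one phrase.
    for start in range(len(words) - 1):
        for end in range(start + 2, len(words) + 1):
            yield " ".join(words[start:end])

words = "the big fat cat sits on the mat eating a rat".split()
n = len(words)
phrases = list(phrase_builder(words))
expected = n * (n - 1) // 2
print(len(phrases), expected)  # 55 55
```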

2

First, you need to work out how to write those four lines in the same form. Rather than concatenating words and spaces by hand, use the join method:

phrase_list.append(" ".join(str(i[x+y]) for y in range(2)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(3)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(4)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(5)))

If the expression inside join isn't clear to you, here is the same thing written out by hand:

phrase = []
for y in range(2):
    phrase.append(str(i[x+y]))
phrase_list.append(" ".join(phrase))

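Once that's done, replacing those four lines with a loop is easy:

# Lengths 2 through phrase_length + 1 (2..5), matching the four lines above.
for length in range(2, phrase_length + 2):
    phrase_list.append(" ".join(str(i[x+y]) for y in range(length)))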

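There are a few other ways you can simplify this.

First, i[x+y] for y in range(length) can be written more simply as a slice: i[x:x+length]

And I'm guessing i is already a list of strings, so you can drop the str calls.

Also, range starts at 0 by default, so you can leave that argument out.

By the way, more meaningful variable names, such as words instead of i, would make your code easier to follow.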
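The slice equivalence mentioned above can be checked with a short sketch. It keeps the question's names i and x; the values chosen for x and length are arbitrary, and Python 3 is assumed.

```python
i = "the big fat cat sits on the mat eating a rat".split()
x, length = 2, 3

# Index-by-index version from the question, with the str() calls.
by_index = " ".join(str(i[x + y]) for y in range(length))
# Slice version: same words, no explicit indexing.
by_slice = " ".join(i[x:x + length])

print(by_index == by_slice, by_slice)  # True fat cat sits

# One difference: a slice that runs past the end is truncated,
# whereas i[x + y] would raise IndexError.
print(i[9:9 + 5])  # ['a', 'rat']
```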

So:

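def phrase_builder(words):
    phrase_length = 4
    phrase_list = []
    for i in range(len(words) - phrase_length):
        # Lengths 2..5, the same phrases the original four append lines produced.
        for length in range(2, phrase_length + 2):
            phrase_list.append(" ".join(words[i:i+length]))
    return phrase_list

Now your loops are simple enough to turn into a list comprehension, so the whole body becomes a single expression:

def phrase_builder(words):
    phrase_length = 4
    return [" ".join(words[i:i+length])
            for i in range(len(words) - phrase_length)
            for length in range(2, phrase_length + 2)]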

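One last point: as @SoundDefense mentioned, are you sure you don't want "eating a rat"? It starts fewer than 5 words from the end of the text, but it is still a 3-word phrase in the text.

If you want it, just drop the - phrase_length part. Slices that run past the end of the list are simply truncated, so the shorter phrases near the end appear, but so do leftovers such as the single word "rat", which you may want to filter out.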
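One way to include the phrases near the end without picking up one-word leftovers is to cap the length at the number of words remaining. The following is a rough sketch under that assumption rather than the answer's own code; the max_length parameter is a hypothetical name introduced here, and Python 3 is assumed.

```python
def phrase_builder(words, max_length=5):
    """Return phrases of 2..max_length consecutive words, including those near the end."""
    return [" ".join(words[i:i + length])
            for i in range(len(words) - 1)
            # Never ask for more words than remain after position i.
            for length in range(2, min(max_length, len(words) - i) + 1)]

words = "the big fat cat sits on the mat eating a rat".split()
phrases = phrase_builder(words)
print(phrases[-3:])  # ['eating a', 'eating a rat', 'a rat']
```

Passing max_length=len(words) lifts the five-word cap and gives every phrase the question asks for.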

17

Simply use itertools.combinations

from itertools import combinations
text = "the big fat cat sits on the mat eating a rat"
lst = text.split()
for start, end in combinations(range(len(lst)), 2):
    print(lst[start:end+1])

Output:

['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['the', 'big', 'fat', 'cat', 'sits', 'on']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']
['big', 'fat', 'cat', 'sits', 'on']
['big', 'fat', 'cat', 'sits', 'on', 'the']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['fat', 'cat']
['fat', 'cat', 'sits']
['fat', 'cat', 'sits', 'on']
['fat', 'cat', 'sits', 'on', 'the']
['fat', 'cat', 'sits', 'on', 'the', 'mat']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['cat', 'sits']
['cat', 'sits', 'on']
['cat', 'sits', 'on', 'the']
['cat', 'sits', 'on', 'the', 'mat']
['cat', 'sits', 'on', 'the', 'mat', 'eating']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['sits', 'on']
['sits', 'on', 'the']
['sits', 'on', 'the', 'mat']
['sits', 'on', 'the', 'mat', 'eating']
['sits', 'on', 'the', 'mat', 'eating', 'a']
['sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['on', 'the']
['on', 'the', 'mat']
['on', 'the', 'mat', 'eating']
['on', 'the', 'mat', 'eating', 'a']
['on', 'the', 'mat', 'eating', 'a', 'rat']
['the', 'mat']
['the', 'mat', 'eating']
['the', 'mat', 'eating', 'a']
['the', 'mat', 'eating', 'a', 'rat']
['mat', 'eating']
['mat', 'eating', 'a']
['mat', 'eating', 'a', 'rat']
['eating', 'a']
['eating', 'a', 'rat']
['a', 'rat']
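The output above is a series of word lists; if strings are wanted, as in the question, the same pairs can be joined. This is a small sketch that assumes Python 3 and reuses the name phrase_builder for illustration. It works because combinations(range(n), 2) yields each (start, end) pair with start < end exactly once.

```python
from itertools import combinations

def phrase_builder(words):
    # Each (start, end) pair with start < end marks the first and last word of a phrase.
    return [" ".join(words[start:end + 1])
            for start, end in combinations(range(len(words)), 2)]

text = "the big fat cat sits on the mat eating a rat"
phrases = phrase_builder(text.split())
print(len(phrases), phrases[0], "|", phrases[-1])  # 55 the big | a rat
```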
