Extracting sentences with Python

Question

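I want to extract the complete sentence that contains a particular word. Can anyone tell me how to do this in Python? I tried concordance(), but it only prints the lines in which the word appears.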

Answer 1

dutt's answer is a good one; I just want to add a couple of things.

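import re

text = "go directly to jail. do not cross go. do not collect $200."
# Raw string, so the backslashes reach the regex engine unchanged.
# Because the pattern starts with \., the first sentence of the text is never returned.
pattern = r"\.(?P<sentence>.*?(go).*?)\."
match = re.search(pattern, text)
if match is not None:
    sentence = match.group("sentence")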

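Obviously, you need to import the regular expression library (import re) before you start. Here is a breakdown of what the regular expression actually does (more information can be found on the Python re library page).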

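\. # matches the literal period that precedes the sentence.
(?P<sentence>...) # names the captured group "sentence".
.*? # matches any text (non-greedy) up to the word "go".
(go) # matches "go" itself (also inside longer words such as "good").
.*?)\. # matches the rest of the sentence, up to the next period.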

Again, the library reference page is well worth reading.
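
As a rough sketch that goes a little beyond the snippet in this answer: that pattern needs a period in front of the sentence, so it cannot return the first sentence of the text, and re.search stops at the first hit. One way around both, assuming sentences end only with ".", "!" or "?", is to collect every run of text up to a terminator and then filter by a whole-word match. The function name sentences_containing is made up for illustration.

```python
import re

def sentences_containing(text, word):
    """Return every sentence in `text` that contains `word` as a whole word.

    Assumption: a sentence ends at '.', '!' or '?'. Abbreviations such as
    "Dr." will therefore be split incorrectly.
    """
    # Each match is a run of characters up to and including one terminator.
    candidates = re.findall(r"[^.!?]+[.!?]", text)
    # \b keeps "go" from matching inside longer words such as "good".
    word_re = re.compile(r"\b" + re.escape(word) + r"\b")
    return [s.strip() for s in candidates if word_re.search(s)]

text = "go directly to jail. do not cross go. do not collect $200."
print(sentences_containing(text, "go"))
# prints: ['go directly to jail.', 'do not cross go.']
```

Filtering in a second step keeps the splitting rule and the word match separate, so either one can be swapped out later.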

Answer 2

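If you have each sentence in its own string, you can call find() with your word and return the sentence when it is found. Otherwise you could use a regex, something like this: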

import re

# yourwholetext is a placeholder for the string that holds your text.
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, yourwholetext)
if match is not None:
    sentence = match.group("sentence")

I haven't tested this, but it should be something along those lines.

My test:

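import re

text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
# The optional leading period lets the pattern match the first sentence as well.
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, text)
if match is not None:
    print(match.group("sentence"))  # prints: muffins are good, cookies are bad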
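
The find() idea mentioned at the start of this answer could look roughly like the following. It is only a sketch: it assumes the text has already been split into one string per sentence, and the function name find_sentences is invented for the example.

```python
def find_sentences(sentences, word):
    """Return the sentences (one per string) that contain `word`."""
    # str.find returns -1 when the substring is absent.
    return [s for s in sentences if s.find(word) != -1]

sentences = [
    "muffins are good, cookies are bad.",
    "sauce is awesome, veggies too.",
    "fmooo mfasss, fdssaaaa.",
]
print(find_sentences(sentences, "good"))
# prints: ['muffins are good, cookies are bad.']
```

Note that find() does plain substring matching, so searching for "go" would also return the sentence containing "good".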

Answer 3

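Just a quick reminder: sentence breaking is actually a fairly complex thing. There are many exceptions to the period rule, such as abbreviations like "Mr." or "Dr.", and there are several other sentence-ending punctuation marks. There are also exceptions to the exceptions: for example, if the next word is capitalized and is not a proper noun, then "Dr." can end a sentence.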
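
To make the abbreviation problem concrete, here is a small sketch of what a naive rule does. The sample text is made up, and the rule (split after every ".", "!" or "?" that is followed by whitespace) is just an assumed baseline, not a recommended approach.

```python
import re

text = "Dr. Smith went home. He was tired."

# Naive rule: a sentence ends at every '.', '!' or '?' followed by whitespace.
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# prints: ['Dr.', 'Smith went home.', 'He was tired.']
```

The abbreviation "Dr." is treated as a sentence of its own, which is exactly the kind of case a proper sentence tokenizer has to handle.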

If you're interested in this topic (it belongs to natural language processing), have a look at:
the punkt module of the Natural Language Toolkit (nltk).
