Open a file and read sentences

"He said, 'I'll pay you five pounds a week if I can have it on my own terms.' I'm a poor woman, sir, and Mr. Warren earns little, and the money meant much to me. He took out a ten-pound note, and he held it out to me then and there.

Thanks for your attention.

My current attempt:

import re

file_to_read = 'test.txt'
with open(file_to_read) as f:
    text = f.read()

word_list = ['Mrs.', 'Mr.']
for i in word_list:
    text = re.sub(i, i[:-1], text)

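What I get (in my test case) is that Mrs. is changed to Mr, while Mr. simply becomes Mr. I have tried a few other things, but nothing seems to work. The answer is probably simple, but I am missing it.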

2 answers

User

Answer 1 · edited 2024-05-29 05:09:52

Your regex will handle the text above if you do this:

with open(filename) as f:
    text = f.read()

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

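The only problem is that the regex also splits the text above at the dot in "Mr.", so you need to fix or change that.

One way around this, although not a perfect one, is to remove the dot that follows every Mr: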

text = re.sub(r'(M\w{1,2})\.', r'\1', text) # no for loop needed for this, like there was before

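This matches "M" followed by at least 1 and at most 2 word characters (\w{1,2}), followed by a dot. The parenthesized part of the pattern is grouped and captured, and the replacement refers to it as '\1' (group 1, since you can have more parenthesized groups). So, basically, Mr. or Mrs. is matched, but only the Mr or Mrs part is captured, and the whole match is then replaced with the captured part, which leaves out the dot. It is not perfect because any other capitalized two- or three-letter word starting with M and followed by a dot loses its dot as well.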
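To make the difference concrete, here is a small sketch that runs the loop from the question next to two alternatives. The sample string and the variable names `buggy`, `escaped` and `grouped` are made up for illustration. The `re.escape` variant is not part of the original answer, just one possible way to keep the loop.

```python
import re

# Made-up sample string, not taken from the question's file.
text = "Mrs. Warren spoke first. Mr. Warren earns little."

# The loop from the question: an unescaped '.' is a regex wildcard, so once
# "Mrs." has become "Mrs", the pattern 'Mr.' matches "Mrs" and turns it into "Mr".
buggy = text
for title in ['Mrs.', 'Mr.']:
    buggy = re.sub(title, title[:-1], buggy)

# Escaping each title makes the dot literal, so only a real "." is removed.
escaped = text
for title in ['Mrs.', 'Mr.']:
    escaped = re.sub(re.escape(title), title[:-1], escaped)

# The single pattern from this answer: group 1 keeps the title, the dot is dropped.
grouped = re.sub(r'(M\w{1,2})\.', r'\1', text)

print(buggy)
print(escaped)
print(grouped)
```

With this input, `escaped` and `grouped` should come out identical, while `buggy` loses the "s" in "Mrs".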

Then:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

will work the way you want.
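Putting the two steps together, a minimal sketch could look like the following. It uses an inline string (a shortened piece of the question's sample text) instead of reading a file, and the final filtering of empty strings is an addition that is not in the answer above.

```python
import re

# Shortened piece of the sample text from the question, inlined for illustration.
text = ("I'm a poor woman, sir, and Mr. Warren earns little, "
        "and the money meant much to me. He took out a ten-pound note, "
        "and he held it out to me then and there.")

# Step 1: drop the dot after titles such as "Mr." or "Mrs.".
no_title_dots = re.sub(r'(M\w{1,2})\.', r'\1', text)

# Step 2: split on sentence-ending punctuation plus any closing quote or bracket.
pieces = re.split(r' *[\.\?!][\'"\)\]]* *', no_title_dots)

# re.split leaves an empty string after the final dot, so filter it out.
sentences = [p for p in pieces if p]

for s in sentences:
    print(s)
```

Note that the split consumes the punctuation itself, so the resulting sentences no longer end in a dot.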

User

Answer 2 · edited 2024-05-29 05:09:52

You might want to try the text-sentence tokenizer module.

From their example code:

>>> from text_sentence import Tokenizer
>>> t = Tokenizer()
>>> list(t.tokenize("This is first sentence. This is second one!And this is third, is it?"))
[T('this'/sent_start), T('is'), T('first'), T('sentence'), T('.'/sent_end),
 T('this'/sent_start), T('is'), T('second'), T('one'), T('!'/sent_end),
 T('and'/sent_start), T('this'), T('is'), T('third'), T(','/inner_sep),
 T('is'), T('it'), T('?'/sent_end)]

But I have never tried it; I prefer using NLTK/punkt.
