Find the unique sentences in two files

text = 'Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 = 'Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'

import difflib

differ = difflib.Differ()
diff = differ.compare(text,text1)
print '\n'.join(diff)

2 Answers

User

Answer 1 · edited 2024-05-23 21:19:25

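As DZinoviev points out, you are passing strings to a function that expects lists. You don't need NLTK for this; you can turn each string into a list of sentences by splitting on periods.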

import difflib

text1 ="""Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry."""
text2 ="""Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry."""

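# Split on periods, trimming whitespace and dropping the empty piece after the final period.
list1 = [s.strip() for s in text1.split(".") if s.strip()]
list2 = [s.strip() for s in text2.split(".") if s.strip()]

differ = difflib.Differ()
diff = differ.compare(list1, list2)
print("\n".join(diff))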
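If you only want the sentences that are unique to each text, rather than the full diff, you can filter the output by its `- ` and `+ ` prefixes. The following is a minimal sketch, not a definitive implementation: `split_sentences` and `unique_sentences` are hypothetical helper names, and splitting on periods is naive (it would also break on abbreviations such as "e.g.").

```python
import difflib

def split_sentences(text):
    # Naive splitter (assumption): every period ends a sentence.
    return [s.strip() for s in text.split(".") if s.strip()]

def unique_sentences(text_a, text_b):
    """Return (only_in_a, only_in_b) using the prefixes that Differ emits."""
    diff = difflib.Differ().compare(split_sentences(text_a), split_sentences(text_b))
    only_in_a, only_in_b = [], []
    for line in diff:
        if line.startswith("- "):      # present only in the first sequence
            only_in_a.append(line[2:])
        elif line.startswith("+ "):    # present only in the second sequence
            only_in_b.append(line[2:])
        # Lines starting with "  " (common) or "? " (hints) are ignored.
    return only_in_a, only_in_b

text1 = """Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry."""
text2 = """Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry."""

only_in_1, only_in_2 = unique_sentences(text1, text2)
print(only_in_1)
print(only_in_2)
```

Because the periods are consumed by the split, the returned sentences have no trailing period.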

User

Answer 2 · edited 2024-05-23 21:19:25

First, Differ().compare() compares lines, not sentences.

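Second, it actually compares sequences, such as lists of strings. You, however, are passing two strings rather than two lists of strings. Since a string is itself a sequence (of characters), in your example Differ().compare() ends up comparing individual characters.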
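The difference is easy to see on a tiny input. This is only an illustrative sketch; the strings "abc" and "abd" are made up for the example and do not come from the question.

```python
import difflib

differ = difflib.Differ()

# Two plain strings: every character is treated as a separate "line".
char_diff = list(differ.compare("abc", "abd"))

# Two lists of strings: each list item is compared as a whole.
item_diff = list(differ.compare(["abc"], ["abd"]))

print(char_diff)
print(item_diff)
```

In the first call every entry is a two-character prefix followed by a single character, which is why the output for the original texts looks like a character-by-character listing.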

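If you want to compare the files sentence by sentence, you have to prepare two lists of sentences. You can use nltk.sent_tokenize(text) to split a string into sentences.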

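import difflib
import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

differ = difflib.Differ()
diff = differ.compare(nltk.sent_tokenize(text), nltk.sent_tokenize(text1))
print('\n'.join(diff))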
#  Physics is one of the oldest academic disciplines.
#- Perhaps the oldest through its inclusion of astronomy.
#- Over the last two millennia.
#  Physics was a part of natural philosophy along with chemistry.
#+ Quantum chemistry is a branch of chemistry.
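If NLTK is not available and you only need the sentences that appear in one text but not the other, a rough standard-library alternative is sketched below. It rests on assumptions that are not part of the answer above: `split_sentences` is a hypothetical helper, and the regular expression simply treats ".", "!" or "?" followed by whitespace as a sentence boundary, which is far cruder than NLTK's tokenizer.

```python
import re

def split_sentences(text):
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = 'Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 = 'Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'

sentences_a = split_sentences(text)
sentences_b = split_sentences(text1)

# Sets give fast membership tests; the list comprehensions keep the original order.
set_a, set_b = set(sentences_a), set(sentences_b)
only_in_a = [s for s in sentences_a if s not in set_b]
only_in_b = [s for s in sentences_b if s not in set_a]

print(only_in_a)
print(only_in_b)
```

Unlike Differ, this ignores sentence order entirely, so a sentence that merely moved to a different position is not reported as unique.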
