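I want to take a text and find out how close certain labels (here, names of people) are to each other in it. Basically, the idea is to check whether two people appear fewer than 14 words apart; if they do, we say they are related.
My naive implementation works, but only when each person is a single word, because I iterate over the words.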
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi,w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi+1,wi+x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi],ws[i])
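Now I would like to upgrade this script to allow multi-word names such as 'Lady Burlesdon'. I am not entirely sure what the best approach is. Any hints are welcome.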
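You could first preprocess the text so that every name in text is replaced by a single-word id. The id must be a string that you do not expect to appear as another word in the text. While preprocessing, keep a mapping from id to name so that you know which name corresponds to which id. This lets you keep your current algorithm as it is.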
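A minimal sketch of that preprocessing idea, under a few assumptions: the helper names encode_names and related_pairs and the __PERSONn__ id format are made up for illustration, and the ids are assumed never to occur in the real text. Names are matched with a regular expression so that a name wrapped across a line break (as 'Lady Burlesdon' is in the sample) is still found.

```python
import re

def encode_names(text, names):
    """Replace each name in text with a single-word id.

    Returns the rewritten text and a dict mapping id -> original name.
    """
    id_to_name = {}
    # Longest names first, so a long name is replaced before any
    # shorter name that happens to be contained in it.
    for n, name in enumerate(sorted(names, key=len, reverse=True)):
        token = f"__PERSON{n}__"  # hypothetical id format, assumed absent from the text
        id_to_name[token] = name
        # \s+ between the parts lets the name match across spaces or newlines.
        pattern = r"\b" + r"\s+".join(map(re.escape, name.split())) + r"\b"
        # Pad with spaces so the id always splits off as its own word.
        text = re.sub(pattern, f" {token} ", text)
    return text, id_to_name

def related_pairs(text, names, window=14):
    """Return (name, name) pairs that occur fewer than `window` words apart."""
    encoded, id_to_name = encode_names(text, names)
    ws = encoded.split()
    pairs = []
    for wi, w in enumerate(ws):
        if w not in id_to_name:
            continue
        # Same range as the original loop; slicing handles the end of the list.
        for other in ws[wi + 1 : wi + window]:
            if other in id_to_name:
                pairs.append((id_to_name[w], id_to_name[other]))
    return pairs

sample = "Robert spoke to Lady\nBurlesdon about Rudolf the Third yesterday"
people = ["Robert", "Rudolf the Third", "Lady Burlesdon"]
print(related_pairs(sample, people, window=3))
# [('Lady Burlesdon', 'Rudolf the Third')]
```

One thing to keep in mind with this approach: every multi-word name collapses to one word, so distances are counted in words of the encoded text, and a name like 'Rudolf the Third' counts as a single word rather than three.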