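Match trigrams, bigrams, and unigrams to a text; skip a unigram or bigram if it is a substring of an already matched trigram; python
main_text is a list of sentences that have already been part-of-speech tagged: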
main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes','VB'),
('tea','NN'), ('and','CC'), ('hats', 'NN')], [('the', 'DT'), ('red','JJ'),
('queen', 'NN'), ('hates','VB'),('alice','NN')]]
ngrams_to_match is a list of part-of-speech-tagged trigrams:
ngrams_to_match = [[('likes','VB'),('tea','NN'), ('and','CC')],
[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
[('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC') ],
[('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')]]
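(a) For each sentence in main_text, first check whether any complete trigram in ngrams_to_match matches. If a matching trigram is found, return the matched trigram and the sentence.
(b) Then check whether the first tuple (unigram) or the first two tuples (bigram) of each trigram match in main_text.
(c) If the unigram or bigram is a substring of an already matched trigram, return nothing. Otherwise, return the matched bigram or unigram together with the sentence.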
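One way to read conditions (a) to (c) for a single sentence and a single trigram is "try the longest prefix first and stop at the first hit". The sketch below illustrates only that reading; the helper names (prefixes_longest_first, find_words, longest_prefix_match) are hypothetical, words are compared without their tags, and exclusions across different sentences or trigrams are not handled.

```python
def prefixes_longest_first(trigram):
    """Return the trigram, its leading bigram, and its leading unigram."""
    return [trigram[:n] for n in (3, 2, 1)]


def find_words(sentence, gram):
    """Return the first index where the words of gram occur in sentence, else -1.

    Only the word (element 0 of each tuple) is compared, not the tag.
    """
    target = [word for word, _tag in gram]
    words = [word for word, _tag in sentence]
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            return start
    return -1


def longest_prefix_match(sentence, trigram):
    """Return the longest prefix of trigram found in sentence, or None."""
    for gram in prefixes_longest_first(trigram):
        if find_words(sentence, gram) != -1:
            return gram  # stop here, so shorter prefixes are never reported
    return None


sentence = [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'),
            ('hates', 'VB'), ('alice', 'NN')]
trigram = [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC')]
print(longest_prefix_match(sentence, trigram))
```

Because the search stops at the first (longest) hit, the unigram inside a matched bigram or trigram is never reported for the same trigram.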
Here is the output I should get:
trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
bigram_match = [('hates', 'DT'), ('alice', 'JJ')], sentence[1]
Condition (b) is what gives us bigram_match.
The wrong output would be:
trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
bigram_match = [('the', 'DT'), ('mad', 'JJ')] #*bad by condition c
unigram_match = [('the', 'DT')] #*bad by condition c
trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
bigram_match = [('likes','VB'),('tea','NN')] #*bad by condition c
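unigram_match = [('likes', 'VB')] #*bad by condition c
and so on.
The following very ugly code works reasonably well on this simple example, but I am wondering whether anyone has a more concise approach.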
matched_trigrams = []  # word triples that have already been reported
for ngram in ngrams_to_match:
    for sentence in main_text:
        for tup in sentence:
            # we can't be sure that our part-of-speech tagger will
            # tag an ngram word and a main_text word the same way, so
            # we match the word in the tuple, not the whole tuple
            if ngram[0][0] == tup[0]:  # if the first ngram word matches...
                unigram_index = sentence.index(tup)  # ...then this is our index
                unigram = sentence[unigram_index][0]  # save it as a unigram
                bigram = trigram = None  # reset so stale matches are not reused
                try:
                    if sentence[unigram_index+1][0] == ngram[1][0]:  # get bigram match
                        bigram = (unigram, ngram[1][0])  # save the match
                        if sentence[unigram_index+2][0] == ngram[2][0]:  # match a trigram
                            trigram = (unigram, ngram[1][0], ngram[2][0])  # save the match
                except IndexError:
                    pass  # the sentence ends before the ngram does
                if trigram:
                    matched_trigrams.append(trigram)
                    print('heres the trigram-->', sentence, '\n', 'trigram--->', trigram)
                elif bigram:
                    # no substring matches
                    if not any(bigram in (t[:2], t[1:]) for t in matched_trigrams):
                        print('heres a sentence-->', sentence, '\n', 'bigram--->', bigram)
                elif not any(unigram in t for t in matched_trigrams):  # no substring matches
                    print(unigram)
1 Answer
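I tried implementing this with a generator. Some parts of your specification were unclear to me, so I made a few assumptions.
"If the unigram or bigram is a substring of an already matched trigram, return nothing" - this statement is somewhat ambiguous: it is not clear whether it refers to the elements being searched for or to the elements that have already been matched.
You can adjust what gets added to the found set in order to change which search elements are excluded.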
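As a sketch of what adjusting the contents of found could mean: the code that follows adds the proper prefixes of each matched row to found. The two helpers below are hypothetical and not part of the answer's code; proper_prefixes reproduces that behaviour, while proper_subgrams is an assumed, wider alternative that would also exclude n-grams taken from the middle or the end of a match.

```python
def proper_prefixes(row):
    """Shorter leading n-grams of row, e.g. (a, b, c) -> {(a,), (a, b)}."""
    return {tuple(row[:n]) for n in range(1, len(row))}


def proper_subgrams(row):
    """Every contiguous n-gram of row that is shorter than row itself."""
    return {
        tuple(row[start:start + n])
        for n in range(1, len(row))
        for start in range(len(row) - n + 1)
    }


row = (('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'))
print(sorted(proper_prefixes(row), key=len))
print(len(proper_subgrams(row)))
```

Which of the two is appropriate depends on how "substring of an already matched trigram" is meant to be read.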
# assumptions:
# - [('hates','DT'),('alice','JJ'),('but','CC')] is typoed and should be:
# [('hates','VB'),('alice','NN'),('but','CC')]
# - matches can't overlap, matched elements are excluded from further checking
# - bigrams precede unigrams
main_text = [
[('the','DT'),('mad','JJ'),('hatter','NN'),('likes','VB'),('tea','NN'),('and','CC'),('hats','NN')],
[('the','DT'),('red','JJ'),('queen','NN'),('hates','VB'),('alice','NN')]
]
ngrams_to_match = [
[('likes','VB'),('tea','NN'),('and','CC')],
[('the','DT'),('mad','JJ'),('hatter','NN')],
[('hates','VB'),('alice','NN'),('but','CC')],
[('and','CC'),('the','DT'),('rabbit','NN')]
]
def slice_generator(sentence, size=3):
    """
    Generate slices through the sentence in decreasing sized windows. If True is sent to the
    generator, the elements from the previous window will be excluded from future slices.
    """
    sent = list(sentence)
    for c in range(size, 0, -1):
        for i in range(len(sent)):
            window = tuple(sent[i:i+c])  # named window to avoid shadowing the built-in slice
            if all(x is not None for x in window) and len(window) == c:
                used = yield window
                if used:
                    # blank out exactly the c elements that were just matched
                    sent[i:i+c] = [None] * c
def gram_search(text, matches):
    # every trigram plus its leading bigram and its leading unigram
    tri_bi_uni = (set(tuple(x) for x in matches)
                  | set(tuple(x[:2]) for x in matches)
                  | set(tuple(x[:1]) for x in matches))
    found = set()
    for i, sentence in enumerate(text):
        gen = slice_generator(sentence)
        send = None
        try:
            while True:
                row = gen.send(send)
                if row in tri_bi_uni - found:
                    send = True
                    # exclude the shorter prefixes of this match from later searches
                    found |= set(tuple(row[:x]) for x in range(1, len(row)))
                    print("%s_gram_match, sentence[%s] = %r" % (len(row), i, row))
                else:
                    send = False
        except StopIteration:
            pass

gram_search(main_text, ngrams_to_match)
Output:
3_gram_match, sentence[0] = (('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'))
3_gram_match, sentence[0] = (('likes', 'VB'), ('tea', 'NN'), ('and', 'CC'))
2_gram_match, sentence[1] = (('hates', 'VB'), ('alice', 'NN'))
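The code above compares whole (word, tag) tuples, which is why it assumes the tags in the question were typos. If the tagger really can disagree between the text and the n-grams, as the comment in the question's code suggests, one option is to strip the tags before comparing. The following is a minimal sketch of that idea; words_only and search_candidates are hypothetical helpers, not part of the code above.

```python
def words_only(gram):
    """Drop the POS tags, keeping only the words of an n-gram or sentence."""
    return tuple(word for word, _tag in gram)


def search_candidates(trigrams):
    """Word-only trigrams plus their leading bigrams and unigrams."""
    candidates = set()
    for trigram in trigrams:
        words = words_only(trigram)
        candidates.update(words[:n] for n in (1, 2, 3))
    return candidates


ngrams_to_match = [
    [('likes', 'VB'), ('tea', 'NN'), ('and', 'CC')],
    [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC')],  # tags as in the question
]
sentence = [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'),
            ('hates', 'VB'), ('alice', 'NN')]

candidates = search_candidates(ngrams_to_match)
words = words_only(sentence)
bigrams_in_sentence = [words[i:i + 2] for i in range(len(words) - 1)]
print([b for b in bigrams_in_sentence if b in candidates])
```

With the tags removed, the bigram for "hates alice" is found even though the question tags those words DT and JJ while the text tags them VB and NN.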