Match trigrams, bigrams, and unigrams to a text; skip a unigram or bigram if it is a substring of an already matched trigram; Python

Question

main_text is a list of sentences that have already been part-of-speech tagged:

 main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes', 'VB'),
               ('tea', 'NN'), ('and', 'CC'), ('hats', 'NN')],
              [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'), ('hates', 'VB'),
               ('alice', 'NN')]]

ngrams_to_match is a list of part-of-speech-tagged trigrams:

 ngrams_to_match = [[('likes','VB'),('tea','NN'), ('and','CC')],
                    [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
                    [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC') ],
                    [('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')]]

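(a) For each sentence in main_text, first check whether any complete trigram in ngrams_to_match matches. If one does, return the matched trigram and the sentence.

(b) Then check whether the first tuple (a unigram) or the first two tuples (a bigram) of each trigram match in main_text.

(c) If the unigram or bigram is a substring of an already matched trigram, return nothing. Otherwise, return the matched bigram or unigram and the sentence.

Here is the output I should get: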

 trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
 trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
 bigram_match = [('hates', 'DT'), ('alice', 'JJ')], sentence[1]

Condition (b) gives us the bigram_match.
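Note that in the example the tags on `hates` and `alice` differ between the n-gram and sentence[1], so this bigram can only match if the comparison looks at the words and ignores the tags. A minimal sketch of such a word-level check follows; the helper names `words` and `find_words` are made up for illustration and are not part of any library.

```python
def words(tagged):
    """Drop the POS tags and keep only the words."""
    return [word for word, _tag in tagged]


def find_words(needle, sentence):
    """Return the first index at which the words of `needle` occur
    contiguously in `sentence`, or -1 if they do not occur."""
    target = words(needle)
    tokens = words(sentence)
    for start in range(len(tokens) - len(target) + 1):
        if tokens[start:start + len(target)] == target:
            return start
    return -1


sentence = [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'),
            ('hates', 'VB'), ('alice', 'NN')]
ngram = [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC')]

print(find_words(ngram, sentence))      # -1: the full trigram is not there
print(find_words(ngram[:2], sentence))  # 3: the bigram matches on words
print(ngram[:2] == sentence[3:5])       # False: the tagged tuples differ
```

Comparing whole `(word, tag)` tuples would miss this match, which is why the sketch strips the tags first.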

The wrong output would be:

 trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
 bigram_match = [('the', 'DT'), ('mad', 'JJ')]  # *bad by condition (c)
 unigram_match = [('the', 'DT')]  # *bad by condition (c)
 trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
 bigram_match = [('likes','VB'),('tea','NN')]  # *bad by condition (c)
 unigram_match = [('likes', 'VB')]  # *bad by condition (c)

And so on.

The very ugly code below works well enough on this simple example, but I wonder whether anyone has a more concise approach.

 matched_trigrams = []  # word triples matched so far, used for condition (c)

 for ngram in ngrams_to_match:
     for sentence in main_text:
         for unigram_index, tup in enumerate(sentence):

             # we can't be sure that our part-of-speech tagger will
             # tag an ngram word and a main_text word the same way, so
             # we match the word in the tuple, not the whole tuple

             if ngram[0][0] != tup[0]:  # word in the first ngram tuple must match...
                 continue
             unigram = tup[0]  # ...then save it as a unigram
             bigram = trigram = None

             try:
                 if sentence[unigram_index + 1][0] == ngram[1][0]:  # get bigram match
                     bigram = (unigram, ngram[1][0])  # save the match
                     if sentence[unigram_index + 2][0] == ngram[2][0]:  # match a trigram
                         trigram = (unigram, ngram[1][0], ngram[2][0])  # save the match
             except IndexError:  # ran off the end of the sentence
                 pass

             if trigram:
                 matched_trigrams.append(trigram)
                 print('here is the sentence--> %s\ntrigram---> %s' % (sentence, trigram))
             elif bigram:
                 # no substring matches (only sees trigrams matched so far)
                 if not any(bigram in (t[:2], t[1:]) for t in matched_trigrams):
                     print('here is a sentence--> %s\nbigram---> %s' % (sentence, bigram))
             elif not any(unigram in t for t in matched_trigrams):  # no substring matches
                 print(unigram)
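One possible tidier shape, offered as a sketch and not a definitive answer: find all trigram matches first, then report for each remaining n-gram only its longest matching prefix. It assumes that condition (c) means "the words of the bigram or unigram occur contiguously in some trigram that matched anywhere in main_text", which is the reading that reproduces the expected output above. All function names here (`words`, `find_all`, `match_ngrams`) are hypothetical.

```python
LABELS = {3: 'trigram_match', 2: 'bigram_match', 1: 'unigram_match'}


def words(tagged):
    """Drop the POS tags; matching is done on words only."""
    return tuple(word for word, _tag in tagged)


def find_all(needle, haystack):
    """Every start index where `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1)
            if haystack[i:i + n] == needle]


def match_ngrams(main_text, ngrams_to_match):
    sentences = [words(sentence) for sentence in main_text]
    hits = []     # (sentence index, start position, label, tagged n-gram)
    matched = []  # word tuples of every trigram that matched somewhere

    def search(ngram, size):
        prefix = words(ngram[:size])
        return [(s_idx, start, LABELS[size], ngram[:size])
                for s_idx, sent in enumerate(sentences)
                for start in find_all(prefix, sent)]

    # Pass 1, condition (a): full trigram matches.
    for ngram in ngrams_to_match:
        found = search(ngram, 3)
        if found:
            hits.extend(found)
            matched.append(words(ngram))

    # Pass 2, conditions (b) and (c): bigram first, then unigram.
    for ngram in ngrams_to_match:
        for size in (2, 1):
            prefix = words(ngram[:size])
            if any(find_all(prefix, tri) for tri in matched):
                break  # (c): inside a matched trigram, and so is any shorter prefix
            found = search(ngram, size)
            if found:
                hits.extend(found)
                break  # report only the longest matching prefix

    hits.sort(key=lambda hit: hit[:2])  # order by sentence, then position
    return [(label, gram, s_idx) for s_idx, _start, label, gram in hits]


main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes', 'VB'),
              ('tea', 'NN'), ('and', 'CC'), ('hats', 'NN')],
             [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'), ('hates', 'VB'),
              ('alice', 'NN')]]
ngrams_to_match = [[('likes', 'VB'), ('tea', 'NN'), ('and', 'CC')],
                   [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
                   [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC')],
                   [('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')]]

for label, gram, s_idx in match_ngrams(main_text, ngrams_to_match):
    print('%s = %s, sentence[%d]' % (label, gram, s_idx))
```

On the example data this prints the two trigram matches for sentence[0] followed by the `hates`/`alice` bigram match for sentence[1]. Doing the trigram pass first removes the dependence on the order of ngrams_to_match that a single-pass loop has.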
