Matching non-adjacent keywords in a large text corpus in Python

Question

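I need to match non-adjacent keywords in a large collection of texts (several thousand). If a text matches, it gets a label; if it does not, it gets the label "unknown".

For example, I want to find the keywords sales representative and dealt in the following text and assign them to keyword pattern A:

Text: "The sales representative dealt with everything. It was so helpful to know he had sorted out the best options for me."

So the keyword pattern is sales representative and dealt.

Because a "sales representative" may also be called a "salesperson" or "customer rep", I need to match several keywords. The same goes for "dealt". So you can see why this gets complicated.

There are plenty of ways to find and match single words or adjacent words (n-grams), and I have implemented that myself. Now I need to find different keywords that are not written next to each other and label them. I also don't know what is written between the keywords; it could be anything.

I'm using a lexical approach to find the keywords: a dictionary with separate columns for matching one, two, or three keywords. Note that each keyword is always a single word or a two-word combination. Again, I don't know what is written between the keywords. Below is some code I wrote.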

import pandas as pd

# create mock dictionary
Dict = pd.DataFrame({'word1': ['dealt', 'dealt', 'dealt', ''],
                     'word2': ['sales representative', 'sales rep', 'customer rep', 'options']})

# create sample texts
texts = ["The sales representative dealt with everything.",
         "The sales rep dealt with everything.",
         "The agent answered all questions",
         "The customer rep answered all questions.",
         "The agent dealt with everything."]

motive = []
# only checks for the keyword in the first column
for item in texts:
    item = str(item)
    if any(x in item for x in Dict['word1']):
        motive.append('keyword pattern A')
    else:
        motive.append('unknown')

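A label should be assigned only when both dealt and sales representative appear in the text, so the labels for sentences 3 and 5 are wrong. I therefore updated the code. It runs without errors, but it does not assign any labels.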

for item in texts:
    #convert into string
    item = str(item)
    #check if keyword can be found in first column
    tempM1 = {x for x in Dict['word1'] if x in item}
    #check if keyword was found
    if tempM1 != None:
        #if yes, locate all of their positions in the dictionary 
        for i in tempM1:
            i = -1
            #get row index 
            ind = Dict.index[Dict['word1'] == list(tempM1)[i+1]] 
            # ind is a pandas.core.indexes.base.Index
            # check that the cell next to the given row index is not empty
            if pd.isnull(Dict['word2'].iloc[ind]) is False:
                #match keyword in second column
                tempM2 = {x for x in Dict['word2'] if x in item}
                #if second keyword was found
                if tempM2 != None: 
                    motive.append('keyword pattern A')
                else: 
                    # check the first keyword column again
                    tempM3 = {x for x in Dict['word1'] if x in item}
                    if tempM3 != None:
                        motive.append('keyword pattern A')
                    else: 
                        motive.append('unknown')

How can I adjust the code above?
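To make the requirement concrete, here is a minimal sketch of row-wise matching: a text gets the label only if every non-empty keyword in at least one dictionary row occurs somewhere in it, in any order and with anything in between. The names `keyword_table` and `label_text` are hypothetical, and treating an empty cell as "no keyword required in this column" is an assumption about how the dictionary is meant to be read.

```python
import pandas as pd

# Mock keyword dictionary; each row is one keyword combination.
# Assumption: an empty cell means that column imposes no requirement.
keyword_table = pd.DataFrame({
    "word1": ["dealt", "dealt", "dealt", ""],
    "word2": ["sales representative", "sales rep", "customer rep", "options"],
})

texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
]


def label_text(text, table, label="keyword pattern A", default="unknown"):
    """Return `label` if all non-empty keywords of any row occur in `text`."""
    lowered = str(text).lower()
    for row in table.itertuples(index=False):
        # Keep only real keywords; skip empty strings and missing values.
        required = [kw.lower() for kw in row if isinstance(kw, str) and kw]
        # Plain substring checks, so order and distance do not matter.
        if required and all(kw in lowered for kw in required):
            return label
    return default


motive = [label_text(text, keyword_table) for text in texts]
print(motive)
```

Because the check is a plain substring test, "sales rep" would also match inside "sales representative"; whether that is acceptable depends on the keyword list.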

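I'm aware of regular expressions (RegEx). My impression is that they would require more lines of code and, given the number of keywords (roughly 700 to 1,000) and their combinations, might be less efficient. If I'm wrong about that, I'm happy to be corrected!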
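For comparison, a regex-based version could look roughly like the sketch below. It assumes the keywords can be organised into synonym groups per pattern (the `pattern_groups` layout and the extra synonym "handled" are hypothetical, not taken from my real dictionary). Each group is compiled once into a single alternation, and a text matches a pattern only if every group is found somewhere in it, so the words between the keywords are irrelevant. Whether this is fast enough for 700 to 1,000 keywords would need to be measured.

```python
import re

# Hypothetical layout: label -> list of synonym groups.
# A text must contain at least one keyword from EVERY group of a pattern.
pattern_groups = {
    "keyword pattern A": [
        ["sales representative", "sales rep", "customer rep"],
        ["dealt", "handled"],  # "handled" is an illustrative synonym
    ],
}


def compile_group(keywords):
    """Compile one synonym group into a single case-insensitive regex."""
    # Longer keywords first, so the alternation tries them before shorter ones.
    ordered = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(kw) for kw in ordered)
    # \b keeps "sales rep" from matching inside a longer word.
    return re.compile(rf"\b(?:{body})\b", re.IGNORECASE)


compiled_patterns = {
    label: [compile_group(group) for group in groups]
    for label, groups in pattern_groups.items()
}


def label_with_regex(text, patterns, default="unknown"):
    """Return the first label whose groups are all found in `text`."""
    for label, group_regexes in patterns.items():
        if all(rx.search(text) for rx in group_regexes):
            return label
    return default


texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
]

labels = [label_with_regex(text, compiled_patterns) for text in texts]
print(labels)
```

Since each group is a separate search, the keywords may appear in either order, and the matched groups can be reported alongside the label to keep the result explainable.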

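I also know this could be treated as a classification problem. The project needs to be explainable and transparent, so deep learning and similar approaches are not suitable. For the same reason, I'm not considering embeddings either.

Thanks!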

