Matching non-adjacent keywords in a large text corpus in Python
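I need to match non-adjacent keywords across a large collection of texts (several thousand). If a text matches, it gets a label; if it does not, it gets the label "unknown".
For example, I want to find the keywords sales representative and dealt in the following text and categorise them as keyword pattern A:
Text: "The sales representative dealt with everything. Knowing that he sorted out the best options for me was really helpful."
- So the keyword pattern is sales representative and dealt.
- Because "sales representative" may also be written as "sales rep" or "customer rep", I need to match several alternative keywords. The same holds for "dealt". You can see why this gets complicated.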
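There are many ways to find and match single words or adjacent words (n-grams), and I have implemented that myself. What I need now is to find different keywords that are not written next to each other and label the text accordingly. I also do not know what is written between the keywords; it could be anything.
I am using a lexical approach: a dictionary with separate columns for matching one, two, or three keywords. Note that each keyword is always a single word or a two-word combination. Below is some of the code I have written.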
import pandas as pd

# create mock dictionary
Dict = pd.DataFrame({
    'word1': ['dealt', 'dealt', 'dealt', ''],
    'word2': ['sales representative', 'sales rep', 'customer rep', 'options'],
})

# create sample texts
texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
]

motive = []
# only checks for the keyword in the first column
for item in texts:
    item = str(item)
    if any(x in item for x in Dict['word1']):
        motive.append('keyword pattern A')
    else:
        motive.append('unknown')
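A label should only be assigned when both dealt and sales representative appear in the text, so the labels for sentences 3 and 5 are wrong. I therefore updated the code. It runs without errors, but it does not assign any label.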
for item in texts:
    # convert into string
    item = str(item)
    # check if keyword can be found in first column
    tempM1 = {x for x in Dict['word1'] if x in item}
    # check if keyword was found
    if tempM1 != None:
        # if yes, locate all of their positions in the dictionary
        for i in tempM1:
            i = -1
            # get row index
            ind = Dict.index[Dict['word1'] == list(tempM1)[i+1]]
            # gives pandas.core.indexes.base.Index
            # check if column next to given row index is not empty
            if pd.isnull(Dict['word2'].iloc[ind]) is False:
                # match keyword in second column
                tempM2 = {x for x in Dict['word2'] if x in item}
                # if second keyword was found
                if tempM2 != None:
                    motive.append('keyword pattern A')
                else:
                    # check again first keyword column
                    tempM3 = {x for x in Dict['word1'] if x in item}
                    if tempM3 != None:
                        motive.append('keyword pattern A')
                    else:
                        motive.append('unknown')
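How can I adjust the code above?
I am aware of regular expressions (RegEx). I suspect they would need more lines of code and, given the number of keywords (roughly 700 to 1,000) and the combinations between them, would be less efficient. I am happy to be proven wrong, though!
I also know this can be framed as a classification problem. The project requires explainability and transparency, so deep learning and similar approaches are not an option. For the same reason I am not considering embeddings.
Thanks!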
1 Answer
Could you use all() and any() to check whether a text contains any of the alternatives from every one of the match lists?
phrases_to_find = [
    [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"]
    ],
    [
        ["option"]
    ]
]

texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option."
]

motive = []
for text in texts:
    for index, test_phrases in enumerate(phrases_to_find):
        # a pattern matches when every group has at least one phrase in the text
        if all(any(p in text for p in phrase) for phrase in test_phrases):
            motive.append(f'keyword pattern {index}')
            break
    else:
        # the inner loop finished without a break: no pattern matched
        motive.append('unknown')

print(motive)
That should give you:
[
    'keyword pattern 0',
    'keyword pattern 0',
    'unknown',
    'unknown',
    'unknown',
    'keyword pattern 1'
]
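
One caveat with plain `in` checks is that they match substrings, so "option" would also be found inside "adoption". If that matters for your data, a possible variation is to compile one regular expression per group of alternatives, with word boundaries and case-insensitive matching. This is only a sketch under assumptions: the `PATTERNS` table, the labels, and the helper names `compile_group` and `label_text` are made up for illustration and are not part of your dictionary. I have not measured it against your 700 to 1,000 keywords, so treat the efficiency as something to check rather than a given.

```python
import re

# Hypothetical pattern table: label -> list of synonym groups.
# A text gets the label when EVERY group has at least one synonym in the text.
PATTERNS = {
    "keyword pattern A": [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"],
    ],
    "keyword pattern B": [
        ["option", "options"],
    ],
}


def compile_group(synonyms):
    """Compile one synonym group into a whole-word, case-insensitive regex."""
    # Longest alternatives first, so longer phrases are tried before shorter ones.
    ordered = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


# Compile once up front; each text is then scanned once per group.
COMPILED = {
    label: [compile_group(group) for group in groups]
    for label, groups in PATTERNS.items()
}


def label_text(text, compiled=COMPILED, default="unknown"):
    """Return the first label whose groups all match somewhere in the text."""
    for label, group_regexes in compiled.items():
        if all(rx.search(text) for rx in group_regexes):
            return label
    return default


texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option.",
]

labels = [label_text(t) for t in texts]
print(labels)
```

Because the words between the keywords are never inspected, it does not matter what is written between them or in which order the keywords appear. Each label maps back to an explicit list of keywords, which keeps the result easy to explain.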