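Capturing 1-word and 2-word proper nouns with a regular expression
I came up with the following. I have narrowed the problem down to not being able to capture 1-word and 2-word proper nouns at the same time.
(1) It would be great if I could set a condition so that, when there are two options, the program defaults to the longer one.
And
(2) it would be great if I could tell the regex to consider this only when the string starts with a preposition, such as "On", "At", or "For". I tried something like the following, but it didn't work: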
(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})
How can I do 1 and 2?
My current regex is
r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'
The ones I want to capture are: Ashoka, Shift Series, Compass Partners, and Kenneth Cole.
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
3 Answers
1
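I would suggest using a natural language processing tool; the most popular one for Python seems to be nltk. Regular expressions really aren't a good way to tackle this problem... There is an example on the front page of the nltk site, which an earlier answer also mentioned. Here it is, copied and pasted: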
import nltk

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
# tokens is now:
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
tagged = nltk.pos_tag(tokens)           # part-of-speech tag for each token
entities = nltk.chunk.ne_chunk(tagged)  # group tagged tokens into named-entity chunks
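Now entities contains the words tagged with Penn Treebank part-of-speech tags, with the recognized named entities grouped into chunks.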
1
This isn't entirely correct, but it matches most of what you're looking for. The exception is On, which gets matched as well.
import re

text = """
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
# One capitalized word, optionally followed by a second capitalized word
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
matches = p.findall(text)
print(matches)
Output:
[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]
You could then implement a filter to go through this list.
def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an", "on", "in", "foo", "bar"]  # etc.
    for match in unfiltered_matches:
        # Compare case-insensitively so "On" is caught by "on"
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches
Or, because Python is cool:
def filter_false_positive(unfiltered_matches):
    black_list = ["an", "on", "in", "foo", "bar"]  # etc.
    return [match for match in unfiltered_matches if match.lower() not in black_list]
You can use it like this:
# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]  # keep only the full match from each tuple
matches = filter_false_positive(matches)
print(matches)
The final output is:
['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
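The two steps above (match, then filter) can also be folded into one small function. This is only a sketch: the names find_proper_nouns and STOP_WORDS are made up for illustration, and the stop list is an assumption you would need to extend. It uses a non-capturing group, (?:...), so that findall returns plain strings rather than tuples.

```python
import re

# Non-capturing group (?:...) so findall() returns plain strings, not tuples.
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')
# Assumed stop list; extend it with whatever sentence-initial words show up.
STOP_WORDS = {"an", "on", "in", "at", "for"}

def find_proper_nouns(text):
    """Return capitalized one- or two-word matches that are not stop words."""
    return [m for m in PROPER_NOUN.findall(text) if m.lower() not in STOP_WORDS]

text = ("On its 25th anniversary, Ashoka\n"
        "at the Shift Series national conference, Compass Partners "
        "and fashion designer Kenneth Cole")
print(find_proper_nouns(text))
# ['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
```

Note that the filter only removes a stop word when it is matched on its own. A string such as "On Ashoka" would be matched as a single two-word name and slip through.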
Telling whether a word is capitalized because it starts a sentence or because it is a proper noun is not a simple problem.
'Kenneth Cole is a brand name.' v.s. 'Can I eat something now?' v.s. 'An English man had tea'
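In cases like these the problem is quite tricky, so without some other criterion for identifying proper nouns, such as a blacklist or a database, it won't be easy. Regex is great, but I don't think it can interpret English grammar in any simple way...
That said, good luck!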
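For the two points raised in the question, one possible sketch is the following. It assumes that "starts with a preposition" can be checked with a separate pattern before searching, and that the preposition list (On, At, For) is just the one given in the question. The function name proper_nouns_after_preposition is hypothetical. The optional second word is greedy, so the longer two-word match is preferred whenever it is available.

```python
import re

# Point (2): only consider strings that begin with one of these prepositions.
PREPOSITION = re.compile(r'^(?:On|At|For)\b', re.IGNORECASE)
# Point (1): the optional group is greedy, so "Shift Series" wins over "Shift".
PROPER_NOUN = re.compile(r'[A-Z][a-z]{3,15}(?:\s[A-Z][a-z]{3,15})?')

def proper_nouns_after_preposition(s):
    """Return proper-noun candidates, but only if s starts with a preposition."""
    m = PREPOSITION.match(s)
    if m is None:
        return []
    # Start searching after the preposition so it is never returned itself.
    return PROPER_NOUN.findall(s, m.end())

print(proper_nouns_after_preposition('On its 25th anniversary, Ashoka'))
# ['Ashoka']
print(proper_nouns_after_preposition(
    'at the Shift Series national conference, '
    'Compass Partners and fashion designer Kenneth Cole'))
# ['Shift Series', 'Compass Partners', 'Kenneth Cole']
```

This still treats any capitalized word of the right length as a proper noun, so it inherits the limitations described above.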
1
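What you are trying to do is known in natural language processing as "named entity recognition". If you really want to find proper nouns, you will probably need to consider a named entity recognition approach. Fortunately, the nltk library has some easy-to-use functions that can help: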
import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)
Result:
res.productions()
Out[8]:
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]
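If all you need is the list of names as strings, a simpler heuristic than walking the chunk tree is to join each run of consecutive NNP/NNPS tokens. The sketch below is not part of nltk: proper_noun_phrases is a hypothetical helper, and the (word, tag) pairs are copied by hand from the productions shown above, so it runs without nltk installed. In practice you would pass it the result of nltk.pos_tag.

```python
from itertools import groupby

# (word, tag) pairs copied by hand from the productions shown above.
tagged = [
    ('at', 'IN'), ('the', 'DT'), ('Shift', 'NNP'), ('Series', 'NNP'),
    ('national', 'JJ'), ('conference', 'NN'), (',', ','),
    ('Compass', 'NNP'), ('Partners', 'NNPS'), ('and', 'CC'),
    ('fashion', 'NN'), ('designer', 'NN'), ('Kenneth', 'NNP'), ('Cole', 'NNP'),
]

def proper_noun_phrases(tagged_tokens):
    """Join each run of consecutive NNP/NNPS tokens into one phrase."""
    phrases = []
    is_proper_tag = lambda pair: pair[1] in ('NNP', 'NNPS')
    for is_proper, run in groupby(tagged_tokens, key=is_proper_tag):
        if is_proper:
            phrases.append(' '.join(word for word, _tag in run))
    return phrases

print(proper_noun_phrases(tagged))
# ['Shift Series', 'Compass Partners', 'Kenneth Cole']
```

Unlike ne_chunk, this does not label the phrases as ORGANIZATION or PERSON, and it will merge two distinct names that happen to sit next to each other.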