Capturing one-word and two-word proper nouns with a regular expression

Asked 2025-04-17 22:53

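I came up with the pattern below. I've narrowed the problem down to not being able to capture one-word and two-word proper nouns at the same time.

(1) It would be great if I could set a condition so that, when there are two alternatives, the longer one is chosen by default.

And

(2) It would also help if I could tell the regex to consider this only when the string starts with a preposition such as "On", "At" or "For". I tried something like the following, but it didn't work: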

(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})

How can I do (1) and (2)?

My current regex is

r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'

The names I want to capture are: Ashoka, Shift Series, Compass Partners and Kenneth Cole.

#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',

3 Answers

1

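I'd suggest using a natural language processing tool; the most popular one for Python seems to be nltk. Regular expressions really aren't a good way to approach this problem... There is an example on the front page of the nltk site, which an earlier answer also mentioned. Here it is, copied and pasted: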

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> entities = nltk.chunk.ne_chunk(tagged)

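Now entities contains the words tagged with Penn Treebank part-of-speech tags, with the recognized named entities grouped into labeled chunks.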

1

This isn't entirely correct, but it matches most of what you want. The one false positive is On.

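import re
text = """
#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
# One capitalized word, optionally followed by a second capitalized word
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
# findall returns one tuple per match because the pattern has two capturing groups
matches = p.findall(text)

print(matches)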

Output:

[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]
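As a side note, each match comes back as a tuple because the pattern contains two capturing groups. A minimal sketch of a variant that returns plain strings is shown here; the names flat_regex and flat_matches are made up for illustration, and the sample text is the one from the question. The optional second word is greedy, so the two-word form is tried before the one-word form, which is roughly what point (1) of the question asks for.

```python
import re

text = (
    "#'On its 25th anniversary, Ashoka',\n"
    "#'at the Shift Series national conference, Compass Partners "
    "and fashion designer Kenneth Cole',\n"
)

# (?:...) groups without capturing, so findall returns the whole match as a string.
# The trailing ? is greedy: the regex tries to include a second capitalized word
# first and only falls back to the single word if that fails.
flat_regex = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')
flat_matches = flat_regex.findall(text)

print(flat_matches)
# ['On', 'Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
```

The false positive On is still there, so some filtering is needed either way.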

You can then implement a filter to clean up this list.

def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an","on","in","foo","bar"] #etc
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches

Or, because Python is cool:

def filter_false_positive(unfiltered_matches):
    black_list = ["an","on","in","foo","bar"] #etc
    return [match for match in unfiltered_matches if match.lower() not in black_list]

You can use it like this:

# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]
matches = filter_false_positive(matches)
print(matches)

The final output is:

['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']

Deciding whether a word is capitalized because it starts a sentence or because it is a proper noun is not a simple problem.

'Kenneth Cole is a brand name.' vs. 'Can I eat something now?' vs. 'An English man had tea'

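That makes the problem quite tricky here: without some other criterion for recognizing proper nouns, such as a blacklist or a database, it isn't easy. Regex is powerful, but I don't think it can understand English grammar in any simple way...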
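One extra criterion is the one the question itself proposes in point (2): only look for proper nouns when the line starts with a preposition. A rough sketch of that idea follows. The names LEADING_PREPOSITION, PROPER_NOUN and proper_nouns_after_preposition are hypothetical, the preposition list is just the three words from the question, and the leading \W* is an assumption that lets the pattern skip the #' prefix in the sample lines.

```python
import re

# Hypothetical helper patterns; the preposition list is only the question's examples.
LEADING_PREPOSITION = re.compile(r"^\W*(?:on|at|for)\b", re.IGNORECASE)
PROPER_NOUN = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)?")

def proper_nouns_after_preposition(line):
    """Return capitalized one- or two-word names, but only for lines
    that begin with one of the listed prepositions."""
    lead = LEADING_PREPOSITION.match(line)
    if lead is None:
        return []
    # Start searching after the preposition so it is never reported itself.
    return PROPER_NOUN.findall(line, lead.end())

print(proper_nouns_after_preposition("#'On its 25th anniversary, Ashoka',"))
print(proper_nouns_after_preposition("Kenneth Cole is a brand name."))
```

This sidesteps the blacklist for the leading word, but it is still only a heuristic: any capitalized word later in the line is reported, whether or not it is really a proper noun.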

That said, good luck!

1

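What you are trying to do is known in natural language processing as "named entity recognition". If you really want to find proper nouns, you should probably look at named entity recognition approaches. Fortunately, nltk has some easy-to-use functions that can help: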

import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)

Result:

res.productions()
Out[8]: 
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
 ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
 ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
 PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]
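
To get from that tree to plain strings, one option is to walk its subtrees and join the words of each labeled chunk. The sketch below is only an illustration: entity_strings is a made-up helper name, and the tree is rebuilt by hand from the output above so that the example runs without downloading the tagger and chunker models. In real use you would pass res directly.

```python
from nltk import Tree

def entity_strings(chunked):
    """Collect (label, text) pairs for every named-entity chunk in the tree."""
    entities = []
    for subtree in chunked.subtrees():
        if subtree.label() == "S":
            continue  # skip the root sentence node
        # Leaves are (word, tag) tuples; keep only the words.
        words = [word for word, tag in subtree.leaves()]
        entities.append((subtree.label(), " ".join(words)))
    return entities

# Hand-built copy of the tree shown above (assumption: same shape as res).
res = Tree("S", [
    ("at", "IN"), ("the", "DT"),
    Tree("ORGANIZATION", [("Shift", "NNP"), ("Series", "NNP")]),
    ("national", "JJ"), ("conference", "NN"), (",", ","),
    Tree("ORGANIZATION", [("Compass", "NNP"), ("Partners", "NNPS")]),
    ("and", "CC"), ("fashion", "NN"), ("designer", "NN"),
    Tree("PERSON", [("Kenneth", "NNP"), ("Cole", "NNP")]),
])

print(entity_strings(res))
```

The labels and boundaries depend on the chunker model, so results on other sentences may differ from what a regex would give.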
