Capturing one-word and two-word proper nouns with a regular expression

Asked 2025-04-17 22:53 by 数据工程师

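I came up with the following. I have narrowed the problem down to this: I can't capture one-word and two-word proper nouns at the same time.

(1) It would be great if I could set a condition so that, when two matches are possible, the longer one is chosen by default.

And

(2) It would also help if I could tell the regex to consider a string only when it starts with a preposition such as "On", "At" or "For". I tried something like this, but it didn't work: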

(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})

How can I do (1) and (2)?

My current regex is

r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'

What I want to capture: Ashoka, Shift Series, Compass Partners and Kenneth Cole.

#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',

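3 Answers

I'd suggest using a natural language processing tool; the most popular one in Python seems to be nltk. Regular expressions really aren't a good way to handle this problem... There is an example on the front page of the nltk website, which an earlier answer also mentioned. Here it is, copied and pasted: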

import nltk
# Requires the NLTK tokenizer, tagger and chunker data (see nltk.download()).
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
tagged = nltk.pos_tag(tokens)          # part-of-speech tags
entities = nltk.chunk.ne_chunk(tagged) # named-entity chunks

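Now entities is a tree of the tokens, each tagged with its Penn Treebank part-of-speech tag, with the recognized named entities grouped into labeled chunks.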

Answered 2025-04-17 by Python大师

This isn't quite right, but it matches most of what you want. The one exception is "On", which it picks up as well.

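import re
text = """
#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
# The optional second word is matched greedily, so a two-word name wins when one is present.
proper_noun_regex = r'([A-Z][a-z]+(\s[A-Z][a-z]+)?)'
p = re.compile(proper_noun_regex)
# With two capturing groups, findall() returns (full match, second word) tuples.
matches = p.findall(text)

print(matches)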

Output:

[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]

You can then implement a filter to clean up this list.

def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an","on","in","foo","bar"] #etc
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches

Or, because Python is cool:

def filter_false_positive(unfiltered_matches):
    black_list = ["an","on","in","foo","bar"] #etc
    return [match for match in unfiltered_matches if match.lower() not in black_list]

You can use it like this:

# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]
matches = filter_false_positive(matches)
print(matches)

The final output is:

['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
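The tuples in the earlier output come from the nested capturing group. A minimal sketch of a more compact variant, assuming the same pattern and a stop list of your own: a non-capturing group `(?:...)` makes `findall()` return plain strings, and a set gives a fast membership test. The names `find_proper_nouns` and `STOP_WORDS` are made up for illustration.

```python
import re

# Non-capturing group: findall() returns whole matches as strings.
# The optional second word is greedy, so the longer (two-word) form is
# taken whenever a second capitalized word follows.
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')

# Hypothetical stop list of capitalized non-names; extend it for your data.
STOP_WORDS = {"an", "on", "in", "at", "for"}

def find_proper_nouns(text):
    """Return capitalized one- or two-word candidates not in STOP_WORDS."""
    return [m for m in PROPER_NOUN.findall(text) if m.lower() not in STOP_WORDS]

samples = [
    "On its 25th anniversary, Ashoka",
    "at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole",
]
for s in samples:
    print(find_proper_nouns(s))
# ['Ashoka']
# ['Shift Series', 'Compass Partners', 'Kenneth Cole']
```

Note that a string such as "On Monday" would still come back as a single two-word match, because the stop list is only compared against the whole match.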

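Deciding whether a word is capitalized because it starts a sentence or because it is a proper noun is not a simple problem.

'Kenneth Cole is a brand name.' vs. 'Can I eat something now?' vs. 'An English man had tea'

In cases like these the problem is quite tricky, so without some other criterion for identifying proper nouns, such as a blacklist or a database, it won't be easy. Regex is powerful, but I don't think it can understand English grammar in any simple way...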
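One extra criterion is the one the question itself proposes in (2): only look at strings that begin with a preposition. Here is a rough sketch of that idea, not a general solution. The preposition list and the name `proper_nouns_after_preposition` are assumptions for illustration. The preposition is matched at the start of the string and the search continues after it, so the preposition itself is never captured.

```python
import re

# Hypothetical list of leading prepositions, matched case-insensitively.
# \b stops "For" from matching the start of a word like "Fortune".
LEADING_PREP = re.compile(r'(?:on|at|for)\b', re.IGNORECASE)
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')

def proper_nouns_after_preposition(line):
    """Return proper-noun candidates only if `line` starts with a preposition."""
    m = LEADING_PREP.match(line)  # match() anchors at the start of the string
    if m is None:
        return []
    # Search only the text that follows the preposition.
    return PROPER_NOUN.findall(line[m.end():])

print(proper_nouns_after_preposition("On its 25th anniversary, Ashoka"))
# ['Ashoka']
print(proper_nouns_after_preposition("Kenneth Cole is a brand name."))
# []
```

This still treats every capitalized word after the preposition as a candidate, so it narrows the input rather than solving the sentence-initial ambiguity.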

That said, good luck!

Answered 2025-04-17 by Python大师

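What you are trying to do is known in natural language processing as "named entity recognition". If you really want to find proper nouns, you should consider a named entity recognition approach. Fortunately, the nltk library has some easy-to-use functions that can help: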

import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)

Result:

res.productions()
Out[8]: 
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
 ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
 ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
 PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]

Answered 2025-04-17 by Python大师
