Regular expression to extract nouns from text

Asked 2025-04-18 07:10

I'm having some trouble with regular expressions in Python.
I have a piece of part-of-speech-tagged text in the following format:

('play', 'NN')|('2', 'CD')|('dvd', 'NN')|('2', 'CD')|('for', 'IN')|('instance', 'NN')|('i', 'PRP')|('made', 'VBD')|('several', 'JJ')|('back', 'NN')|('ups', 'NNS')|('of', 'IN')|('my', 'PRP$')|('dvd', 'NN')|('movies', 'NNS')|('using', 'VBG')|('dvd', 'NN')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('it', 'PRP')|('plays', 'VBZ')|('the', 'DT')|('dvds', 'NNS')

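What I want to do is extract all the nouns from this text, with any adjacent nouns (no other words in between) placed together in the same string. Every tag that starts with NN is a noun. Here is the regular expression I wrote: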

re.compile(r"(\|?\([\'|\"][\w]+[\'|\"]\, \'NN\w?\'\)\|?)+")

I've only just started writing regular expressions, so the expression may be a bit messy. This is the output it produces:

["('play', 'NN')|", "|('dvd', 'NN')|", "|('instance', 'NN')|", "('ups', 'NNS')|", "('movies', 'NNS')|", "('w', 'NN')|", "('w', 'NN')|"]

I'd like words such as 'back ups' and 'dvd movies' to appear together in the output when they appear together in the text.

What am I doing wrong? Can anyone give me some advice?

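Tags: regular expressions, string processing, natural language processing, text analysis, part-of-speech tagging, language models, noun extraction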

3 Answers

Can you do this without regular expressions? Aren't you really just trying to parse some text?

Updated thanks to mgilson's comment.

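import ast

# pos_tagged_words is the '|'-separated tagged string from the question
nouns = []
for word_and_tag in pos_tagged_words.split("|"):
    # each piece is a Python tuple literal such as ('play', 'NN')
    word, tag = ast.literal_eval(word_and_tag)
    if tag.startswith("NN"):
        # collect every noun; adjacent nouns are not merged here
        nouns.append(word)

# use nouns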

Answered 2025-04-18 by Python大师


You can do some neat things with itertools. Assuming you can reliably split the words on |:

import ast
import itertools

def word_yielder(word_str):
    tuples = (ast.literal_eval(t) for t in word_str.split('|'))
    for key, group in itertools.groupby(tuples, key=lambda t: t[1].startswith('NN')):
        if key:  # Have a group of nouns, join them together.
            yield (' '.join(t[0] for t in group), 'NN')
        else:  # Have a group of non-nouns
            for t in group:  # python3.x -- yield from :-)
                yield t
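
As a usage illustration, a trimmed-down variant of the same groupby idea that keeps only the noun runs might look like the sketch below. The function name noun_groups and the shortened sample string are made up for this example, and it assumes every '|'-separated piece is a valid Python tuple literal.

```python
import ast
import itertools

def noun_groups(word_str):
    """Yield each run of adjacent nouns as one space-joined string."""
    # Assumption: every piece looks like ('word', 'TAG').
    pairs = (ast.literal_eval(piece) for piece in word_str.split("|"))
    # groupby splits the stream into consecutive runs that share the same key,
    # here "is this tag a noun tag?".
    for is_noun, group in itertools.groupby(pairs, key=lambda p: p[1].startswith("NN")):
        if is_noun:
            yield " ".join(word for word, _tag in group)

# Shortened, hypothetical sample taken from the question's data.
sample = ("('several', 'JJ')|('back', 'NN')|('ups', 'NNS')|('of', 'IN')|"
          "('my', 'PRP$')|('dvd', 'NN')|('movies', 'NNS')")
print(list(noun_groups(sample)))  # ['back ups', 'dvd movies']
```

Because groupby only merges consecutive items, two noun runs separated by any non-noun stay separate, which is the behaviour the question asks for.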

Answered 2025-04-18 by Python大师


Here is a solution using pyparsing:

from pyparsing import *

LPAR,RPAR,COMMA,VERT,QUOT = map(Suppress,"(),|'")
nountype = Combine(QUOT + "NN" + Optional(Word(alphas)) + QUOT)

nounspec = LPAR + quotedString.setParseAction(removeQuotes) + COMMA + nountype + RPAR

# match all nounspec's that have one or more separated by '|'s
noungroup = delimitedList(nounspec, delim=VERT)

# join the nouns, and return a new tuple when a nounspec list is found
noungroup.setParseAction(lambda tokens: (' '.join(tokens[0::2]), tokens[1]) )

# parse sample text
sample = """('play', 'NN')|('2', 'CD')|('dvd', 'NN')|('2', 'CD')|('for', 'IN')|('instance', 'NN')|('i', 'PRP')|('made', 'VBD')|('several', 'JJ')|('back', 'NN')|('ups', 'NNS')|('of', 'IN')|('my', 'PRP$')|('dvd', 'NN')|('movies', 'NNS')|('using', 'VBG')|('dvd', 'NN')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('it', 'PRP')|('plays', 'VBZ')|('the', 'DT')|('dvds', 'NNS')"""
print(sum(noungroup.searchString(sample)).asList())

The output is:

[('play', 'NN'), ('dvd', 'NN'), ('instance', 'NN'), ('back ups', 'NN'), ('dvd movies', 'NN'), ('dvd r w', 'NN'), ('r w', 'NN'), ('dvds', 'NNS')]
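
For readers who would rather stay with plain re, as in the question, here is a minimal sketch. It relies on two documented behaviours: re.findall returns the contents of the capturing group when the pattern contains one, and a repeated group keeps only its last repetition. That would explain why only 'ups' and 'movies' showed up in the question's output. The names NOUN, NOUN_RUN, WORD and noun_phrases are invented for this example, and the pattern assumes each word is made of \w characters inside single quotes.

```python
import re

# One tagged noun such as ('ups', 'NNS'). NN\w* also covers NNP and NNPS,
# which the question's NN\w? would miss for the four-letter tag.
NOUN = r"\('\w+', 'NN\w*'\)"
# A run of nouns separated by single '|' characters. The group is
# non-capturing, so findall returns the whole match, not the last repetition.
NOUN_RUN = re.compile(NOUN + r"(?:\|" + NOUN + r")*")
# Pulls the bare word out of each tagged noun inside a run.
WORD = re.compile(r"\('(\w+)', 'NN\w*'\)")

def noun_phrases(tagged):
    """Return each run of adjacent nouns as one space-joined string."""
    return [" ".join(WORD.findall(run)) for run in NOUN_RUN.findall(tagged)]

# Shortened, hypothetical sample taken from the question's data.
sample = ("('back', 'NN')|('ups', 'NNS')|('of', 'IN')|('my', 'PRP$')|"
          "('dvd', 'NN')|('movies', 'NNS')|('using', 'VBG')")
print(noun_phrases(sample))  # ['back ups', 'dvd movies']
```

Splitting the work into "find a run" and "pull the words out of the run" keeps each pattern short; the trade-off against the parsing approaches above is that the regex silently skips any token whose word contains characters outside \w.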

Answered 2025-04-18 by Python大师

