Extracting information from a string using regular expressions

3 Answers

User

#1 · edited 2024-05-29 11:33:16

Tim Pietzcker's solution can be simplified to the following (note that the patterns have been modified as well):

import re
credits = """   Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

pairs = []
for character in splitre.split(credits):
    gr = matchre.match(character).groups('')
    for part in splitparts.split(gr[1]):
        pairs.append((gr[0], part))

print(pairs)

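This prints:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]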

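The trick is to call groups('') with '' as the default argument, so that an optional group that did not match yields an empty string instead of None.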
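As a minimal sketch of why that default matters (the pattern and sample strings here are illustrative only, not taken from the answer), compare groups() with groups('') when the optional parenthesised group does not take part in the match:

```python
import re

# Illustrative pattern: a name, optionally followed by "(role)".
pattern = re.compile(r"(\w+)(?:\s*\((\w+)\))?")

with_role = pattern.match("Root (Jim)")
without_role = pattern.match("Root")

# When the optional group matched, the default is irrelevant.
both_present = with_role.groups()

# Without a default, a group that did not participate is reported as None.
no_default = without_role.groups()

# With '' as the default, the missing group becomes an empty string,
# so it can be handed straight to a regex split() without a None check.
empty_default = without_role.groups('')

print(both_present, no_default, empty_default)
```

This is what lets the loop above call splitparts.split(gr[1]) unconditionally: splitting '' gives [''], so an actor without a listed role still produces one pair.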

User

#2 · edited 2024-05-29 11:33:16

import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

characters = splitre.split(credits)
pairs = []
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))

print(pairs)

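Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]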
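The piece of splitre that decides whether a comma is a separator is the negative lookahead `(?![^()]*\))`: a comma is rejected if a closing parenthesis can be reached from it without crossing another parenthesis. The sketch below isolates that idea on a made-up sample string; the name outside_comma is an assumption for illustration, and the approach presumes parentheses are balanced and not nested.

```python
import re

# Split on a comma only when it is NOT followed by "...)" with no
# parenthesis in between, i.e. when the comma is outside "(...)".
outside_comma = re.compile(r",(?![^()]*\))\s*")

sample = "A (x, y), B (z), C"
parts = outside_comma.split(sample)

# The comma between x and y sits inside parentheses and is left alone.
print(parts)
```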

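User

#3 · edited 2024-05-29 11:33:16

What you want is to identify sequences of words that start with a capital letter, plus some complications: you cannot assume every name has the form Name Surname. It could also be Name Surname Jr., Name M. Surname, or some other localized variant such as Jean-Claude van Damme or Louis da Silva.

This is probably overkill for the sample input you posted, but since I expect things to get messy quickly, I would tackle this with nltk.

Here is a very rough, not well-tested snippet, but it should do the job: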

import nltk
from nltk.chunk.regexp import RegexpParser

_patterns = [
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'),  # proper nouns
    (r'^[(]$', 'O'),
    (r'[,]', 'COMMA'),
    (r'^[)]$', 'C'),
    (r'.+', 'NN')                                   # nouns (default)
]

_grammar = """
        NAME: {<NNP> <COMMA> <NNP>}
        NAME: {<NNP>+}
        ROLE: {<O> <NAME>+ <C>}
        """    
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"
tagger = nltk.RegexpTagger(_patterns)    
chunker = RegexpParser(_grammar)
text = text.replace('(', '( ').replace(')', ' )').replace(',', ' , ')
tokens = text.split()
tagged_text = tagger.tag(tokens)
tree = chunker.parse(tagged_text)

for n in tree:
    # NLTK 3 exposes the chunk label through label(); older releases used .node
    if isinstance(n, nltk.Tree) and n.label() in ('ROLE', 'NAME'):
        print(n)

# output is:
# (NAME Will/NNP Ferrell/NNP)
# (ROLE (/O (NAME Nick/NNP Halsey/NNP) )/C)
# (NAME Rebecca/NNP Hall/NNP)
# (ROLE (/O (NAME Samantha/NNP) )/C)
# (NAME Glenn/NNP Howerton/NNP)
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP) )/C)
# (NAME Stephen/NNP Root/NNP)
# (NAME Laura/NNP Dern/NNP)
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP) )/C)

You then have to process the tagged output and put the names and roles into a list instead of printing them, but you get the picture.
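One way to do that post-processing is sketched below. To stay self-contained it does not call nltk: it assumes the chunks have already been flattened into (label, tokens) tuples in document order, with the parenthesis tokens dropped and commas kept. Both that layout and the function name pair_names_with_roles are assumptions made for illustration, not part of the answer.

```python
def pair_names_with_roles(chunks):
    """Pair each NAME chunk with the roles in the ROLE chunk that follows it.

    `chunks` is an assumed, simplified layout: a list of (label, tokens)
    tuples in document order, with "(" and ")" already removed.
    """
    pairs = []
    pending = None  # actor seen but not yet paired with a role
    for label, tokens in chunks:
        if label == 'NAME':
            if pending is not None:
                pairs.append((pending, ''))  # previous actor had no role
            pending = ' '.join(tokens)
        elif label == 'ROLE' and pending is not None:
            # A comma token separates several roles played by one actor.
            roles = ' '.join(tokens).split(' , ')
            pairs.extend((pending, role) for role in roles)
            pending = None
    if pending is not None:
        pairs.append((pending, ''))
    return pairs


chunks = [
    ('NAME', ['Glenn', 'Howerton']),
    ('ROLE', ['Gary', ',', 'Brad']),
    ('NAME', ['Stephen', 'Root']),
    ('NAME', ['Laura', 'Dern']),
    ('ROLE', ['Delilah', ',', 'Stacy']),
]
pairs = pair_names_with_roles(chunks)
print(pairs)
```

An actor with no following ROLE chunk is paired with an empty string, which matches the convention used by the regex-based answers.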

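What we do here is a first pass that tags each token according to the regular expressions in _patterns, followed by a second pass that builds more complex chunks based on the simple grammar. You can make the grammar and the patterns as elaborate as you need, for example to catch name variations, messy input, abbreviations and so on.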
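The first pass can be imitated with the standard library alone. The sketch below is a hypothetical stand-in for the tagger, not nltk's implementation: the helper tag_tokens is an assumption, and it simply gives each token the tag of the first pattern that matches, using the same patterns as the snippet above on a shortened input.

```python
import re

# Same (pattern, tag) pairs as in the answer; order matters because
# the first matching pattern wins.
PATTERNS = [
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'),  # proper nouns
    (r'^[(]$', 'O'),
    (r'[,]', 'COMMA'),
    (r'^[)]$', 'C'),
    (r'.+', 'NN'),                                   # default
]
COMPILED = [(re.compile(p), tag) for p, tag in PATTERNS]


def tag_tokens(tokens):
    """Return (token, tag) pairs, tagging by the first matching pattern."""
    tagged = []
    for token in tokens:
        tag = next((t for rx, t in COMPILED if rx.match(token)), None)
        tagged.append((token, tag))
    return tagged


text = "Glenn Howerton (Gary, Brad), with Stephen Root"
# Pad punctuation with spaces so a plain split() yields separate tokens.
text = text.replace('(', '( ').replace(')', ' )').replace(',', ' , ')
tagged = tag_tokens(text.split())
print(tagged)
```

Words such as "with" fall through to the default NN tag, which is why they never end up inside a NAME or ROLE chunk in the second pass.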

I think doing this in a single regex pass is going to be painful for non-trivial input.

Otherwise, Tim's solution handles the input you posted nicely, and without the nltk dependency.
