Using a regular expression to get consecutive capitalized words

Posted 2024-06-06 21:55:35


My regular expression does not capture consecutive capitalized words the way I want. Here is what I would like the regex to capture:

"said Polly Pocket and the toys" -> Polly Pocket

This is the regular expression I am using:

re.findall('said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

It returns the following:

[('Polly Pocket', ' Pocket')]

I want it to return:

['Polly Pocket']

3 Answers

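This happens because, when a pattern contains capturing groups, findall returns the groups instead of the whole match, and your pattern has two of them: the outer group, which captures the full matched text, and the inner group, which captures the last of the following words.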

To fix this, turn the second capturing group into a non-capturing group by writing (?:regex) instead of (regex):

re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)
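
To make the difference concrete, here is a minimal sketch that runs both patterns against the sentence from the question. The variable names `with_inner_group` and `without_inner_group` are only illustrative, and the `said ` prefix is kept from the question's pattern.

```python
import re

article = "said Polly Pocket and the toys"

# Two capturing groups: findall returns one tuple per match,
# with one entry per group.
with_inner_group = re.findall(r"said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)", article)

# Inner group made non-capturing with (?:...): only one capturing
# group remains, so findall returns plain strings.
without_inner_group = re.findall(r"said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)", article)

print(with_inner_group)     # [('Polly Pocket', ' Pocket')]
print(without_inner_group)  # ['Polly Pocket']
```

Raw strings (the `r` prefix) are used so that `\w` and `\s` reach the regex engine unchanged.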

Using a positive lookahead:

([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)

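The lookahead asserts that, for the current word to be accepted, it must be followed by another word that starts with a capital letter. Breakdown: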

(                # begin capture
  [A-Z]            # one uppercase letter  \ First Word
  [a-z]+           # 1+ lowercase letters  /
  (?=\s[A-Z])      # must have a space and uppercase letter following it
  (?:                # non-capturing group
    \s               # space
    [A-Z]            # one uppercase letter  \ Additional Word(s)
    [a-z]+           # 1+ lowercase letters  /
  )+               # group can be repeated (more words)
)                # end capture
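
The breakdown above maps fairly directly onto Python's `re.VERBOSE` mode, which allows whitespace and comments inside the pattern. The following is a sketch rather than a definitive version; the names `pattern` and `found` are only illustrative, and the sample sentence is the one from the question.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments
# can live inside the regex itself.
pattern = re.compile(
    r"""
    (                   # begin capture
      [A-Z][a-z]+       # first word: one uppercase, then 1+ lowercase letters
      (?=\s[A-Z])       # lookahead: whitespace and an uppercase letter must follow
      (?:               # non-capturing group
        \s[A-Z][a-z]+   # additional word(s)
      )+                # repeated for each further word
    )                   # end capture
    """,
    re.VERBOSE,
)

found = pattern.findall("said Polly Pocket and the toys")
print(found)  # ['Polly Pocket']
```

Note that `[a-z]+` only accepts plain lowercase ASCII letters after the initial capital, so names such as "McDonald" or hyphenated words would not match in full; the `[\w-]*` form from the first answer is more permissive.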

A Perl version. Because the trailing group is repeated with * rather than +, it also matches single capitalized words:

$mystring = "the United States of America has many big cities like New York and Los Angeles, and others like Atlanta";

@phrases = $mystring =~ /[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g;

print "\n" . join(", ", @phrases) . "\n\n# phrases = " . scalar(@phrases) . "\n\n";

Output:

$ ./try_me.pl

United States, America, New York, Los Angeles, Atlanta

# phrases = 5
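
For readers working in Python, as in the original question, a rough equivalent of the Perl snippet could look like the sketch below. It assumes the same sample sentence; `mystring` and `phrases` simply mirror the Perl variable names.

```python
import re

mystring = (
    "the United States of America has many big cities like "
    "New York and Los Angeles, and others like Atlanta"
)

# No capturing groups here, so findall returns the full matches.
# The trailing (?:...)* makes the extra words optional, so single
# capitalized words are matched too.
phrases = re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", mystring)

print(", ".join(phrases))
print("# phrases =", len(phrases))
```

Changing the final `*` to `+` would restrict the result to runs of two or more capitalized words, which is what the question asked for.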
