re.findall('(ab|cd)', string) vs. re.findall('(ab|cd)+', string)

Posted 2024-05-14 14:01:42


I ran into this odd behavior with Python regular expressions. Could you explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?

import re

string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)

The actual output is:

['ab', 'cd']
['cd']

I don't understand why the second result doesn't contain 'ab' as well.


3 Answers

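So, what confused me was this line from the docs for re.findall:

If one or more groups are present in the pattern, return a list of groups;

So what it returns is not the whole match but what the group captured. If you make the group non-capturing (re.findall('(?:ab|cd)+', string)), it returns ["abcd"], which is what I initially expected.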
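A minimal sketch of that difference, using the string from the question. The variable names captured and whole are only illustrative:

```python
import re

string = 'abcdla'

# One capturing group in the pattern: findall returns what the group
# captured for each match, not the full text of the match.
captured = re.findall(r'(ab|cd)+', string)

# Non-capturing group: the pattern has no groups, so findall returns
# the full text of each match.
whole = re.findall(r'(?:ab|cd)+', string)

print(captured)  # ['cd']
print(whole)     # ['abcd']
```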

I don't know whether this will make things clearer, but let's try to picture what happens under the hood in a simple way. We'll use match to see what is going on:

   import re

   # group(0) returns the whole matched string. Captured groups are available via
   # groups(), or one at a time via group(1), group(2), ... Here there is only one
   # group, and a group holds only one captured substring at a time.
   string = 'abcdla'
   print(re.match('(ab|cd)', string).group(0))   # 'ab': only 'ab' is matched, and the group captures 'ab'
   print(re.match('(ab|cd)+', string).group(0))  # 'abcd': the whole match; the group keeps only 'cd', the last iteration
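The snippet above prints only group(0). To see the captured part next to it, one could also read group(1) from the same match object. This is a small sketch; the name m is just a placeholder:

```python
import re

string = 'abcdla'
m = re.match(r'(ab|cd)+', string)

whole_match = m.group(0)   # everything the pattern matched
last_capture = m.group(1)  # what group 1 holds after its final repetition

print(whole_match)   # abcd
print(last_capture)  # cd
print(m.groups())    # ('cd',)
```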

findall matches and consumes the string as it goes. Let's picture what happens with the regex '(ab|cd)':

      'abcdabla'  -> 1:   match: 'ab' |  capture : 'ab'  | left to process:  'cdabla'
      'cdabla'    -> 2:   match: 'cd' |  capture : 'cd'  | left to process:  'abla'
      'abla'      -> 3:   match: 'ab' |  capture : 'ab'  | left to process:  'la'
      'la'        -> 4:   no match    |  capture : none  | left to process:  ''

       -> final result :   ['ab', 'cd', 'ab']

Now the same thing for '(ab|cd)+':

      'abcdabla'  -> 1:   match: 'abcdab' |  capture : 'ab' (last iteration)  | left to process:  'la'
      'la'        -> 2:   no match        |  capture : none                   | left to process:  ''

       -> final result :   ['ab']
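The two traces above can be reproduced with re.finditer, which yields a match object per step instead of only the captured text. The helper name trace is made up for this sketch:

```python
import re

def trace(pattern, text):
    """Return (whole match, group 1) for every non-overlapping match."""
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]

steps_plain = trace(r'(ab|cd)', 'abcdabla')
steps_plus = trace(r'(ab|cd)+', 'abcdabla')

print(steps_plain)  # [('ab', 'ab'), ('cd', 'cd'), ('ab', 'ab')]
print(steps_plus)   # [('abcdab', 'ab')]
```

findall keeps only the second element of each pair, which is why the results are ['ab', 'cd', 'ab'] and ['ab'].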

I hope this clears things up a bit.

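+ is a repetition quantifier that matches one or more times. In the regex (ab|cd)+ you are repeating the capturing group (ab|cd) with +. A repeated capturing group keeps only its last iteration.

You can reason about this behavior as follows:

Say your string is abcdla and the regex is (ab|cd)+. The regex engine finds a match for the group at positions 0 and 1, namely ab, and exits the capturing group. It then sees the + quantifier, so it tries the capturing group again and captures cd at positions 2 and 3, overwriting the earlier ab. The next attempt fails at la, so the match ends with cd stored in the group.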
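The walk-through above can be imitated by hand. This is only a rough emulation of what + does with the group (it ignores backtracking and is not how the re engine is implemented); the names alternative and iterations are hypothetical:

```python
import re

string = 'abcdla'
alternative = re.compile(r'ab|cd')  # the body of the group, without the +

iterations = []  # (start, end, text) for each repetition of the group
pos = 0
m = alternative.match(string, pos)
while m is not None:
    iterations.append((m.start(), m.end(), m.group(0)))
    pos = m.end()                        # continue right after this iteration
    m = alternative.match(string, pos)   # try the group once more

last_iteration = iterations[-1][2]  # the only value a repeated group retains

print(iterations)      # [(0, 2, 'ab'), (2, 4, 'cd')]
print(last_iteration)  # cd
```

Note that the end offsets here are exclusive, so ab spans indices 0 and 1 and cd spans indices 2 and 3.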


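If you want to capture all the iterations, capture the repeated group itself with ((ab|cd)+), whose groups capture abcd and cd. Since we don't care about what the inner group matches, you can make it non-capturing: ((?:ab|cd)+) captures abcd.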
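A quick check of both variants with findall on the string from the question. The variable names are illustrative only:

```python
import re

string = 'abcdla'

# Two capturing groups: findall returns one tuple per match,
# (outer group = the whole repetition, inner group = last iteration).
both = re.findall(r'((ab|cd)+)', string)

# Inner group made non-capturing: only the outer group remains,
# so findall returns plain strings again.
outer_only = re.findall(r'((?:ab|cd)+)', string)

print(both)        # [('abcd', 'cd')]
print(outer_only)  # ['abcd']
```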

https://www.regular-expressions.info/captureall.html

From that page:

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
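The tag example from the quote can be tried in Python as well. The second pattern below applies the same fix described earlier in this answer (capture the repetition, make the inner group non-capturing); the variable names are assumptions made for this sketch:

```python
import re

tags = ['!abc123!', '!123abcabc!']

# Repeated capturing group: group 1 holds only the last abc/123.
last_only = [re.match(r'!(abc|123)+!', t).group(1) for t in tags]

# Capturing group around the repetition: group 1 holds the full label.
full_label = [re.match(r'!((?:abc|123)+)!', t).group(1) for t in tags]

print(last_only)   # ['123', 'abc']
print(full_label)  # ['abc123', '123abcabc']
```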
