Different behavior when using re.finditer and re.match

Asked 2025-04-16 09:42

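I am writing a regular expression to extract some values from a web page. I use re.match in a condition, but it evaluates to false, whereas if I use finditer the condition evaluates to true and the code inside it runs. I tested the regex in a tester I built myself and it works there, but not in my script. Here is my sample script.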

import re

result = []
# Raw string so the backslashes reach the regex engine unchanged.
RE_Add0 = re.compile(r"\d{5}(?:(?:-| |)\d{4})?", re.IGNORECASE)
each = 'Expiration Date:\n05/31/1996\nBusiness Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n'
# This is the check that unexpectedly evaluates to false.
if RE_Add0.match(each):
    result0 = RE_Add0.match(each).group(0)
    print(result0)
    if len(result0) < 100:
        result.append(result0)
    else:
        print('Address ignore')
else:
    pass

Tags: regex, conditionals, re.finditer, re.match, web data extraction

3 Answers

Try this:

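import re

# A 5-digit ZIP code, optionally followed by a separator and a 4-digit extension,
# allowed only at the end of the string (trailing whitespace is tolerated).
postalCode = re.compile(r'((\d{5})([ -])?(\d{4})?(\s*))$')
# findall() returns one tuple of groups per match; index 1 is the 5-digit group.
primaryGroup = lambda x: x[1]

sampleStr = """
    Expiration Date:
    05/31/1996
    Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302  
"""
result = []

matches = postalCode.findall(sampleStr)
if matches:
    for match in matches:
        pc = primaryGroup(match)
        print(pc)
        result.append(pc)
else:
    print("No postal code found in this string")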

It returns '12345' for each of the following inputs:

12345\n
12345  \n
12345 6789\n
12345 6789    \n
12345 \n
12345     \n
12345-6789\n
12345-6789    \n
12345-\n
12345-    \n
123456789\n
123456789    \n
12345\n
12345    \n

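I anchored the pattern with $ so that it only matches at the end of the text (without re.MULTILINE, $ means the end of the whole string, which in your example is also the end of the address line). Without the anchor it would also match '23901' from the street address in your example.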
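If the text contains several address lines, one possible variation (a sketch, not the answerer's code) is to compile with re.MULTILINE so that $ matches at the end of every line. The sample text below is hypothetical, including the second address, and the pattern is a simplified form that captures only the 5-digit part.

```python
import re

# Hypothetical sample text; the second address is made up for illustration.
text = (
    "Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n"
    "Mailing Address: 100 MAIN STREET SPRINGFIELD, IL 62701-1234  \n"
)

# [ \t]* tolerates trailing blanks; with re.MULTILINE, $ also matches
# just before each newline, not only at the end of the whole string.
zip_at_line_end = re.compile(r"(\d{5})(?:[ -]?\d{4})?[ \t]*$", re.MULTILINE)

# With a single capturing group, findall() returns a list of strings.
zip_codes = zip_at_line_end.findall(text)
print(zip_codes)
```

Here the street number 23901 is skipped because it is not followed by the end of a line.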

Answered 2025-04-16 by Python大师

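re.match checks only the beginning of the string and returns at most one match. re.finditer, like re.search, scans through the whole string, and it always returns an iterator, even when nothing matches. Compare the two: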

>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x01057AA0>
>>> re.match('b', 'abc')
>>> re.finditer('a', 'abc')
<callable_iterator object at 0x0106AD30>
>>> re.finditer('b', 'abc')
<callable_iterator object at 0x0106EA10>

Side note: since you mention a page, I assume you are parsing HTML. If so, use BeautifulSoup or a similar HTML parser rather than regular expressions.

Answered 2025-04-16 by Python大师

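re.finditer() returns an iterator object even when there are no matches, so a test such as if RE_Add0.finditer(each) is always true. You have to actually iterate over the object to find out whether it contains any real matches.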
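A minimal sketch of that check, assuming you only care whether at least one match exists: pull the first item from the iterator with next() and a default of None. The names zip_re, no_zip and with_zip are illustrative, not from the question.

```python
import re

zip_re = re.compile(r"\d{5}(?:[ -]?\d{4})?")

no_zip = "no digits here"
with_zip = "CALABASAS, CA 91302"

# The iterator object itself is truthy whether or not anything matches.
iterator_is_truthy = bool(zip_re.finditer(no_zip))

# Taking the first item (or None) reveals whether there is a real match.
first_missing = next(zip_re.finditer(no_zip), None)
first_found = next(zip_re.finditer(with_zip), None)

print(iterator_is_truthy)
print(first_missing)
print(first_found.group(0))
```

If all matches are needed anyway, looping over the iterator and collecting results into a list works just as well; an empty list then signals no match.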

Next, re.match() only matches at the beginning of the string, whereas re.search() and re.finditer() can match anywhere in the string.
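Applied to the string from the question, the difference can be sketched as follows. The variable names at_start, anywhere and all_matches are illustrative.

```python
import re

zip_re = re.compile(r"\d{5}(?:[ -]?\d{4})?")
each = ("Expiration Date:\n05/31/1996\n"
        "Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n")

# match() tries the pattern only at position 0, where the text starts with "Expiration".
at_start = zip_re.match(each)

# search() scans forward and stops at the first place the pattern fits.
anywhere = zip_re.search(each)

# finditer() yields every non-overlapping match in the string.
all_matches = [m.group(0) for m in zip_re.finditer(each)]

print(at_start)
print(anywhere.group(0))
print(all_matches)
```

Note that the unanchored pattern also picks up the street number 23901, so some extra filtering or anchoring is still needed to isolate the ZIP code.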

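Third, the regular expression can be simplified to r"\d{5}(?:[ -]?\d{4})?" (the trailing ? keeps the four-digit extension optional, as in your original pattern).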
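As a quick sanity check, the sketch below compares the pattern from the question with a character-class form on a few inputs. It assumes the simplified pattern keeps the trailing ? so that the extension stays optional; the sample inputs are made up.

```python
import re

original = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")
simplified = re.compile(r"\d{5}(?:[ -]?\d{4})?")

# Made-up inputs covering each separator, no separator, and a non-match.
samples = ["91302", "91302-1234", "91302 1234", "913021234", "CA 91302", "2000"]

# Neither pattern has capturing groups, so findall() returns whole matches.
results = [(s, original.findall(s), simplified.findall(s)) for s in samples]
for s, a, b in results:
    print(repr(s), a, b)

same_everywhere = all(a == b for _, a, b in results)
print(same_everywhere)
```

Agreement on a handful of samples is not a proof, but here the two forms differ only in how the optional separator is spelled.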

Finally, always use raw strings for regular expressions.
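A small illustration of why this matters, using \b as the example: in a normal string literal Python turns \b into a backspace character before the regex engine ever sees it, while in a raw string it stays a backslash followed by b, which the engine reads as a word boundary. The variable names are illustrative.

```python
import re

plain = "\bcat\b"   # backspace + "cat" + backspace: 5 characters
raw = r"\bcat\b"    # backslash, b, "cat", backslash, b: 7 characters

print(len(plain), len(raw))

# The plain version looks for literal backspace characters and finds nothing.
plain_hit = re.search(plain, "a cat sat")
# The raw version uses word boundaries and finds the word.
raw_hit = re.search(raw, "a cat sat")

print(plain_hit)
print(raw_hit.group(0))
```

A pattern such as "\d{5}" happens to work without the r prefix because \d is not a recognized string escape, but newer Python versions emit a warning for it, so the raw form is the safer habit.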

Answered 2025-04-16 by Python大师
