Optional regex match fails in Python

Asked 2025-04-15 22:18

tickettypepat = (r'MIS Notes:.*(//p//)?.*')
retype = re.search(tickettypepat,line)
if retype:
  print retype.group(0)
  print retype.group(1)

Given this input:

MIS Notes: //p//

can anyone tell me why group(0) is

MIS Notes: //p//

while group(1) returns None?

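I originally reached for a regex because, before I ran into this problem, what I needed to match was considerably more complex than just //p//. Here is the complete code. I'm fairly new to this, so please forgive the newbie quality; I'm sure there are better ways to do much of it, and I'd appreciate anyone pointing them out. Apart from the regex being too greedy around //[pewPEW]//, though, everything else seems to work. Thanks for your help.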

The script processes the text and cleans up/converts a few things.

filename = (r'.\4-12_4-26.txt')
import re
import sys
#Clean up output from the web to ensure that you have one category per line
f = open(filename)
w = open('cleantext.txt','w')

origdatepat = (r'(Ticket Date: )([0-9]+/[0-9]+/[0-9]+),( [0-9]+:[0-9]+ [PA]M)')
tickettypepat = (r'MIS Notes:.*(//[pewPEW]//)?.*')

print 'Beginning Blank Line Removal'
for line in f:
    redate = re.search(origdatepat,line)
    retype = re.search(tickettypepat,line)
    if line == ' \n':
        line = ''
        print 'Removing blank Line'
#remove ',' from time and date line    
    elif redate:
        line = redate.group(1) + redate.group(2)+ redate.group(3)+'\n'
        print 'Redating... ' + line

    elif retype:
        print retype.group(0)
        print retype.group(1)
        
        if retype.group(1) == '//p//':
            line = line + 'Type: Phone\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//e//':
            line = line + 'Type: Email\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//w//':
            line = line + 'Type: Walk-in\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == ('' or None):
            line = line + 'Type: Ticket\n'
            print 'Setting type for... ' + line

    w.write(line)

print 'Closing Files'                 
f.close()
w.close()

Here is some sample input.

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: some random stuff //p// followed by more stuff
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //p//
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //e// stuff....
Key Words:  


Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes:
Key Words:

3 Answers

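The pattern is ambiguous for what you need. It is better to group by prefix or suffix; in the example here I chose to group by prefix. Basically, if //p// appears in the line, the prefix group is not None (though it may be an empty string). The suffix is everything after //p//, or everything after "MIS Notes: " if the marker is absent.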

import re
lines = ['MIS Notes: //p//',
    'MIS Notes: prefix//p//suffix']

tickettypepat = (r'MIS Notes: (?:(.*)//p//)?(.*)')
for line in lines:
    m = re.search(tickettypepat,line)
    print 'line:', line
    if m: print 'groups:', m.groups()
    else: print 'groups:', m

Result:

line: MIS Notes: //p//
groups: ('', '')
line: MIS Notes: prefix//p//suffix
groups: ('prefix', 'suffix')
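
To connect this back to the //[pewPEW]// markers in the question, here is a minimal sketch of the same prefix/suffix idea with one extra group for the marker letter. The name split_notes and the \s* after the colon are my own assumptions, not part of the answer above.

```python
import re

# Assumed variant of the pattern above: group 2 captures the marker letter,
# and \s* tolerates a missing space after the colon.
notes_pat = re.compile(r'MIS Notes:\s*(?:(.*)//([pewPEW])//)?(.*)')

def split_notes(line):
    """Return (prefix, marker, suffix), or None if the line is not a notes line."""
    m = notes_pat.search(line)
    if m is None:
        return None
    return m.groups()

print(split_notes('MIS Notes: prefix//p//suffix'))
print(split_notes('MIS Notes:'))
```

When the marker is absent, prefix and marker are both None, which is the case the question wants to treat as a plain ticket.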

Answered 2025-04-15 by Python大师

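Quantifiers in regular expressions are greedy by default, which means .* matches as many characters as it can, here everything up to the end of the string. That leaves nothing for the optional group that follows, and because the group is optional the engine has no reason to backtrack and give anything back, so group(1) is None. group(0) always returns the entire matched string.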
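
A quick way to see which part of the pattern consumed what is to wrap each .* in its own group. This is only an illustrative sketch; the extra groups are not in the original pattern.

```python
import re

line = 'MIS Notes: //p//'

# Same structure as the pattern in the question, with both .* parts
# captured so their share of the match is visible.
m = re.search(r'MIS Notes:(.*)(//p//)?(.*)', line)

print(m.group(0))   # the whole line
print(m.groups())   # the first .* took ' //p//'; the optional group is None
```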

Based on your comment, why do you want to use a regex at all? Wouldn't something like the following be enough:

if line.startswith('MIS Notes:'): # starts with that string
    data = line[len('MIS Notes:'):] # the rest is the interesting part
    if '//p//' in data:
        stuff, sep, rest = data.partition('//p//') # or something like that
    else:
        pass #other stuff
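
Extending that idea to all three markers from the question could look roughly like this. The TYPES table and the ticket_type name are hypothetical; the labels are the ones the question's script writes, and lowercasing the text is my assumption based on the [pewPEW] character class.

```python
# Hypothetical marker-to-label table, using the labels from the question.
TYPES = {'//p//': 'Phone', '//e//': 'Email', '//w//': 'Walk-in'}

def ticket_type(line):
    """Return a type label for a 'MIS Notes:' line, or None for any other line."""
    if not line.startswith('MIS Notes:'):
        return None
    # Lowercase so that //P// and //p// are treated alike.
    data = line[len('MIS Notes:'):].lower()
    for marker, label in TYPES.items():
        if marker in data:
            return label
    return 'Ticket'  # no marker found

print(ticket_type('MIS Notes: some random stuff //p// followed by more stuff'))
```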

Answered 2025-04-15 by Python大师

Here is how the regex MIS Notes:.*(//p//)?.* works, using "MIS Notes: //p//" as the input:

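MIS Notes: matches "MIS Notes:", no surprises here.
.* runs straight to the end of the string (match so far: "MIS Notes: //p//")
(//p//)? is optional. Nothing happens here, because nothing is left to match and the engine is not forced to backtrack.
.* has nothing left to match; we are already at the end of the string. Because the star allows zero repetitions, the engine stops and reports the entire string as the match. Sub-group one never participated, so group(1) is None.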

Now if you change the regex to MIS Notes:.*(//p//).* the behavior changes:

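MIS Notes: still matches "MIS Notes:", still no surprises.
.* runs straight to the end of the string (match so far: "MIS Notes: //p//")
(//p//) is required. The engine starts backtracking one character at a time to satisfy it. (match so far: "MIS Notes: ")
(//p//) can match. Sub-group one is saved and contains "//p//".
.* runs to the end of the string. Hint: if you are not interested in what it matches, this part is redundant and can be dropped.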
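
A small sketch of the mandatory-group variant, run against lines shaped like the sample input. One consequence worth noting: a notes line without the marker no longer matches at all, so the "no marker" case has to be handled by checking for None.

```python
import re

pat = r'MIS Notes:.*(//p//)'

with_marker = re.search(pat, 'MIS Notes: some random stuff //p// followed by more stuff')
without_marker = re.search(pat, 'MIS Notes: nothing special')

print(with_marker.group(1))   # //p//
print(without_marker)         # None: the whole search fails
```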

Now if you change the regex to MIS Notes:.*?//(p)// the behavior changes again:

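MIS Notes: still matches "MIS Notes:", still no surprises.
.*? is non-greedy: it consumes as little as possible and checks whether the rest of the pattern can match before taking another character (match so far: "MIS Notes: ")
//(p)// can match. Sub-group one is saved and contains "p".
Done. Note that the engine never ran to the end of the string and backtracked from there, which saves time.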
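
The lazy version can be checked the same way. The second search below is my own addition, to show that laziness alone does not rescue the optional group: a lazy .*? followed by an optional group is satisfied by matching nothing at all.

```python
import re

line = 'MIS Notes: some random stuff //p// followed by more stuff'

# Lazy quantifier with a mandatory marker: stops at the first //p//.
lazy = re.search(r'MIS Notes:.*?//(p)//', line)
print(lazy.group(0))   # MIS Notes: some random stuff //p//
print(lazy.group(1))   # p

# Lazy quantifier with an optional marker: both parts match nothing.
lazy_optional = re.search(r'MIS Notes:.*?(//p//)?', line)
print(lazy_optional.group(0))   # MIS Notes:
print(lazy_optional.group(1))   # None
```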

If you know there will never be a / before //p//, you can use MIS Notes:[^/]*//(p)//:

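MIS Notes: matches "MIS Notes:", you get the idea.
[^/]* fast-forwards to the first slash (this is faster than .*?)
//(p)// can match. Sub-group one is saved and contains "p".
Done. Note that no backtracking occurred, which saves time. This should be faster than the third pattern, the one using .*?.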
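
The "no / before //p//" assumption matters. As a rough illustration (the input with "and/or" is made up), a single stray slash is enough to make the negated-class pattern fail where the lazy one still matches:

```python
import re

strict_pat = r'MIS Notes:[^/]*//(p)//'
lazy_pat = r'MIS Notes:.*?//(p)//'

clean = 'MIS Notes: some random stuff //p// followed by more stuff'
stray = 'MIS Notes: call and/or email //p//'   # made-up line with a lone slash

print(re.search(strict_pat, clean).group(1))   # p
print(re.search(strict_pat, stray))            # None: [^/]* cannot cross the lone slash
print(re.search(lazy_pat, stray).group(1))     # p
```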

Answered 2025-04-15 by Python大师
